Source code: https://github.com/djlofland/DS621_F2020_Group3/tree/master/Homework_4
Group 3 created a multiple linear regression model and a binary logistic regression model to estimate the probability of a driver having an auto accident, and the associated monetary damage, for the customer Khansari Auto Insurance. Because auto insurance, and insurance in general, relies on pooling individuals to mitigate risk, it is important to be able to predict accident rates to ensure funds are available for disbursement.
Given that the Index column had no impact on the target variable, it was dropped as part of the initial cleaning function. Additionally, the fields "INCOME", "HOME_VAL", "OLDCLAIM", and "BLUEBOOK" were imported as character strings with leading "$" symbols and were converted to numeric as part of the initial cleaning function. Both the training and evaluation datasets pass through this treatment.
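For reference, a minimal sketch of this cleaning step is shown below. It assumes the raw data frames are named train_raw and eval_raw and that the index column is literally named INDEX; the helper name and use of readr::parse_number are illustrative rather than the exact code in the repository.

# Sketch of the initial cleaning step described above (object names are assumptions)
library(dplyr)
library(readr)

clean_dataset <- function(df) {
  df %>%
    select(-INDEX) %>%                                # drop the index column
    mutate(across(c(INCOME, HOME_VAL, OLDCLAIM, BLUEBOOK),
                  ~ parse_number(as.character(.x))))  # strip "$"/"," and coerce to numeric
}

train_df <- clean_dataset(train_raw)
eval_df  <- clean_dataset(eval_raw)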
Now, with initial cleaning complete, we can take a quick look at the dimensions of both the training and evaluation datasets:
Dimensions of training dataset
## [1] 8161 25
Dimensions of evaluation dataset
## [1] 2141 25
It looks like we have 8,161 cases and 25 variables in the training dataset and 2,141 cases and 25 variables in the evaluation dataset.
We can also provide a summary of each variable and its theoretical effect on our models:
Variables of Interest
Next, we compiled summary statistics on our data set to better understand the data before modeling.
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS
## 0:6008 Min. : 0 Min. :0.0000 Min. :16.00 Min. :0.0000
## 1:2153 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00 1st Qu.:0.0000
## Median : 0 Median :0.0000 Median :45.00 Median :0.0000
## Mean : 1504 Mean :0.1711 Mean :44.79 Mean :0.7212
## 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00 3rd Qu.:1.0000
## Max. :107586 Max. :4.0000 Max. :81.00 Max. :5.0000
## NA's :6
## YOJ INCOME PARENT1 HOME_VAL
## Min. : 0.0 Min. : 0 Length:8161 Min. : 0
## 1st Qu.: 9.0 1st Qu.: 28097 Class :character 1st Qu.: 0
## Median :11.0 Median : 54028 Mode :character Median :161160
## Mean :10.5 Mean : 61898 Mean :154867
## 3rd Qu.:13.0 3rd Qu.: 85986 3rd Qu.:238724
## Max. :23.0 Max. :367030 Max. :885282
## NA's :454 NA's :445 NA's :464
## MSTATUS SEX EDUCATION JOB
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TRAVTIME CAR_USE BLUEBOOK TIF
## Min. : 5.00 Length:8161 Min. : 1500 Min. : 1.000
## 1st Qu.: 22.00 Class :character 1st Qu.: 9280 1st Qu.: 1.000
## Median : 33.00 Mode :character Median :14440 Median : 4.000
## Mean : 33.49 Mean :15710 Mean : 5.351
## 3rd Qu.: 44.00 3rd Qu.:20850 3rd Qu.: 7.000
## Max. :142.00 Max. :69740 Max. :25.000
##
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ
## Length:8161 Length:8161 Min. : 0 Min. :0.0000
## Class :character Class :character 1st Qu.: 0 1st Qu.:0.0000
## Mode :character Mode :character Median : 0 Median :0.0000
## Mean : 4037 Mean :0.7986
## 3rd Qu.: 4636 3rd Qu.:2.0000
## Max. :57037 Max. :5.0000
##
## REVOKED MVR_PTS CAR_AGE URBANICITY
## Length:8161 Min. : 0.000 Min. :-3.000 Length:8161
## Class :character 1st Qu.: 0.000 1st Qu.: 1.000 Class :character
## Mode :character Median : 1.000 Median : 8.000 Mode :character
## Mean : 1.696 Mean : 8.328
## 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :13.000 Max. :28.000
## NA's :510
Next, we wanted to get an idea of the distribution profiles for each of the variables. We have two target values, 0 and 1. When building models, we ideally want an equal representation of both classes. As class imbalance grows, our model performance will suffer both from the effects of differential variance between the classes and from bias towards picking the more represented class. For logistic regression, if we see a strong imbalance, we can 1) up-sample the smaller group, 2) down-sample the larger group, or 3) adjust our threshold for assigning the predicted class away from 0.5.
| TARGET_FLAG | Proportion |
|---|---|
| 0 | 0.7361843 |
| 1 | 0.2638157 |
The classes are not perfectly balanced, with approximately 73.6% 0’s and 26.4% 1’s. With unbalanced class distributions, it is often necessary to artificially balance the classes to achieve good results. Up-sampling or Down-sampling may be required to achieve class balance with this dataset. We will evaluate model performance accordingly.
Next, we visualize the distribution profiles for each of the predictor variables. This will help us make a plan for which variables to include, how they might be related to each other or the target, and finally identify outliers or transformations that might help improve model resolution.
The distribution profiles show the prevalence of skewness, specifically right skew, in the variables TRAVTIME, OLDCLAIM, MVR_PTS, TARGET_AMT, INCOME, and BLUEBOOK, and approximately normal distributions in YOJ, CAR_AGE, HOME_VAL, and AGE. When distributions deviate strongly from normal, this can be problematic for regression assumptions, and thus we might need to transform the data. For logistic regression, we will need to dummy-code the factor-based variables for the model to understand the data.
While we don't tackle feature engineering in this analysis, if we were performing a more in-depth analysis, we could leverage the mixtools package (see its R vignette). This package helps fit mixture models where the data can be subdivided into subgroups.
Lastly, several features have both a distribution along with a high number of values at an extreme. However, based on the feature meanings and provided information, there is no reason to believe that any of these extreme values are mistakes, data errors, or otherwise inexplicable. As such, we will not remove the extreme values, as they represent valuable data and could be predictive of the target.
In addition to creating histogram distributions, we also elected to use box-plots to get an idea of the spread of the response variable TARGET_AMT in relation to all of the non-numeric variables. Two sets of boxplots are shown below due to the wide distribution of the response variable. The first set of boxplots highlights the entire range and shows how the cost of car crashes peaks relative to the specific category.
The second set of box plots simply shows these same distributions “zoomed in” by adjusting the axis to allow for a visual of the interquartile range of the response variable relative to each of the categorical predictors.
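As a reference for how these plots can be generated, here is a short ggplot2 sketch; the choice of CAR_TYPE and the y-axis limit are illustrative rather than the exact settings used for the figures.

library(ggplot2)

# Full-range boxplot of TARGET_AMT for one categorical predictor
ggplot(train_df, aes(x = CAR_TYPE, y = TARGET_AMT)) +
  geom_boxplot()

# "Zoomed in" version: restrict the visible y-range without dropping points
ggplot(train_df, aes(x = CAR_TYPE, y = TARGET_AMT)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 10000))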
We also plotted scatter plots of each variable versus the target variable to get an idea of the relationships between them. There are some notable trends in the scatterplots below: for example, the response variable TARGET_AMT tends to be lower when individuals have more kids at home (HOMEKIDS) and when they have more teenagers driving the car (KIDSDRIV).
Additionally, a pairwise comparison plot between all features, both numeric and non-numeric, is shown following the scatterplots. It initially implies that there are not a significant number of correlated features, which gives some insight into the expected significance of the predictors and into performing dimensionality reduction on the datasets for the models.
Finally, we can observe the sparsity of information within our dataset by using the DataExplorer package to assess missing information.
We can see that generally, our dataset is in good shape, however, some imputation may be needed for INCOME, YOJ, HOME_VAL, and CAR_AGE.
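A minimal sketch of that check, assuming the cleaned training data is in train_df:

library(DataExplorer)

# Percentage of missing values per variable
plot_missing(train_df)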
To summarize our data preparation and exploration, we can distinguish our findings into a few categories below:
All the predictor variables have a low enough percentage of missing values that they can be imputed and still provide useful information to the model. As such, we chose to keep all the fields.
Missing values will be imputed with the preProcess function in the caret package:
library(caret)       # preProcess() / predict() for kNN imputation
library(naniar)      # miss_var_summary()
library(dplyr)
library(knitr)
library(kableExtra)

imputation <- preProcess(df, method = "knnImpute")
predict_imputer <- predict(imputation, df)

# compare to the missing-data chart above
missing2 <- predict_imputer %>%
  miss_var_summary()
kable(missing2 %>%
        filter(pct_miss > 0)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| variable | n_miss | pct_miss |
|---|---|---|

The table above is empty because, after imputation, no variable has any missing values remaining.
No outliers were removed as all values seemed reasonable.
Finally, as mentioned earlier in our data exploration, and our findings from our histogram plots, we can see that some of our variables are highly skewed. To address this, we decided to perform some transformations to make them more normally distributed. Here are some plots to demonstrate the changes in distributions before and after the transformations:
As we can see from above, our transformations helped to alleviate some of the skewness in the distributions that were the most heavily skewed. AGE_transform, BLUEBOOK_transform, CAR_AGE_transform, HOME_VAL_transform, INCOME_transform, TIF_transform, and TRAVTIME_transform are the result of Box-Cox transformations on the original variables, while TARGET_AMT_transform and OLDCLAIM_transform are the result of log transformations.
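As a sketch of how such transformations can be produced (the exact lambdas, column subsets, and log function used by the group are not shown in the output, so treat this as illustrative):

library(caret)

# Box-Cox lambdas for some of the skewed predictors; Box-Cox requires strictly
# positive values, so columns containing zeros or negatives need an offset first
bc <- preProcess(train_df[, c("BLUEBOOK", "TIF", "TRAVTIME")], method = "BoxCox")
train_transformed <- predict(bc, train_df)

# Log transforms for the zero-inflated dollar amounts (log1p handles zeros)
train_transformed$TARGET_AMT_transform <- log1p(train_df$TARGET_AMT)
train_transformed$OLDCLAIM_transform   <- log1p(train_df$OLDCLAIM)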
The first model built was based on the entire dataset. The linear models attempt to predict the feature "TARGET_AMT" using the existing features within the dataset except for "TARGET_FLAG", which is treated as unknown at prediction time.
First, we decided to split our cleaned dataset into a training and testing set (75% training, 25% testing). This was necessary because the provided holdout evaluation dataset does not include target values, so we cannot measure model performance against it. This split was used for all models.
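A sketch of this split is shown below; the seed, the stratification on TARGET_FLAG, and the use of caret::createDataPartition are assumptions rather than the exact code used.

library(caret)

set.seed(123)  # illustrative seed
# Stratified 75/25 split so both sets keep similar class proportions
train_idx <- createDataPartition(train_df$TARGET_FLAG, p = 0.75, list = FALSE)
training  <- train_df[train_idx, ]
testing   <- train_df[-train_idx, ]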
First linear regression model: no transformations, all predictors included (numeric and categorical)
##
## Call:
## stats::lm(formula = TARGET_AMT ~ AGE + BLUEBOOK + CAR_AGE + CLM_FREQ +
## HOME_VAL + HOMEKIDS + INCOME + KIDSDRIV + MVR_PTS + OLDCLAIM +
## TIF + TRAVTIME + YOJ + CAR_TYPE + CAR_USE + EDUCATION + JOB +
## MSTATUS + PARENT1 + RED_CAR + REVOKED + SEX + URBANICITY,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5586 -1697 -789 319 103372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.196e+03 6.610e+02 1.809 0.07046 .
## AGE 4.701e+00 8.397e+00 0.560 0.57562
## BLUEBOOK 1.192e-02 1.037e-02 1.150 0.25027
## CAR_AGE -4.595e+01 1.519e+01 -3.025 0.00250 **
## CLM_FREQ 1.216e+02 6.545e+01 1.858 0.06318 .
## HOME_VAL -1.122e-03 7.249e-04 -1.547 0.12190
## HOMEKIDS 8.651e+01 7.783e+01 1.112 0.26639
## INCOME -3.342e-03 2.268e-03 -1.474 0.14064
## KIDSDRIV 2.970e+02 1.350e+02 2.199 0.02791 *
## MVR_PTS 1.796e+02 3.091e+01 5.809 6.59e-09 ***
## OLDCLAIM -1.154e-02 8.826e-03 -1.308 0.19105
## TIF -5.958e+01 1.454e+01 -4.097 4.24e-05 ***
## TRAVTIME 1.116e+01 3.831e+00 2.912 0.00360 **
## YOJ -1.435e+01 1.792e+01 -0.801 0.42319
## CAR_TYPEPanel Truck 1.829e+02 3.302e+02 0.554 0.57967
## CAR_TYPEPickup 2.429e+02 2.025e+02 1.200 0.23026
## CAR_TYPESports Car 1.053e+03 2.594e+02 4.059 4.98e-05 ***
## CAR_TYPEVan 7.123e+02 2.518e+02 2.829 0.00468 **
## CAR_TYPEz_SUV 6.835e+02 2.146e+02 3.185 0.00145 **
## CAR_USEPrivate -7.580e+02 1.942e+02 -3.904 9.56e-05 ***
## EDUCATIONBachelors 2.072e+00 2.426e+02 0.009 0.99319
## EDUCATIONMasters 3.645e+02 3.570e+02 1.021 0.30724
## EDUCATIONPhD 6.416e+02 4.219e+02 1.521 0.12842
## EDUCATIONz_High School 1.805e+01 2.045e+02 0.088 0.92966
## JOBClerical 6.944e+02 4.050e+02 1.715 0.08645 .
## JOBDoctor -3.627e+02 4.833e+02 -0.751 0.45297
## JOBHome Maker 4.854e+02 4.339e+02 1.119 0.26329
## JOBLawyer 2.991e+02 3.513e+02 0.851 0.39454
## JOBManager -2.915e+02 3.422e+02 -0.852 0.39434
## JOBProfessional 6.865e+02 3.658e+02 1.877 0.06058 .
## JOBStudent 4.244e+02 4.461e+02 0.951 0.34144
## JOBz_Blue Collar 8.108e+02 3.815e+02 2.125 0.03362 *
## MSTATUSz_No 5.423e+02 1.747e+02 3.104 0.00191 **
## PARENT1Yes 3.826e+02 2.400e+02 1.594 0.11100
## RED_CARyes -1.143e+02 1.786e+02 -0.640 0.52221
## REVOKEDYes 4.836e+02 2.052e+02 2.357 0.01843 *
## SEXz_F -3.626e+02 2.194e+02 -1.653 0.09846 .
## URBANICITYz_Highly Rural/ Rural -1.656e+03 1.658e+02 -9.989 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4675 on 6083 degrees of freedom
## Multiple R-squared: 0.06739, Adjusted R-squared: 0.06172
## F-statistic: 11.88 on 37 and 6083 DF, p-value: < 2.2e-16
Results of the first model show that using all non-transformed data does not fully satisfy the assumptions of a linear regression model. We can also see that there are several predictors that are not statistically significant, but before we discount any features, we will use the transformed data to better satisfy the conditions of linear regression. Our R2 was 0.067 and our residual standard error was 4675, which is indicative of a very poor model.
The second model was also built on the entire dataset, but various features were transformed as discussed in the Data Preparation section of this assignment. With the variables transformed via Box-Cox and log transformations, linear regression was applied to the data to assess the predictability of our response variable.
##
## Call:
## stats::lm(formula = TARGET_AMT_transform ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8072 -2.3239 -0.9094 1.9629 10.4832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6525701 0.5896527 6.194 6.23e-10 ***
## AGE 0.0022763 0.0058109 0.392 0.695268
## BLUEBOOK_transform -0.0450465 0.0135550 -3.323 0.000895 ***
## CAR_AGE_transform -0.0436398 0.0260314 -1.676 0.093706 .
## CAR_TYPEPanel Truck 0.2696825 0.2176776 1.239 0.215427
## CAR_TYPEPickup 0.5349225 0.1397707 3.827 0.000131 ***
## CAR_TYPESports Car 1.0611065 0.1783942 5.948 2.86e-09 ***
## CAR_TYPEVan 0.6540534 0.1734607 3.771 0.000164 ***
## CAR_TYPEz_SUV 0.7966993 0.1452430 5.485 4.29e-08 ***
## CAR_USEPrivate -0.9603331 0.1341355 -7.159 9.06e-13 ***
## CLM_FREQ 0.1265979 0.0706179 1.793 0.073068 .
## EDUCATIONBachelors -0.2608015 0.1689184 -1.544 0.122653
## EDUCATIONMasters -0.0405783 0.2415543 -0.168 0.866598
## EDUCATIONPhD 0.0610753 0.2802850 0.218 0.827511
## EDUCATIONz_High School 0.1607296 0.1416156 1.135 0.256432
## HOME_VAL_transform -0.0004504 0.0001237 -3.642 0.000272 ***
## HOMEKIDS 0.0768615 0.0540685 1.422 0.155206
## INCOME_transform -0.0149842 0.0040164 -3.731 0.000193 ***
## JOBClerical 0.7530586 0.2775453 2.713 0.006681 **
## JOBDoctor -0.2722778 0.3337705 -0.816 0.414667
## JOBHome Maker 0.2315929 0.3150573 0.735 0.462318
## JOBLawyer 0.2394928 0.2422125 0.989 0.322814
## JOBManager -0.4610222 0.2360851 -1.953 0.050891 .
## JOBProfessional 0.4794468 0.2521012 1.902 0.057244 .
## JOBStudent 0.1418251 0.3272668 0.433 0.664767
## JOBz_Blue Collar 0.7664547 0.2625805 2.919 0.003525 **
## KIDSDRIV 0.5033165 0.0933104 5.394 7.15e-08 ***
## MSTATUSz_No 0.5541900 0.1238162 4.476 7.75e-06 ***
## MVR_PTS 0.1803303 0.0219374 8.220 2.46e-16 ***
## OLDCLAIM_transform 0.0152553 0.0198498 0.769 0.442198
## PARENT1Yes 0.6159434 0.1655566 3.720 0.000201 ***
## RED_CARyes -0.0485750 0.1233161 -0.394 0.693664
## REVOKEDYes 1.0131055 0.1286314 7.876 3.97e-15 ***
## SEXz_F -0.1151818 0.1485014 -0.776 0.437999
## TIF_transform -0.3134415 0.0387586 -8.087 7.32e-16 ***
## TRAVTIME_transform 0.0983533 0.0165498 5.943 2.96e-09 ***
## URBANICITYz_Highly Rural/ Rural -2.3793240 0.1150897 -20.674 < 2e-16 ***
## YOJ -0.0049852 0.0142641 -0.349 0.726729
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.229 on 6083 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.222
## F-statistic: 48.21 on 37 and 6083 DF, p-value: < 2.2e-16
Results of this model show that we better satisfy the requirements of linear regression, with residuals scattered around 0 and a roughly normal Q-Q plot. There are still several non-significant variables, such as OLDCLAIM, SEXz_F, YOJ, JOBDoctor, JOBLawyer, JOBStudent, and some CAR_TYPE levels. Our R2 for this model is 0.23 with a residual standard error of 3.2 (on the transformed scale), better than model 1.
The third model attempts to improve on model 2 by removing selected features that were not significant in that model. In addition to this feature selection, using the tidymodels framework, we removed any features that are highly correlated with each other as well as any zero-variance features. Only numeric features were considered in this model.
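A minimal tidymodels sketch of this preprocessing is shown below; the numeric-only training frame training_numeric and the 0.9 correlation cutoff are assumptions, since the actual threshold is not shown in the output.

library(tidymodels)

# Drop zero-variance predictors and highly correlated numeric predictors,
# then fit the reduced linear model through a workflow
rec <- recipe(TARGET_AMT_transform ~ ., data = training_numeric) %>%
  step_zv(all_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)

lm3_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = training_numeric)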
##
## Call:
## stats::lm(formula = TARGET_AMT_transform ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.875 -2.296 -1.261 2.823 10.407
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.593e+00 3.656e-01 9.826 < 2e-16 ***
## BLUEBOOK_transform -4.961e-02 1.099e-02 -4.514 6.49e-06 ***
## CAR_AGE_transform -1.095e-01 2.121e-02 -5.162 2.52e-07 ***
## HOME_VAL_transform -9.218e-04 9.478e-05 -9.726 < 2e-16 ***
## HOMEKIDS 1.564e-01 4.534e-02 3.450 0.000564 ***
## KIDSDRIV 4.500e-01 9.747e-02 4.617 3.97e-06 ***
## MVR_PTS 2.262e-01 2.317e-02 9.764 < 2e-16 ***
## OLDCLAIM_transform 1.235e-01 1.151e-02 10.723 < 2e-16 ***
## TIF_transform -3.069e-01 4.112e-02 -7.462 9.68e-14 ***
## TRAVTIME_transform 5.690e-02 1.733e-02 3.283 0.001034 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.436 on 6111 degrees of freedom
## Multiple R-squared: 0.1208, Adjusted R-squared: 0.1195
## F-statistic: 93.28 on 9 and 6111 DF, p-value: < 2.2e-16
This model did not perform as well as linear model 2. This is likely due to the omission of features that were not numeric. The R2 for this model was 0.12 with an RMSE of 3.43.
The next set of models built are logistic regression models. These predict the outcome of a categorical variable by estimating the probability that a particular class will occur. In this case, we are performing a binary classification of the feature "TARGET_FLAG", where the outcome is 0 or 1.
Similar to the linear regression models, the first logistic model was based on the entire dataset. The logistic models attempt to predict the feature "TARGET_FLAG" using the existing features within the dataset except for "TARGET_AMT", which is treated as unknown at prediction time.
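As a sketch of the general approach (object names are illustrative), the model is fit with glm() and the predicted probabilities are converted to a 0/1 class using the default 0.5 cutoff discussed earlier:

# Fit the binary logistic model and convert probabilities to classes
logit1 <- glm(TARGET_FLAG ~ . - TARGET_AMT, family = binomial, data = training)

probs <- predict(logit1, newdata = testing, type = "response")  # P(TARGET_FLAG = 1)
preds <- ifelse(probs > 0.5, 1, 0)                              # default 0.5 threshold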
##
## Call:
## stats::glm(formula = TARGET_FLAG ~ BLUEBOOK + CAR_AGE + HOME_VAL +
## INCOME + PARENT1 + HOME_VAL + MSTATUS + SEX + EDUCATION +
## JOB + TRAVTIME + CAR_USE + KIDSDRIV + AGE + HOMEKIDS + YOJ +
## CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS +
## CAR_AGE + URBANICITY, family = stats::binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4579 -0.7228 -0.4096 0.6134 2.9266
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.365e+00 3.650e-01 -3.739 0.000184 ***
## BLUEBOOK -2.021e-05 6.103e-06 -3.311 0.000930 ***
## CAR_AGE -1.335e-02 8.635e-03 -1.546 0.122161
## HOME_VAL -1.462e-06 4.007e-07 -3.648 0.000265 ***
## INCOME -2.416e-06 1.282e-06 -1.885 0.059487 .
## PARENT1Yes 3.827e-01 1.258e-01 3.042 0.002348 **
## MSTATUSz_No 4.646e-01 9.715e-02 4.783 1.73e-06 ***
## SEXz_F -6.515e-02 1.284e-01 -0.508 0.611789
## EDUCATIONBachelors -2.591e-01 1.318e-01 -1.967 0.049233 *
## EDUCATIONMasters -1.289e-01 2.043e-01 -0.631 0.528268
## EDUCATIONPhD 6.120e-02 2.420e-01 0.253 0.800332
## EDUCATIONz_High School 6.478e-02 1.092e-01 0.593 0.552861
## JOBClerical 4.823e-01 2.264e-01 2.130 0.033133 *
## JOBDoctor -3.475e-01 3.005e-01 -1.157 0.247454
## JOBHome Maker 2.940e-01 2.412e-01 1.219 0.222907
## JOBLawyer 1.357e-01 1.983e-01 0.684 0.493786
## JOBManager -4.817e-01 1.989e-01 -2.422 0.015436 *
## JOBProfessional 2.325e-01 2.062e-01 1.128 0.259475
## JOBStudent 2.845e-01 2.477e-01 1.149 0.250735
## JOBz_Blue Collar 4.082e-01 2.131e-01 1.916 0.055399 .
## TRAVTIME 1.346e-02 2.154e-03 6.247 4.17e-10 ***
## CAR_USEPrivate -7.212e-01 1.049e-01 -6.878 6.08e-12 ***
## KIDSDRIV 3.824e-01 7.017e-02 5.450 5.04e-08 ***
## AGE 2.761e-03 4.594e-03 0.601 0.547887
## HOMEKIDS 7.568e-02 4.260e-02 1.777 0.075613 .
## YOJ -2.135e-02 9.811e-03 -2.176 0.029533 *
## CAR_TYPEPanel Truck 4.137e-01 1.868e-01 2.214 0.026801 *
## CAR_TYPEPickup 4.970e-01 1.150e-01 4.321 1.55e-05 ***
## CAR_TYPESports Car 9.410e-01 1.483e-01 6.344 2.24e-10 ***
## CAR_TYPEVan 6.382e-01 1.440e-01 4.433 9.29e-06 ***
## CAR_TYPEz_SUV 6.840e-01 1.276e-01 5.360 8.30e-08 ***
## RED_CARyes -2.742e-02 1.003e-01 -0.273 0.784523
## OLDCLAIM -1.256e-05 4.468e-06 -2.812 0.004927 **
## CLM_FREQ 1.808e-01 3.295e-02 5.489 4.04e-08 ***
## REVOKEDYes 8.748e-01 1.040e-01 8.409 < 2e-16 ***
## MVR_PTS 1.241e-01 1.573e-02 7.889 3.04e-15 ***
## URBANICITYz_Highly Rural/ Rural -2.292e+00 1.280e-01 -17.912 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7037.3 on 6120 degrees of freedom
## Residual deviance: 5529.6 on 6084 degrees of freedom
## AIC: 5603.6
##
## Number of Fisher Scoring iterations: 5
The results of the first logistic regression model on the untransformed data show an AIC of 5603.6, with several insignificant predictors. We can use this information on the p-values to assess which features can be removed from the model.
In the second logistic regression model, we remove variables with low variance, as well as variables that may exhibit collinearity. Additionally, 4 numeric features from our training set were binned into groups and used as predictors in lieu of the numeric predictors. The plot below shows the 4 features that were chosen to be binned within the analysis.
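A sketch of this binning step is shown below. cut() produces interval-labelled factor levels such as "(36,56]", matching the coefficient names in the summary that follows; the exact break points and the training frame name are assumptions.

library(dplyr)

# Bin selected numeric predictors into interval-labelled factors
training_binned <- training_transformed %>%
  mutate(
    AGE                = cut(AGE, breaks = 4),
    BLUEBOOK_transform = cut(BLUEBOOK_transform, breaks = 4),
    CAR_AGE_transform  = cut(CAR_AGE_transform, breaks = 8),
    TRAVTIME_transform = cut(TRAVTIME_transform, breaks = 5)
  )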
##
## Call:
## stats::glm(formula = TARGET_FLAG ~ ., family = stats::binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3281 -0.6980 -0.3924 0.5596 2.8649
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.366e+01 5.354e+02 0.026 0.979645
## AGE(36,56] -2.731e-01 9.812e-02 -2.783 0.005379 **
## AGE(56,76] 6.111e-01 1.473e-01 4.147 3.36e-05 ***
## AGE(76,80] -1.030e+01 5.354e+02 -0.019 0.984652
## BLUEBOOK_transform(25.3,33.8] -3.900e-01 9.400e-02 -4.149 3.35e-05 ***
## BLUEBOOK_transform(33.8,42.3] -7.578e-01 1.612e-01 -4.701 2.60e-06 ***
## BLUEBOOK_transform(42.3,42.5] -1.236e+01 5.354e+02 -0.023 0.981587
## CAR_AGE_transform(-3.53,-1.47] -2.494e+01 6.127e+02 -0.041 0.967532
## CAR_AGE_transform(-1.47,0.601] -1.373e+01 5.354e+02 -0.026 0.979539
## CAR_AGE_transform(0.601,2.67] -1.382e+01 5.354e+02 -0.026 0.979400
## CAR_AGE_transform(2.67,4.73] -1.382e+01 5.354e+02 -0.026 0.979406
## CAR_AGE_transform(4.73,6.8] -1.396e+01 5.354e+02 -0.026 0.979206
## CAR_AGE_transform(6.8,8.27] -1.458e+01 5.354e+02 -0.027 0.978268
## CAR_TYPEPanel Truck 4.107e-01 1.842e-01 2.229 0.025817 *
## CAR_TYPEPickup 5.367e-01 1.168e-01 4.596 4.32e-06 ***
## CAR_TYPESports Car 8.421e-01 1.450e-01 5.809 6.29e-09 ***
## CAR_TYPEVan 6.090e-01 1.420e-01 4.287 1.81e-05 ***
## CAR_TYPEz_SUV 6.551e-01 1.224e-01 5.351 8.76e-08 ***
## CAR_USEPrivate -7.280e-01 1.065e-01 -6.834 8.27e-12 ***
## CLM_FREQ 8.306e-02 5.120e-02 1.622 0.104750
## EDUCATIONBachelors -2.377e-01 1.343e-01 -1.770 0.076775 .
## EDUCATIONMasters -1.111e-01 2.062e-01 -0.539 0.589842
## EDUCATIONPhD -1.231e-02 2.383e-01 -0.052 0.958792
## EDUCATIONz_High School 7.617e-02 1.123e-01 0.678 0.497690
## HOME_VAL_transform -3.941e-04 9.860e-05 -3.997 6.41e-05 ***
## HOMEKIDS 4.233e-02 4.299e-02 0.985 0.324831
## INCOME_transform -1.242e-02 3.261e-03 -3.810 0.000139 ***
## JOBClerical 4.536e-01 2.283e-01 1.987 0.046921 *
## JOBDoctor -4.580e-01 3.031e-01 -1.511 0.130741
## JOBHome Maker -5.740e-02 2.617e-01 -0.219 0.826377
## JOBLawyer 7.345e-02 2.004e-01 0.367 0.713956
## JOBManager -5.383e-01 2.005e-01 -2.685 0.007260 **
## JOBProfessional 2.091e-01 2.087e-01 1.002 0.316428
## JOBStudent -1.693e-01 2.681e-01 -0.631 0.527756
## JOBz_Blue Collar 4.076e-01 2.158e-01 1.889 0.058862 .
## KIDSDRIV 4.718e-01 7.438e-02 6.343 2.25e-10 ***
## MSTATUSz_No 4.923e-01 1.018e-01 4.836 1.33e-06 ***
## MVR_PTS 1.019e-01 1.651e-02 6.171 6.80e-10 ***
## OLDCLAIM_transform 1.957e-02 1.460e-02 1.340 0.180259
## PARENT1Yes 3.360e-01 1.275e-01 2.636 0.008392 **
## RED_CARyes -3.336e-02 1.021e-01 -0.327 0.743885
## REVOKEDYes 6.809e-01 9.506e-02 7.163 7.92e-13 ***
## SEXz_F -2.551e-02 1.235e-01 -0.206 0.836440
## TIF_transform -2.408e-01 3.181e-02 -7.572 3.68e-14 ***
## TRAVTIME_transform(6.04,9.7] 2.715e-01 9.675e-02 2.806 0.005015 **
## TRAVTIME_transform(9.7,13.4] 6.150e-01 1.039e-01 5.917 3.28e-09 ***
## TRAVTIME_transform(13.4,17] 1.988e-01 3.811e-01 0.522 0.601996
## TRAVTIME_transform(17,18.3] -9.893e+00 3.637e+02 -0.027 0.978298
## URBANICITYz_Highly Rural/ Rural -2.284e+00 1.289e-01 -17.721 < 2e-16 ***
## YOJ -4.063e-03 1.162e-02 -0.350 0.726558
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7037.3 on 6120 degrees of freedom
## Residual deviance: 5386.6 on 6071 degrees of freedom
## AIC: 5486.6
##
## Number of Fisher Scoring iterations: 12
With this, the logistic model was run and the results show an AIC of 5486.6, lower than the first model, which implies that this model is better at predicting the classification for TARGET_FLAG.
One of the measures to consider when running classification models is the distribution of the binary classes. In this case, the dataset is innately imbalanced, with approximately 73% of values showing 0 and 27% of values showing 1.
##
## 0 1
## 0.7361843 0.2638157
Because of this class imbalance, if we were to simply guess that all values are 0, that would technically yield a 73% accurate model. To account for this, we will down-sample the majority class in the final logistic model. This down-sampling brings the class distribution of the training set to 50/50; the testing dataset will remain as is.
##
## 0 1
## 0.5 0.5
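The 50/50 distribution shown above can be produced with caret's downSample(); the seed and the object names here are illustrative.

library(caret)
library(dplyr)

set.seed(123)  # illustrative seed
# Randomly drop rows from the majority class (TARGET_FLAG = 0) until classes match
down_train <- downSample(x = select(training_binned, -TARGET_FLAG),
                         y = training_binned$TARGET_FLAG,
                         yname = "TARGET_FLAG")

prop.table(table(down_train$TARGET_FLAG))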
##
## Call:
## stats::glm(formula = TARGET_FLAG ~ ., family = stats::binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.82953 -0.83526 0.00015 0.82576 2.50218
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.378e+01 5.354e+02 0.026 0.979460
## AGE(36,56] -2.449e-01 1.276e-01 -1.919 0.054974 .
## AGE(56,76] 7.372e-01 1.942e-01 3.796 0.000147 ***
## BLUEBOOK_transform(25.3,33.8] -3.761e-01 1.208e-01 -3.115 0.001841 **
## BLUEBOOK_transform(33.8,42.3] -6.091e-01 2.048e-01 -2.974 0.002938 **
## CAR_AGE_transform(-3.53,-1.47] -2.436e+01 6.527e+02 -0.037 0.970231
## CAR_AGE_transform(-1.47,0.601] -1.285e+01 5.354e+02 -0.024 0.980847
## CAR_AGE_transform(0.601,2.67] -1.300e+01 5.354e+02 -0.024 0.980634
## CAR_AGE_transform(2.67,4.73] -1.289e+01 5.354e+02 -0.024 0.980786
## CAR_AGE_transform(4.73,6.8] -1.294e+01 5.354e+02 -0.024 0.980713
## CAR_AGE_transform(6.8,8.27] -1.358e+01 5.354e+02 -0.025 0.979771
## CAR_TYPEPanel Truck 2.423e-01 2.348e-01 1.032 0.302085
## CAR_TYPEPickup 4.567e-01 1.466e-01 3.114 0.001843 **
## CAR_TYPESports Car 9.331e-01 1.826e-01 5.111 3.21e-07 ***
## CAR_TYPEVan 6.007e-01 1.763e-01 3.407 0.000656 ***
## CAR_TYPEz_SUV 8.021e-01 1.522e-01 5.271 1.36e-07 ***
## CAR_USEPrivate -8.733e-01 1.372e-01 -6.363 1.97e-10 ***
## CLM_FREQ 4.885e-02 6.803e-02 0.718 0.472684
## EDUCATIONBachelors -1.693e-01 1.698e-01 -0.997 0.318597
## EDUCATIONMasters 1.281e-01 2.597e-01 0.493 0.621944
## EDUCATIONPhD 1.612e-01 3.065e-01 0.526 0.598818
## EDUCATIONz_High School 2.225e-01 1.431e-01 1.554 0.120126
## HOME_VAL_transform -4.454e-04 1.260e-04 -3.536 0.000406 ***
## HOMEKIDS 4.943e-02 5.482e-02 0.902 0.367285
## INCOME_transform -1.497e-02 4.155e-03 -3.602 0.000316 ***
## JOBClerical 6.519e-01 2.832e-01 2.302 0.021346 *
## JOBDoctor -5.432e-01 3.608e-01 -1.505 0.132216
## JOBHome Maker 1.465e-01 3.209e-01 0.457 0.647987
## JOBLawyer -1.019e-02 2.452e-01 -0.042 0.966858
## JOBManager -3.502e-01 2.443e-01 -1.434 0.151701
## JOBProfessional 4.409e-01 2.617e-01 1.685 0.091998 .
## JOBStudent 7.238e-02 3.339e-01 0.217 0.828374
## JOBz_Blue Collar 5.588e-01 2.698e-01 2.071 0.038381 *
## KIDSDRIV 4.812e-01 1.004e-01 4.795 1.63e-06 ***
## MSTATUSz_No 6.108e-01 1.279e-01 4.775 1.80e-06 ***
## MVR_PTS 1.097e-01 2.192e-02 5.006 5.57e-07 ***
## OLDCLAIM_transform 3.433e-02 1.920e-02 1.788 0.073721 .
## PARENT1Yes 3.218e-01 1.660e-01 1.938 0.052587 .
## RED_CARyes -4.719e-02 1.295e-01 -0.364 0.715608
## REVOKEDYes 6.927e-01 1.284e-01 5.396 6.82e-08 ***
## SEXz_F -1.323e-01 1.556e-01 -0.850 0.395267
## TIF_transform -2.958e-01 4.089e-02 -7.235 4.67e-13 ***
## TRAVTIME_transform(6.04,9.7] 2.548e-01 1.221e-01 2.086 0.036937 *
## TRAVTIME_transform(9.7,13.4] 5.980e-01 1.318e-01 4.537 5.71e-06 ***
## TRAVTIME_transform(13.4,17] 3.339e-01 4.397e-01 0.759 0.447682
## URBANICITYz_Highly Rural/ Rural -2.310e+00 1.496e-01 -15.442 < 2e-16 ***
## YOJ -6.071e-03 1.495e-02 -0.406 0.684613
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4441.7 on 3203 degrees of freedom
## Residual deviance: 3256.3 on 3157 degrees of freedom
## AIC: 3350.3
##
## Number of Fisher Scoring iterations: 12
With this we end up with a model that has an AIC score of 3350, significantly lower than that of the first two logistic models, suggesting a better fit (though note that this model was fit on the smaller down-sampled training set, so its AIC is not strictly comparable to the first two). Additional metrics on each model are presented in the Model Performance Summary at the end of this document.
Finally, to tune the linear regression model, we added the predictions made by our logistic model for the feature TARGET_FLAG; with this, an additional predictor feature was available to the linear model.
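A sketch of this stacking step is shown below, assuming the fitted down-sampled logistic model is named logit3 and that the transformed and binned training frames line up row for row; the column name preds matches the coefficient in the summary that follows, but the rest is illustrative.

library(dplyr)

# Add the logistic model's predicted TARGET_FLAG as an extra predictor column
training_lm4 <- training_transformed %>%
  mutate(preds = as.numeric(
    predict(logit3, newdata = training_binned, type = "response") > 0.5)) %>%
  select(-TARGET_FLAG)   # the true flag stays out of the linear model

lm4 <- lm(TARGET_AMT_transform ~ ., data = training_lm4)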
##
## Call:
## stats::lm(formula = TARGET_AMT_transform ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.727 -1.984 -0.934 1.376 10.497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.246e+00 6.105e-01 2.040 0.041354 *
## AGE -2.257e-03 5.743e-03 -0.393 0.694378
## BLUEBOOK_transform -3.672e-02 1.339e-02 -2.742 0.006116 **
## CAR_AGE_transform -3.407e-02 2.569e-02 -1.326 0.184855
## CAR_TYPEPanel Truck 2.849e-01 2.147e-01 1.326 0.184727
## CAR_TYPEPickup 4.113e-01 1.382e-01 2.976 0.002932 **
## CAR_TYPESports Car 7.874e-01 1.772e-01 4.442 9.06e-06 ***
## CAR_TYPEVan 5.357e-01 1.714e-01 3.126 0.001780 **
## CAR_TYPEz_SUV 5.735e-01 1.443e-01 3.974 7.15e-05 ***
## CAR_USEPrivate -6.676e-01 1.342e-01 -4.973 6.76e-07 ***
## CLM_FREQ 7.363e-02 6.978e-02 1.055 0.291406
## EDUCATIONBachelors -2.035e-01 1.667e-01 -1.221 0.222189
## EDUCATIONMasters -6.280e-02 2.383e-01 -0.264 0.792147
## EDUCATIONPhD 2.707e-02 2.765e-01 0.098 0.922005
## EDUCATIONz_High School 9.169e-02 1.398e-01 0.656 0.511972
## HOME_VAL_transform -3.179e-04 1.224e-04 -2.597 0.009425 **
## HOMEKIDS 5.050e-02 5.338e-02 0.946 0.344149
## INCOME_transform -9.809e-03 3.982e-03 -2.463 0.013799 *
## JOBClerical 4.233e-01 2.750e-01 1.539 0.123779
## JOBDoctor -3.403e-01 3.293e-01 -1.033 0.301539
## JOBHome Maker 1.046e-01 3.110e-01 0.336 0.736672
## JOBLawyer 1.100e-01 2.392e-01 0.460 0.645415
## JOBManager -4.648e-01 2.329e-01 -1.996 0.045990 *
## JOBProfessional 2.828e-01 2.492e-01 1.135 0.256480
## JOBStudent 1.505e-03 3.230e-01 0.005 0.996283
## JOBz_Blue Collar 4.361e-01 2.603e-01 1.675 0.093914 .
## KIDSDRIV 3.460e-01 9.285e-02 3.727 0.000196 ***
## MSTATUSz_No 4.229e-01 1.226e-01 3.450 0.000564 ***
## MVR_PTS 1.088e-01 2.233e-02 4.872 1.13e-06 ***
## OLDCLAIM_transform 2.079e-02 1.959e-02 1.062 0.288441
## PARENT1Yes 3.327e-01 1.648e-01 2.019 0.043492 *
## RED_CARyes -1.425e-02 1.217e-01 -0.117 0.906781
## REVOKEDYes 6.398e-01 1.301e-01 4.917 9.00e-07 ***
## SEXz_F -4.570e-02 1.466e-01 -0.312 0.755234
## TIF_transform -2.154e-01 3.897e-02 -5.528 3.38e-08 ***
## TRAVTIME_transform 7.394e-02 1.643e-02 4.499 6.96e-06 ***
## URBANICITYz_Highly Rural/ Rural -1.861e+00 1.204e-01 -15.463 < 2e-16 ***
## YOJ 2.166e-05 1.408e-02 0.002 0.998773
## preds 1.954e+00 1.505e-01 12.980 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.186 on 6082 degrees of freedom
## Multiple R-squared: 0.2476, Adjusted R-squared: 0.2429
## F-statistic: 52.67 on 38 and 6082 DF, p-value: < 2.2e-16
After using the results of the optimal logistic regression model to predict TARGET_FLAG and passing that prediction into the linear regression, the R2 for the model increases to 0.2476 with a residual standard error of 3.186.
Finally, we can view the results of each of the linear regression models. The results for both the training and testing datasets are plotted below. There are many data points where TARGET_FLAG is 0, which leads to regression plots that may be weighted accordingly; omitting the points with 0 does not improve the model.
The results for both the training and testing datasets for each of the models are presented below. The linear model that performed best at predicting our response variable TARGET_AMT is linear model 4. This model uses all predictor variables, with transformations applied to normalize numeric variables that did not follow Gaussian distributions. Additionally, the predictor variable TARGET_FLAG was included via prediction from the logistic regression model. Although this model is the best of the four, it may not be sufficient in real life, as the R2 implies that the model explains only about 24% of the variance within the dataset. The results were consistent between the testing and training datasets, implying that minimal overfitting/underfitting occurred. (Note that LinearModel1 metrics are on the original dollar scale, while models 2 through 4 are on the transformed scale, so the RMSE and MAE values are not directly comparable across those rows.)
| model | set | rmse | rsq | mae |
|---|---|---|---|---|
| LinearModel1 | Training Set | 4660.736206 | 0.0673940 | 1994.766874 |
| LinearModel2 | Training Set | 3.219356 | 0.2267470 | 2.606465 |
| LinearModel3 | Training Set | 3.432846 | 0.1207904 | 2.813728 |
| LinearModel4 | Training Set | 3.175672 | 0.2475892 | 2.460059 |
| model | set | rmse | rsq | mae |
|---|---|---|---|---|
| LinearModel1 | Testing Set | 4138.492195 | 0.0803887 | 1919.310196 |
| LinearModel2 | Testing Set | 3.243748 | 0.2326278 | 2.621742 |
| LinearModel3 | Testing Set | 3.484797 | 0.1132622 | 2.874374 |
| LinearModel4 | Testing Set | 3.218473 | 0.2448160 | 2.488462 |
Results of the logistic regression models are presented below for both the training and testing datasets. The model that performed best among these classification models is logistic model 3. Although the accuracy of logistic models 1 and 2 appears higher, the metric of importance here is the kappa. Kappa is a measure similar to accuracy, but it is normalized by the accuracy that would be expected by chance alone, which makes it useful given the class imbalance present in this dataset. Due to the under-sampling of the majority class for this model, the chance-corrected accuracy (kap) was best under model 3. The results of these models were consistent between the testing and training datasets, implying minimal overfitting/underfitting and the ability of the model to generalize well.
| model | set | accuracy | kap |
|---|---|---|---|
| LogisticModel1 | Training Set | 0.7907205 | 0.3775704 |
| LogisticModel2 | Training Set | 0.7998693 | 0.4142744 |
| LogisticModel3 | Training Set | 0.7506242 | 0.5012484 |
| model | set | accuracy | kap |
|---|---|---|---|
| LogisticModel1 | Testing Set | 0.7887255 | 0.3842650 |
| LogisticModel2 | Testing Set | 0.7855741 | 0.3788016 |
| LogisticModel3 | Testing Set | 0.7262022 | 0.4075439 |
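For reference, the accuracy and kappa values in these tables can be computed with yardstick (whose kappa metric is named kap, matching the column header above); the column and object names here are illustrative.

library(dplyr)
library(yardstick)

# Class predictions vs. true labels for one model on the testing set
results <- tibble(
  truth    = factor(testing$TARGET_FLAG, levels = c("0", "1")),
  estimate = factor(test_preds,          levels = c("0", "1"))
)

# Accuracy and Cohen's kappa in one call
metric_set(accuracy, kap)(results, truth = truth, estimate = estimate)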
The final thing we will take a look at is the variable importance for each model.
The figure presented below shows the feature importance in each model; note that not all features are presented, only the top 6 from each. Among all the models, with the exception of linear model 3 (which did not consider categorical features), the urbanicity of where a driver lives was the best predictor of whether someone was in a car crash and whether they would file claims against their insurance accordingly: drivers in highly urban areas were far more likely to crash than those in highly rural areas. Motor vehicle record points were also a consistently important feature, showing that people with a history of violations tend to be more likely to get into a car accident. This is consistent with how car insurance plans are priced, with higher premiums charged in urban areas like NYC relative to the suburbs.
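A sketch of how the top-6 importance values per model can be extracted is shown below; caret::varImp is assumed, and the model object name lm4 is illustrative.

library(caret)
library(dplyr)

# Absolute t-statistics as importance scores for a fitted lm
varImp(lm4) %>%
  tibble::rownames_to_column("feature") %>%
  top_n(6) %>%            # with no weight supplied, top_n() selects by the last column ("Overall")
  arrange(desc(Overall))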