Overview

In this homework assignment, we explored, analyzed and modeled a dataset containing approximately 8000 records representing customers at an auto insurance company.

We built multiple linear regression models on the continuous variable TARGET_AMT to predict the associated costs of a crash, and binary logistic regression models on the boolean variable TARGET_FLAG to predict the probability that a person will crash their car.

We built several models using different techniques and variable selection. To best assess our predictive models, we created a hold-out validation set from our training data, before making predictions on a separate evaluation dataset that did not contain the targets.


1. Data Exploration

The training dataset contained 8161 observations of 26 variables (including an index and the two response variables), where each record represented an individual customer. These variables included demographic measures such as age and gender, socioeconomic measures such as education and household income, and vehicle-specific metrics such as car type, age, and assessed value.

The evaluation dataset contained 2141 observations over the same predictor variables.

1.1 Summary Statistics

Each record in the training dataset also included two response variables. The first response variable, TARGET_FLAG, was a boolean where “1” indicated the person was in a car crash. The second response variable, TARGET_AMT, was a numeric value indicating the cost if a car crash occurred; this value was zero if the person did not crash their car.

A sample of the training data appears in the following table:

While exploring this data, we made the following observations:

  • 14 variables were categorical, and 12 were numeric.
  • There was no missing data for character variables.
  • Some currency variables were strings with ‘$’ symbols instead of numerics (OLDCLAIM, BLUEBOOK).
  • Some character variables included prefixes that could be removed for readability (EDUCATION, JOB).
  • Numeric variables with missing values included YOJ (6%), INCOME (5%), HOME_VAL (6%), CAR_AGE (6%), and AGE (<1%).
  • Most of the numeric variables had a minimum of zero.
  • One variable, CAR_AGE, had a negative value of -3, which didn’t make intuitive sense.
  • Some character variables should have been numeric and vice-versa.
Summary: Training Dataset
variable type missing complete% min max mean
INCOME character 0 1.000 NA NA NA
PARENT1 character 0 1.000 NA NA NA
HOME_VAL character 0 1.000 NA NA NA
MSTATUS character 0 1.000 NA NA NA
SEX character 0 1.000 NA NA NA
EDUCATION character 0 1.000 NA NA NA
JOB character 0 1.000 NA NA NA
CAR_USE character 0 1.000 NA NA NA
BLUEBOOK character 0 1.000 NA NA NA
CAR_TYPE character 0 1.000 NA NA NA
RED_CAR character 0 1.000 NA NA NA
OLDCLAIM character 0 1.000 NA NA NA
REVOKED character 0 1.000 NA NA NA
URBANICITY character 0 1.000 NA NA NA
INDEX numeric 0 1.000 1 10302.0 5151.868
TARGET_FLAG numeric 0 1.000 0 1.0 0.264
TARGET_AMT numeric 0 1.000 0 107586.1 1504.325
KIDSDRIV numeric 0 1.000 0 4.0 0.171
AGE numeric 6 0.999 16 81.0 44.790
HOMEKIDS numeric 0 1.000 0 5.0 0.721
YOJ numeric 454 0.944 0 23.0 10.499
TRAVTIME numeric 0 1.000 5 142.0 33.486
TIF numeric 0 1.000 1 25.0 5.351
CLM_FREQ numeric 0 1.000 0 5.0 0.799
MVR_PTS numeric 0 1.000 0 13.0 1.696
CAR_AGE numeric 510 0.938 -3 28.0 8.328
Summary: Evaluation Dataset
variable type missing complete% min max mean
INCOME character 0 1.000 NA NA NA
PARENT1 character 0 1.000 NA NA NA
HOME_VAL character 0 1.000 NA NA NA
MSTATUS character 0 1.000 NA NA NA
SEX character 0 1.000 NA NA NA
EDUCATION character 0 1.000 NA NA NA
JOB character 0 1.000 NA NA NA
CAR_USE character 0 1.000 NA NA NA
BLUEBOOK character 0 1.000 NA NA NA
CAR_TYPE character 0 1.000 NA NA NA
RED_CAR character 0 1.000 NA NA NA
OLDCLAIM character 0 1.000 NA NA NA
REVOKED character 0 1.000 NA NA NA
URBANICITY character 0 1.000 NA NA NA
TARGET_FLAG logical 2141 0.000 NA NA NA
TARGET_AMT logical 2141 0.000 NA NA NA
INDEX numeric 0 1.000 3 10300 5150.099
KIDSDRIV numeric 0 1.000 0 3 0.163
AGE numeric 1 1.000 17 73 45.017
HOMEKIDS numeric 0 1.000 0 5 0.717
YOJ numeric 94 0.956 0 19 10.379
TRAVTIME numeric 0 1.000 5 105 33.152
TIF numeric 0 1.000 1 25 5.245
CLM_FREQ numeric 0 1.000 0 5 0.809
MVR_PTS numeric 0 1.000 0 12 1.766
CAR_AGE numeric 129 0.940 0 26 8.183

1.2 Distributions

The distributions of many of the numeric variables appeared highly skewed and non-normal. As part of our data preparation, we used power transformations to determine whether transforming variables toward more normal distributions improved our models’ efficacy.

By checking the frequency of the variables with few unique values (HOMEKIDS, KIDSDRIV, CLM_FREQ), we assumed that these variables might perform better if transformed to binary (1 if the value is greater than 0). As demonstrated by the graph below, more than 50% of the data for these variables was 0:

We found that the distribution of factor variables could help determine if transformations should be applied. For example, variable PARENT1 could become binary, and CAR_AGE could be transformed by putting it into buckets.

This analysis also suggested that being a single parent was correlated with crashes, and that drivers with a high school education, blue-collar employees, SUV drivers, and drivers with revoked licenses were also more likely to be in a car crash.


1.3 Box Plots

We observed large ranges for INCOME, KIDSDRIV, and OLDCLAIM. BLUEBOOK, INCOME, OLDCLAIM, and TRAVTIME seemed to have a high frequency of outliers. Additionally, it seemed that people with higher incomes, car ages, and home values were less likely to get into a car crash.


1.4 Scatter Plots

Interestingly, none of our predictors appeared to have a strong linear relationship with our TARGET_AMT response variable, even though linearity is a primary assumption of linear regression. This suggested that alternative methods might be more successful in modeling these relationships.


1.5 Correlation Matrix

We observed substantial multi-collinearity in our training data, especially between INCOME and HOME_VAL (+58%), OLDCLAIM and CLM_FREQ (+50%), KIDSDRIV and HOMEKIDS (+46%), and AGE and HOMEKIDS (-45%). In building our models, we explored methods for reducing these effects.


2. Data Preparation

2.1 Data types

In order to work with our training and evaluation datasets, we needed to first convert some variables to more useful data types:

  • Convert currency columns from character to numeric: INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM.
  • Convert character columns to factors: TARGET_FLAG, CAR_TYPE, CAR_USE, EDUCATION, JOB, MSTATUS, PARENT1, RED_CAR, REVOKED, SEX and URBANICITY.

2.2 Transformations, Outliers and Missing Values

Before we went further, we needed to identify and handle any missing, NA or negative data values so we could perform log transformations and regression. We performed these operations on both the training and evaluation datasets.

First, we applied transformations to clean up and align the formatting of our variables (a condensed code sketch follows the list):

  • Dropped the INDEX variable.
  • Removed “z_” prefixes from all character class values (MSTATUS, SEX, JOB, CAR_TYPE, URBANICITY, EDUCATION).
  • Removed “$” symbols from the currency values (INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM).
  • Updated RED_CAR, replacing [no, yes] values with [No, Yes].
  • Replaced blank JOB values with ‘Unknown’.
  • Transformed HOMEKIDS and KIDSDRIV to binary; the value is TRUE if the original value is greater than 0.
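
A condensed sketch of these cleanup steps, assuming the tidyverse, stringr and readr are loaded (the full code, applied identically to both datasets, is in the appendix):

clean_up <- function(df) {
  df %>%
    dplyr::select(-INDEX) %>%                                # drop the row index
    mutate(across(c(MSTATUS, SEX, JOB, CAR_TYPE,
                    URBANICITY, EDUCATION),
                  ~ str_remove(.x, "^z_"))) %>%              # strip "z_" prefixes
    mutate(RED_CAR = ifelse(RED_CAR == "no", "No", "Yes"),   # align yes/no casing
           JOB     = ifelse(JOB == "", "Unknown", JOB)) %>%  # label blank jobs
    mutate(across(c(INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM),
                  readr::parse_number)) %>%                  # "$67,349" -> 67349
    mutate(KIDSDRIV = factor(as.integer(KIDSDRIV > 0)),      # counts -> binary
           HOMEKIDS = factor(as.integer(HOMEKIDS > 0)))
}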

Next, we manually adjusted two special cases of missing or outlier values (sketched in code after the list):

  • In cases where YOJ was zero and INCOME was NA, we set INCOME to zero to avoid imputing new values over legitimate instances of non-employment.
  • There was also at least one value of CAR_AGE less than zero in the training data; we assumed this was a data collection error and set it to zero (representing a brand-new car).
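
In code, the two adjustments looked like the following (the evaluation set received the same INCOME fix):

train_df <- train_df %>%
  mutate(CAR_AGE = ifelse(CAR_AGE < 0, 0, CAR_AGE)) %>%          # -3 -> 0 (new car)
  mutate(INCOME  = ifelse(YOJ == 0 & is.na(INCOME), 0, INCOME))  # unemployed -> $0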

We used MICE to impute our remaining variables with missing values (AGE, YOJ, CAR_AGE, INCOME and HOME_VAL); a sketch of the call follows the list:

  • It was reasonable to assume that relationships exist between these variables (being older or having more years on the job may correlate with higher income and home value). Taking simple means or medians might have suppressed those relationships, and MICE provided a better imputation.
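
The imputation itself was a single mice() call using predictive mean matching; a sketch:

imputed  <- mice(train_df, m = 5, maxit = 20, method = 'pmm',
                 seed = 500, printFlag = FALSE)  # 5 imputations, predictive mean matching
train_df <- complete(imputed)                    # fill in the missing values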

To give our models more variables to work with, we engineered some additional features (sketched below the list):

  • Created bin values for CAR_AGE, HOME_VAL and TIF.
  • Created two-level factors for HOMEKIDS and KIDSDRIV.
  • Created dummy variables for two-level factors: MALE, MARRIED, LIC_REVOKED, CAR_RED, PRIVATE_USE, SINGLE_PARENT and URBAN.
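
An abridged sketch of the feature engineering (the full set of dummies and bins is in the appendix):

train_df <- train_df %>%
  mutate(
    MALE        = factor(ifelse(SEX == "M", 1, 0)),
    MARRIED     = factor(ifelse(MSTATUS == "Yes", 1, 0)),
    LIC_REVOKED = factor(ifelse(REVOKED == "Yes", 1, 0)),
    PRIVATE_USE = factor(ifelse(CAR_USE == "Private", 1, 0)),
    URBAN       = factor(ifelse(URBANICITY == "Highly Urban/ Urban", 1, 0)),
    TIF_BIN     = cut(TIF, breaks = quantile(TIF), labels = FALSE,
                      include.lowest = TRUE)   # quartile buckets
  )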

Next, we considered power transformations for variables with skewed distributions (a sketch follows the list). For example, our numeric response variable TARGET_AMT was a good candidate for transformation: its distribution was very highly skewed, and approximately normal residuals are assumed when applying linear regression.

  • Log transformations were applied to TARGET_AMT, OLDCLAIM and INCOME to move their distributions from right-skewed toward normal.
  • Box-Cox transformations were applied to BLUEBOOK, TRAVTIME and TIF.
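
A sketch of the transformations, using MASS::boxcox to search for each lambda and forecast::BoxCox to apply it:

bc     <- boxcox(lm(train_df$BLUEBOOK ~ 1))    # profile log-likelihood over lambda
lambda <- bc$x[which.max(bc$y)]                # pick the lambda maximizing likelihood
train_df$BLUEBOOK <- BoxCox(train_df$BLUEBOOK, lambda)

train_df$OLDCLAIM <- log(train_df$OLDCLAIM + 1)  # +1 guards against log(0)
train_df$INCOME   <- log(train_df$INCOME + 1)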

A summary and distributions of the final, transformed training dataset appear below (with a temporary numeric variable CAR_CRASH representing the response variable for visualization purposes). Post-transformation, the distributions of many variables became closer to normal.

Summary: Transformed Training Dataset
variable type missing complete% min max mean
TARGET_FLAG factor 0 1 NA NA NA
KIDSDRIV factor 0 1 NA NA NA
HOMEKIDS factor 0 1 NA NA NA
EDUCATION factor 0 1 NA NA NA
JOB factor 0 1 NA NA NA
CAR_TYPE factor 0 1 NA NA NA
MALE factor 0 1 NA NA NA
MARRIED factor 0 1 NA NA NA
LIC_REVOKED factor 0 1 NA NA NA
CAR_RED factor 0 1 NA NA NA
PRIVATE_USE factor 0 1 NA NA NA
SINGLE_PARENT factor 0 1 NA NA NA
URBAN factor 0 1 NA NA NA
TARGET_AMT numeric 0 1 0.000 11.586 2.183
AGE numeric 0 1 16.000 81.000 44.782
YOJ numeric 0 1 0.000 23.000 10.482
INCOME numeric 0 1 0.000 12.813 9.927
HOME_VAL numeric 0 1 0.000 885282.000 154644.846
TRAVTIME numeric 0 1 2.999 45.614 15.082
BLUEBOOK numeric 0 1 62.211 381.011 182.315
TIF numeric 0 1 0.000 4.702 1.634
OLDCLAIM numeric 0 1 0.000 10.951 3.386
CLM_FREQ numeric 0 1 0.000 5.000 0.799
MVR_PTS numeric 0 1 0.000 13.000 1.696
CAR_AGE numeric 0 1 0.000 28.000 8.340
CAR_AGE_BIN numeric 0 1 1.000 4.000 2.477
HOME_VAL_BIN numeric 0 1 1.000 4.000 2.451
TIF_BIN numeric 0 1 1.000 4.000 2.412

Finally, we split our training data into train (75%) and validation (25%) datasets to evaluate model performance before we proceeded to prediction. Slightly different variations were used depending on each model’s needs and ability to handle binned variables.
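
A sketch of the split, using caret's createDataPartition to preserve the class balance of TARGET_FLAG:

set.seed(1233)
idx        <- createDataPartition(train_transformed$TARGET_FLAG, p = 0.75, list = FALSE)
train_part <- train_transformed[idx, ]    # 75% used to fit models
valid_part <- train_transformed[-idx, ]   # 25% held out for validation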


3. Binary Logistic Regression

We used Binary Logistic Regression to classify our response variable TARGET_FLAG, modeling the probability of a car crash for a given observation.

3.1 Logistic Model 1 - Original Data

Our first step was to build a logistic regression based on the original training dataset with minimal cleanup and imputation of missing values but without any power transformations for normal distributions.

We applied stepwise feature selection using stepAIC to reduce the predictor set and optimize the model's AIC.
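
A sketch of the fit and stepwise selection, mirroring the appendix code:

log_model_1     <- glm(TARGET_FLAG ~ ., data = train_data,
                       family = binomial(link = "logit"))  # full model, all predictors
log_model_1_aic <- stepAIC(log_model_1, trace = FALSE)     # MASS: stepwise AIC search
summ(log_model_1_aic)                                      # jtools summary shown below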

Observations 6121
Dependent variable TARGET_FLAG
Type Generalized linear model
Family binomial
Link logit
χ²(32) 1584.51
Pseudo-R² (Cragg-Uhler) 0.33
Pseudo-R² (McFadden) 0.22
AIC 5545.63
BIC 5767.37
Est. S.E. z val. p
(Intercept) -2.75 0.23 -11.86 0.00
KIDSDRIV1 0.63 0.11 5.75 0.00
HOMEKIDS1 0.23 0.10 2.21 0.03
INCOME -0.00 0.00 -3.26 0.00
HOME_VAL -0.00 0.00 -2.79 0.01
EDUCATIONBachelors -0.39 0.13 -3.07 0.00
EDUCATIONHigh School 0.02 0.11 0.22 0.83
EDUCATIONMasters -0.25 0.19 -1.33 0.18
EDUCATIONPhD 0.00 0.23 0.02 0.99
JOBClerical 0.11 0.12 0.93 0.35
JOBDoctor -0.87 0.33 -2.67 0.01
JOBHome Maker -0.20 0.17 -1.16 0.25
JOBLawyer -0.14 0.21 -0.67 0.50
JOBManager -0.92 0.16 -5.69 0.00
JOBProfessional -0.17 0.14 -1.22 0.22
JOBStudent -0.07 0.14 -0.47 0.64
JOBUnknown -0.37 0.21 -1.73 0.08
TRAVTIME 0.01 0.00 6.63 0.00
BLUEBOOK -0.00 0.00 -4.01 0.00
TIF -0.06 0.01 -6.75 0.00
CAR_TYPEPanel Truck 0.69 0.17 3.96 0.00
CAR_TYPEPickup 0.59 0.11 5.12 0.00
CAR_TYPESports Car 0.98 0.12 7.93 0.00
CAR_TYPESUV 0.73 0.10 7.35 0.00
CAR_TYPEVan 0.69 0.14 4.95 0.00
OLDCLAIM -0.00 0.00 -3.32 0.00
CLM_FREQ 0.20 0.03 5.96 0.00
MVR_PTS 0.11 0.02 7.24 0.00
MARRIED1 -0.55 0.10 -5.43 0.00
LIC_REVOKED1 0.90 0.11 8.55 0.00
PRIVATE_USE1 -0.73 0.11 -6.91 0.00
SINGLE_PARENT1 0.22 0.14 1.56 0.12
URBAN1 2.42 0.13 18.22 0.00
Standard errors: MLE

Model Performance

The following variables had highly positive coefficients in our model, indicating a higher chance to be involved in a car crash:

Positive Coefficients
Est p
URBAN1 2.422 0.000
CAR_TYPESports Car 0.983 0.000
LIC_REVOKED1 0.901 0.000
CAR_TYPESUV 0.730 0.000
CAR_TYPEPanel Truck 0.692 0.000
CAR_TYPEVan 0.692 0.000
KIDSDRIV1 0.632 0.000
CAR_TYPEPickup 0.588 0.000
HOMEKIDS1 0.228 0.027

Variables with highly negative coefficients indicated a lower chance to be involved in a car crash:

Negative Coefficients
Est p
(Intercept) -2.753 0.000
JOBManager -0.924 0.000
JOBDoctor -0.870 0.008
PRIVATE_USE1 -0.727 0.000
MARRIED1 -0.553 0.000
EDUCATIONBachelors -0.387 0.002

The null deviance of 7064.14 defined how well the target variable could be predicted by a model with only an intercept term.

The residual deviance of 5479.63 defined how well the target variable can be predicted by the AIC model that we fit with the predictor variables listed above. The lower the value, the better the model’s predictions of the response variable.

The p-value associated with this Chi-Square Statistic was 0 (less than .05), so the model could be useful.
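
This p-value can be verified directly from the difference in deviances; a one-line check:

# difference in deviance ~ chi-square with df = number of parameters added
pchisq(7064.14 - 5479.63, df = 32, lower.tail = FALSE)  # effectively 0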

The Akaike information criterion (AIC) was 5545.63. The lower the AIC value, the better the model’s ability to fit the data. We used this metric to compare the relative performance of the three Logistic models.

The Accuracy was 0.797, Sensitivity was 0.42, and Specificity was 0.93.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1399  312
##          1  103  226
##                                           
##                Accuracy : 0.7966          
##                  95% CI : (0.7784, 0.8138)
##     No Information Rate : 0.7363          
##     P-Value [Acc > NIR] : 1.293e-10       
##                                           
##                   Kappa : 0.4016          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4201          
##             Specificity : 0.9314          
##          Pos Pred Value : 0.6869          
##          Neg Pred Value : 0.8177          
##              Prevalence : 0.2637          
##          Detection Rate : 0.1108          
##    Detection Prevalence : 0.1613          
##       Balanced Accuracy : 0.6757          
##                                           
##        'Positive' Class : 1               
## 

Model Assumptions

We evaluated the modeling assumptions using standard diagnostic plots, marginal model plots, and a Variance Inflation Factor (VIF) to assess collinearity.

The resulting plot 1 below (predicted values vs. residuals) helped identify over- or under-prediction. At lower predicted values, the model over-predicted, with the predicted proportions being larger than the observed proportions.

On the other hand, we saw under-prediction for a couple of predictor patterns at higher predicted values. The labeled points refer to rows in our data.

Examining the Standardized Pearson Residuals plot, the model seemed good through the middle range, but there were extremes on the left. Residuals from a binomial logistic regression need not be normal; their rough normality here was simply further evidence of a decent-fitting model.

The Standardized Residuals plot seemed to show constant variance, but there were some outliers.

The marginal model plot is a useful graphical method for deciding whether a logistic regression model is adequate. The marginal model plots below showed reasonable agreement between the two sets of fits, indicating that this was a valid model.

All variables had a VIF of less than 5 except JOB and EDUCATION, indicating that multi-collinearity may still have been a problem for this model (a sketch of the VIF call follows the table).

Variance Inflation Factor
GVIF df adj
JOB 19.083 8 1.202
EDUCATION 8.347 4 1.304
INCOME 2.828 1 1.682
CAR_TYPE 2.624 5 1.101
PRIVATE_USE 2.418 1 1.555
SINGLE_PARENT 2.345 1 1.531
MARRIED 2.296 1 1.515
HOMEKIDS 2.276 1 1.509
HOME_VAL 2.032 1 1.425
BLUEBOOK 1.796 1 1.340
OLDCLAIM 1.641 1 1.281
CLM_FREQ 1.468 1 1.211
KIDSDRIV 1.363 1 1.168
LIC_REVOKED 1.317 1 1.148
MVR_PTS 1.159 1 1.076
URBAN 1.134 1 1.065
TRAVTIME 1.038 1 1.019
TIF 1.011 1 1.006
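
The table above can be reproduced by calling car::vif() on the fitted model; a sketch:

vif(log_model_1_aic)   # for factor terms, returns GVIF, Df and GVIF^(1/(2*Df))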

3.2 Logistic Model 2 - Transformed

Next, we built a Logistic regression based on the original training dataset with cleanup and imputation of missing values, plus power transformations for normal distributions.

We applied stepwise feature selection using stepAIC to reduce the predictor set and optimize the model's AIC.

Observations 6121
Dependent variable TARGET_FLAG
Type Generalized linear model
Family binomial
Link logit
χ²(33) 1631.73
Pseudo-R² (Cragg-Uhler) 0.34
Pseudo-R² (McFadden) 0.23
AIC 5500.41
BIC 5728.87
Est. S.E. z val. p
(Intercept) -1.49 0.32 -4.68 0.00
KIDSDRIV1 0.61 0.11 5.51 0.00
HOMEKIDS1 0.16 0.10 1.58 0.11
YOJ 0.03 0.01 2.32 0.02
INCOME -0.10 0.02 -4.70 0.00
EDUCATIONBachelors -0.39 0.13 -3.12 0.00
EDUCATIONHigh School 0.01 0.11 0.10 0.92
EDUCATIONMasters -0.32 0.19 -1.71 0.09
EDUCATIONPhD -0.38 0.22 -1.70 0.09
JOBClerical -0.03 0.12 -0.27 0.79
JOBDoctor -0.97 0.34 -2.83 0.00
JOBHome Maker -0.38 0.19 -2.00 0.05
JOBLawyer -0.29 0.22 -1.32 0.19
JOBManager -1.00 0.16 -6.12 0.00
JOBProfessional -0.26 0.14 -1.86 0.06
JOBStudent -0.47 0.17 -2.78 0.01
JOBUnknown -0.31 0.21 -1.47 0.14
TRAVTIME 0.04 0.01 7.08 0.00
BLUEBOOK -0.00 0.00 -4.83 0.00
CAR_TYPEPanel Truck 0.57 0.17 3.34 0.00
CAR_TYPEPickup 0.55 0.12 4.71 0.00
CAR_TYPESports Car 0.97 0.13 7.66 0.00
CAR_TYPESUV 0.74 0.10 7.42 0.00
CAR_TYPEVan 0.70 0.14 5.00 0.00
OLDCLAIM 0.02 0.01 1.73 0.08
CLM_FREQ 0.08 0.05 1.50 0.13
MVR_PTS 0.10 0.02 6.24 0.00
MARRIED1 -0.51 0.10 -5.10 0.00
LIC_REVOKED1 0.67 0.09 7.08 0.00
PRIVATE_USE1 -0.71 0.11 -6.66 0.00
SINGLE_PARENT1 0.28 0.14 2.01 0.04
URBAN1 2.48 0.13 18.61 0.00
HOME_VAL_BIN -0.17 0.04 -4.17 0.00
TIF_BIN -0.22 0.03 -7.50 0.00
Standard errors: MLE

Model Performance

The following variables had highly positive coefficients in our model, indicating a much higher chance to be involved in a car crash:

Positive Coefficients
Est. p
URBAN1 2.478 0.000
CAR_TYPESports Car 0.973 0.000
CAR_TYPESUV 0.742 0.000
CAR_TYPEVan 0.701 0.000
LIC_REVOKED1 0.666 0.000
KIDSDRIV1 0.610 0.000
CAR_TYPEPanel Truck 0.566 0.001
CAR_TYPEPickup 0.551 0.000
SINGLE_PARENT1 0.283 0.044

Variables with highly negative coefficients indicated a lower chance to be involved in a car crash:

Negative Coefficients
Est. p
(Intercept) -1.488 0.000
JOBManager -1.002 0.000
JOBDoctor -0.970 0.005
PRIVATE_USE1 -0.705 0.000
MARRIED1 -0.515 0.000
JOBStudent -0.469 0.006
EDUCATIONBachelors -0.391 0.002
JOBHome Maker -0.382 0.046
TIF_BIN -0.220 0.000

The results were similar to Logistic Model 1.

The null deviance was 7064.14, and the residual deviance was 5432.41.

The p-value associated with the Chi-Square Statistic was 0 (less than .05), so the model could be useful.

The Akaike information criterion (AIC) was 5500.41.

The Accuracy was 0.794, Sensitivity was 0.43, and Specificity was 0.92.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1389  308
##          1  113  230
##                                          
##                Accuracy : 0.7936         
##                  95% CI : (0.7754, 0.811)
##     No Information Rate : 0.7363         
##     P-Value [Acc > NIR] : 9.692e-10      
##                                          
##                   Kappa : 0.3986         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.4275         
##             Specificity : 0.9248         
##          Pos Pred Value : 0.6706         
##          Neg Pred Value : 0.8185         
##              Prevalence : 0.2637         
##          Detection Rate : 0.1127         
##    Detection Prevalence : 0.1681         
##       Balanced Accuracy : 0.6761         
##                                          
##        'Positive' Class : 1              
## 

Model Assumptions

The resulting plots were similar to Model 1, with over-prediction at lower predicted values (predicted proportions larger than the observed proportions) and under-prediction at higher predicted values.

Examining the Standardized Pearson Residuals plot, the model again seemed good through the middle range, with extremes on the left. As with Model 1, the rough normality of the residuals was simply further evidence of a decent-fitting model.

The Standardized Residuals plot seemed to show constant variance, though there were some outliers.

The marginal model plots showed reasonable agreement across the two sets of fits indicating this was a valid model. The plots looked better than for Model 1, with data and model values matching almost everywhere.

All variables had a VIF less than 5 besides JOB and EDUCATION, indicating multi-collinearity may still have been a problem for the model.

Variance Inflation Factor
GVIF df adj
JOB 30.632 8 1.238
EDUCATION 7.314 4 1.282
INCOME 4.124 1 2.031
OLDCLAIM 3.598 1 1.897
CLM_FREQ 3.365 1 1.834
CAR_TYPE 2.522 5 1.097
PRIVATE_USE 2.429 1 1.559
YOJ 2.376 1 1.541
SINGLE_PARENT 2.294 1 1.515
HOMEKIDS 2.255 1 1.502
MARRIED 2.225 1 1.492
HOME_VAL_BIN 1.776 1 1.333
BLUEBOOK 1.658 1 1.287
KIDSDRIV 1.355 1 1.164
MVR_PTS 1.235 1 1.111
URBAN 1.146 1 1.071
TRAVTIME 1.037 1 1.019
LIC_REVOKED 1.029 1 1.015
TIF_BIN 1.012 1 1.006

3.3 Logistic Model 3 - Lasso

Lasso regression was a good candidate for this dataset since we were dealing with a large number of predictors, many of them correlated. Lasso helps identify the most important variables and reduces the model’s complexity.

The cv.glmnet() function was also used for our logistic regression model: k-fold cross-validation was performed with variable selection via lasso regularization. The following attribute settings were used:

  • type.measure - Left at its default for the binomial family, so cross-validation minimized the binomial deviance (as the output below confirms).
  • nfolds = 5 - Given the size of the training dataset, we used 5-fold cross-validation.
  • family = binomial - For logistic regression, the family attribute of the function was set to binomial.
  • link = logit - For this model, we chose the default link function for a logistic model.
  • alpha = 1 - The alpha value of 1 set the variable shrinkage method to lasso.
  • standardize = TRUE - Finally, we explicitly set the standardization attribute to TRUE; this normalized the predictor variables around a mean of zero and a standard deviation of one before modeling.
## 
## Call:  cv.glmnet(x = X, y = Y, nfolds = 5, family = "binomial", link = "logit",      standardize = TRUE, alpha = 1) 
## 
## Measure: Binomial Deviance 
## 
##       Lambda Index Measure       SE Nonzero
## min 0.001145    49  0.9092 0.004394      34
## 1se 0.004623    34  0.9133 0.003342      28

Model Performance

We explored model performance by extracting coefficients for values of lambda.min and lambda.1se.

  • The coefficients extracted using lambda.min minimized the mean cross-validated error. The resulting model included 35 non-zero coefficients (including the intercept) and had an AICc of -1514, the lower of the two, making it the higher-performing model.

  • The coefficients extracted using lambda.1se produced the most regularized model (cross-validated error was within one standard error of the minimum). The resulting model included 29 non-zero coefficients (including the intercept) and had an AICc of -1485.

We used the coefficients extracted using lambda.min in our predictions.
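
A sketch of the extraction, assuming cv_fit holds the fitted cv.glmnet object (the name is illustrative):

coef(cv_fit, s = "lambda.min")   # sparse coefficient vector at the minimum-error lambda
coef(cv_fit, s = "lambda.1se")   # the more regularized alternative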

The confusion matrix highlighted an accuracy of 79.6%.

##          True
## Predicted    0   1 Total
##     0     1400 314  1714
##     1      102 224   326
##     Total 1502 538  2040
## 
##  Percent Correct:  0.7961
## $deviance
## lambda.min 
##  0.8955268 
## attr(,"measure")
## [1] "Binomial Deviance"
## 
## $class
## lambda.min 
##  0.2127103 
## attr(,"measure")
## [1] "Misclassification Error"
## 
## $auc
## [1] 0.8128818
## attr(,"measure")
## [1] "AUC"
## 
## $mse
## lambda.min 
##  0.2912111 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.min 
##  0.5876878 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -1514.227
## 
## $BIC
## [1] -1286.156
## $AICc
## [1] -1485.047
## 
## $BIC
## [1] -1297.168

Model Assumptions

Examining the 35 non-zero coefficients for the selected lambda value of lambda.min, we observed that the URBAN1 predictor variable had the largest impact on the prediction of a car crash, with a log-odds coefficient of roughly 2.3. Reviewing the top five predictor variables that impacted the likelihood associated with an accident:

  • URBAN1 - working or living in an urban neighborhood increased the likelihood associated with a crash
  • CAR_TYPE Sports Car - sports cars increased the likelihood
  • LIC_REVOKED1 - a history of having a revoked license increased the likelihood
  • JOB Manager - being a manager reduced the likelihood
  • PRIVATE_USE1 - using a car for private activities reduced the likelihood

The AGE, JOB Student and CAR_RED1 predictor variables dropped out of the final model, with coefficients shrunk to zero (several other dollar-scale variables also round to 0.000 at the precision shown in the table below).

Model Coefficients
Est
URBAN1 2.343
CAR_TYPESports Car 0.943
LIC_REVOKED1 0.851
CAR_TYPESUV 0.706
KIDSDRIV1 0.609
CAR_TYPEVan 0.538
CAR_TYPEPickup 0.491
CAR_TYPEPanel Truck 0.484
HOMEKIDS1 0.231
SINGLE_PARENT1 0.215
CLM_FREQ 0.186
JOBClerical 0.169
MVR_PTS 0.110
MALE1 0.084
EDUCATIONHigh School 0.028
TRAVTIME 0.014
AGE 0.000
JOBStudent 0.000
CAR_RED1 0.000
HOME_VAL 0.000
INCOME 0.000
OLDCLAIM 0.000
BLUEBOOK 0.000
CAR_AGE -0.002
YOJ -0.004
JOBLawyer -0.011
EDUCATIONPhD -0.020
TIF -0.054
JOBProfessional -0.075
JOBHome Maker -0.081
JOBUnknown -0.226
EDUCATIONMasters -0.262
EDUCATIONBachelors -0.358
MARRIED1 -0.521
JOBDoctor -0.693
PRIVATE_USE1 -0.758
JOBManager -0.807
(Intercept) -2.715

The lasso regression helped address the multi-collinearity issue by retaining the variables with the largest coefficients and shrinking the rest toward, or exactly to, zero.


3.4 Model Selection

To choose among our three models for prediction, we focused on performance only: comparing ROC curves, performance statistics and confusion matrices.

The models differed only modestly in the variables selected as predictors. Models 1 and 2 were similar, as we used stepwise selection (stepAIC) for both, although the application of Box-Cox transformations did lower the AIC result for Model 2.

Even though Model 3’s accuracy was slightly lower than Model 1’s, its AICc and deviance were by far the lowest of the three, so we chose it for predictions. (A pROC sketch of this kind of comparison appears after the table below.)

Logistic Model Comparison
Model Accuracy Classification error rate F1 Deviance R2 Sensitivity Specificity Precision AUC AIC
Model #1: Original data 0.797 0.203 0.521 5479.625 0.224 0.420 0.931 0.687 0.6757496 5545.625
Model #2: Transformed data 0.794 0.206 0.522 5432.408 0.231 0.428 0.925 0.671 0.6761381 5500.408
Model #3: Lasso 0.796 0.204 0.519 0.893 NA 0.416 0.932 0.687 0.8145843 -1514.227
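
ROC/AUC comparisons of this kind can be produced with the pROC package; a sketch, where probs_1 and probs_2 are hypothetical vectors of validation-set probabilities from two models:

roc_1 <- roc(valid_data$TARGET_FLAG, probs_1)   # one ROC curve per model
roc_2 <- roc(valid_data$TARGET_FLAG, probs_2)
auc(roc_1); auc(roc_2)                          # areas under the curves
plot(roc_1); plot(roc_2, add = TRUE)            # overlay curves for comparison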

4. Multiple Linear Regression

We used Multiple Linear Regression to model the TARGET_AMT response variable, the estimated cost of a crash for a given observation. We only included observations where TARGET_FLAG=1 since if there was no crash, then no costs were incurred.

4.1 Linear Model 1 - Original Data

Our first step was to build a linear regression based on the original training dataset with minimal cleanup and imputation of missing values but without any power transformations for normal distributions.

We applied stepwise feature selection using stepAIC to reduce the predictor set and optimize the model's AIC.

With stepAIC, we selected the variables that contributed most to our model's performance. Based on the results of the function, we included BLUEBOOK, MVR_PTS, MALE1, and LIC_REVOKED1.
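
A sketch of the fit and selection, using the crash-only training partition:

linear_model_1     <- lm(TARGET_AMT ~ ., data = linear_train_data)  # crash records only
linear_model_1_aic <- stepAIC(linear_model_1, trace = FALSE)        # stepwise AIC search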

Observations 1617
Dependent variable TARGET_AMT
Type OLS linear regression
F(4,1612) 9.1841
R² 0.0223
Adj. R² 0.0199
Est. S.E. t val. p
(Intercept) 3502.9583 479.1917 7.3101 0.0000
BLUEBOOK 0.1278 0.0241 5.2951 0.0000
MVR_PTS 126.7957 76.4841 1.6578 0.0976
MALE1 634.2666 400.9366 1.5820 0.1139
LIC_REVOKED1 -704.6944 489.8423 -1.4386 0.1505
Standard errors: OLS

Model Performance

The F-statistic was 9.18, the adjusted R-squared was 0.02, and of the terms retained, only the intercept and BLUEBOOK had statistically significant p-values.

RMSE was 7987.76, a measure of the typical distance between predictions and the measured true values.

The adjusted R2 indicated that only 2% of the variance in the response variable could be explained by the predictor variables.

Positive coefficients indicated an increase in the cost of a crash. In this model Males, cars with high Bluebook values, and drivers with a high number of Motor Vehicle Points tended to have more expensive crashes.

Similarly, negative coefficients indicated less-expensive crashes. In this model, drivers with License Revocations tended to have less expensive crashes.

While the F-statistic was low, its associated p-value (for F = 9.18 on 4 and 1612 degrees of freedom) was in fact statistically significant; the model was better than an intercept-only fit but explained very little variance.

Model Assumptions

We evaluated the linear modeling assumptions using standard diagnostic plots, and Variance Inflation Factor (VIF) to assess collinearity.

Our initial review of the diagnostic plots for this model showed some deviation from linear modeling assumptions. The Q-Q plot showed some deviations from the normal distribution at the left end. The residuals vs. fitted and standardized residuals vs. fitted plots showed a curve in the main cluster, which indicated that we did not have constant variance.

The variables included in our model had a VIF of 1, indicating the variables were not correlated.


4.2 Linear Model 2 - Transformed

Next, we built a linear model based on the training dataset with cleanup and imputation of missing values, plus power transformations for normal distributions.

We applied stepwise feature selection using stepAIC to reduce the predictor set and optimize the model's AIC.

Based on the results of the function, we included BLUEBOOK, MVR_PTS, CLM_FREQ, SINGLE_PARENT1 and EDUCATION.

Observations 1617
Dependent variable TARGET_AMT
Type OLS linear regression
F(8,1608) 4.8536
R² 0.0236
Adj. R² 0.0187
Est. S.E. t val. p
(Intercept) 7.9712 0.0859 92.7957 0.0000
EDUCATIONBachelors -0.0812 0.0622 -1.3055 0.1919
EDUCATIONHigh School -0.0240 0.0570 -0.4216 0.6734
EDUCATIONMasters 0.0562 0.0707 0.7961 0.4261
EDUCATIONPhD 0.1449 0.0972 1.4897 0.1365
BLUEBOOK 0.0017 0.0004 3.9508 0.0001
CLM_FREQ -0.0321 0.0169 -1.8999 0.0576
MVR_PTS 0.0197 0.0080 2.4573 0.0141
SINGLE_PARENT1 0.0942 0.0484 1.9443 0.0520
Standard errors: OLS

Model Performance

The F-statistic was 4.85, and of the terms retained, two predictors (BLUEBOOK and MVR_PTS) had statistically significant p-values.

The Adjusted R-Squared of 0.02 was similar to Model 1, only explaining about 2% of the total variance in the response variable TARGET_AMT.

The RMSE was 0.82; note that this was measured on the log-transformed response, so it is not directly comparable to Model 1's dollar-scale RMSE.

Positive coefficients indicated an increase in the cost of a crash. In this model Single parents, people with Masters and PhD degrees, and drivers with a high number of Motor Vehicle Points tended to have more expensive crashes.

Similarly, negative coefficients indicated less-expensive crashes. In this model, drivers with High School and Bachelor’s degrees, and drivers with high numbers of previous claims, tended to have less expensive crashes.

Model Assumptions

An examination of the residuals indicated most of the key assumptions for linear regression were met: the Residuals vs Fitted plot showed a more constant variability of the residuals, and the Q-Q plot indicated a much better level of normality compared to Model 1.

The variables included in our model had VIFs of 1 (or slightly above), indicating that, for the most part, the variables were not correlated.


4.3 Linear Model 3 - Lasso

The cv.glmnet() function was used to perform k-fold cross-validation with variable selection using lasso regularization. The following attribute settings were selected for the model:

  • type.measure = “mse” - The type.measure was set to minimize the mean squared error of the model.
  • nfolds = 10 - Given the size of the dataset, we defaulted to 10-fold cross-validation.
  • family = gaussian - For linear regression, the family attribute was set to gaussian.
  • alpha = 1 - The alpha value of 1 set the variable shrinkage method to lasso.
  • standardize = TRUE - Finally, we explicitly set the standardization attribute to TRUE; this normalized the predictor variables around a mean of zero and a standard deviation of one before modeling.

The resulting model was explored by extracting coefficients at two different values for lambda, lambda.min and lambda.1se respectively.

The coefficients extracted using lambda.min minimized the mean cross-validated error. The resulting model included one non-zero coefficient and had an AICc of -1584023783.

The coefficients extracted using lambda.1se did not produce a valid model - all variables dropped out.

The coefficients extracted using lambda.min resulted in only a single valid variable for the linear regression, BLUEBOOK.

This variable made intuitive sense - as the cost of the vehicle increases, so must the cost of the repair. However, we had imagined that other variables would factor more significantly into the relative cost of an accident. As mentioned earlier, the dataset showed a high degree of multi-collinearity.

## 
## Call:  cv.glmnet(x = X, y = Y, type.measure = "mse", nfolds = 10, family = "gaussian",      standardize = TRUE, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index  Measure       SE Nonzero
## min  424.8    11 64549038 13487467       1
## 1se 1077.1     1 65317132 13619746       0

## $mse
## lambda.min 
##   64278815 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.min 
##   3708.151 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -1584023783
## 
## $BIC
## [1] -1584023777
## $AICc
## [1] 0
## 
## $BIC
## [1] 0
Model Coefficients
Est
(Intercept) 4655.667
BLUEBOOK 0.079
KIDSDRIV1 0.000
AGE 0.000
HOMEKIDS1 0.000
YOJ 0.000
INCOME 0.000
HOME_VAL 0.000
EDUCATIONBachelors 0.000
EDUCATIONHigh School 0.000
EDUCATIONMasters 0.000
EDUCATIONPhD 0.000
JOBClerical 0.000
JOBDoctor 0.000
JOBHome Maker 0.000
JOBLawyer 0.000
JOBManager 0.000
JOBProfessional 0.000
JOBStudent 0.000
JOBUnknown 0.000
TRAVTIME 0.000
TIF 0.000
CAR_TYPEPanel Truck 0.000
CAR_TYPEPickup 0.000
CAR_TYPESports Car 0.000
CAR_TYPESUV 0.000
CAR_TYPEVan 0.000
OLDCLAIM 0.000
CLM_FREQ 0.000
MVR_PTS 0.000
CAR_AGE 0.000
MALE1 0.000
MARRIED1 0.000
LIC_REVOKED1 0.000
CAR_RED1 0.000
PRIVATE_USE1 0.000
SINGLE_PARENT1 0.000
URBAN1 0.000

Model Performance

The lasso model using coefficients extracted at lambda.min was used to predict the 536 test cases and compare the predicted insurance TARGET_AMT to the actual cost of a car crash.

Using the yardstick package to measure model performance, the percentage-based metrics were poor (mape = 165, smape = 58.0, mpe = -148), while mase = 0.696 and rmse = 6614.

## # A tibble: 5 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mape    standard     165.   
## 2 smape   standard      58.0  
## 3 mase    standard       0.696
## 4 mpe     standard    -148.   
## 5 rmse    standard    6614.
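
A sketch of the metric computation, where results, actual and predicted are hypothetical names for the validation results frame and its columns:

reg_metrics <- metric_set(mape, smape, mase, mpe, rmse)      # bundle of yardstick metrics
reg_metrics(results, truth = actual, estimate = predicted)   # returns a tibble like the one above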

By analyzing graphs of our predictions and errors, we discovered a shortcoming of the lasso model: it failed to predict any costs less than $1,000.

Model Assumptions

The lasso method reduced multi-collinearity by shrinking coefficient magnitudes, driving the weakest predictors exactly to zero; in this model, only BLUEBOOK survived. This can be a good solution when a small number of predictors carry most of the information.

The Standardized Residuals plot showed increasing variance at higher response variable values.


4.4 Linear Model Selection

The following table summarizes performance metrics for each model on the training dataset.

During the exploratory data analysis, some predictor variables displayed signs of multi-collinearity, potentially causing unstable regression fits.

Results suggested that model performance improved after applying transformations and methods for selecting significant predictor variables, such as lasso and the Variance Inflation Factor.

  • Models 1 and 3 did not meet the assumption requirements for normal distributions and constant variance. Both had high RMSE.
  • Model 2 more closely aligned with the assumption requirements and performed best on MAPE and the related percentage-error metrics.
  • F-statistics for Models 1 and 2 were small, with significant p-values.
  • AIC was lower for Model 2 than Model 1, while Model 3 produced unusual, negative results.
  • Model 3 failed to predict costs of less than $1000.

Overall, Model 2 appeared to produce the best fit.

Linear Model Comparison
Model mape smape mase mpe RMSE AIC Adjusted R2 F-statistic
Model 1: Original data 163.897 58.815 0.711 -145.121 6621.310 3.366049e+04 0.020 9.184
Model 2: Transforms + Stepwise 7.299 7.085 0.681 -1.150 0.821 3.883506e+03 0.019 4.854
Model 3: Lasso 165.092 57.972 0.696 -148.312 6613.987 -1.584024e+09 NA NA

5. Predictions

Crash Predictions

Using Binary Logistic Regression Model 3 (Lasso), we predicted TARGET_FLAG values for the 2141 observations in our evaluation dataset. We applied all training data transformations to the evaluation data as well. The prediction function generated a vector of probabilities between 0 and 1, from which we calculated a binomial prediction by applying a threshold of 0.5.
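
A sketch of the classification step, where eval_X is a hypothetical model matrix built from the transformed evaluation data:

probs       <- predict(cv_fit, newx = eval_X, s = "lambda.min",
                       type = "response")   # P(crash) for each evaluation record
TARGET_FLAG <- as.integer(probs > 0.5)      # 0.5 threshold -> binary prediction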

Cost Predictions

Using Multiple Linear Regression Model 2 (Transformation + Stepwise), we predicted TARGET_AMT values for the 352 observations in our evaluation dataset with a TARGET_FLAG of 1.

We applied all training data transformations to the evaluation data as well. The prediction function generated a vector of values representing the natural log of their “real world” dollar values, so we performed a final back-transformation by exponentiation.
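
A sketch of the back-transformation, where linear_model_2 and eval_crash are illustrative names for the fitted model and the evaluation records predicted to crash:

pred_log   <- predict(linear_model_2, newdata = eval_crash)  # predictions on the log scale
TARGET_AMT <- exp(pred_log)                                  # back-transform to dollars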

We wrote the results to a file “eval_predictions.csv” and saved it to the GitHub repo.


6. Conclusion

Overall, our approach of subsetting the problem into two separate models appeared to be effective. We used a Binary Logistic Regression to model the probability of crashes, and a Multiple Linear Regression to model the associated costs when a car crash occurred.

In both cases, we took steps to transform the data and reduce multi-collinearity for optimal results. For the Logistic model, Lasso Regression enabled us to keep most of the variables in the model but scale their coefficients according to their relative importance. For the Linear model, we used Stepwise feature selection based on AIC to arrive at a minimal feature set, reducing overall error, and producing reasonable outputs.

Crash Factors

The most important predictive factor for car crashes, by far, was living in a high-density Urban environment. Other factors included driving a Sports Car or SUV, having had a license revoked, or having children in the household who drive.

On the other hand, factors that contributed to a lower probability of car crashes included having a Managerial job or being a Doctor, driving a car for Private (non-commercial) uses only, and being Married.

Cost Factors

The most important predictors for higher repair costs included having a high number of MVR points (moving violations), and a high Bluebook valuation of the vehicle. Factors that contributed to a lower predicted cost included a high number of prior Claims, and education levels of Bachelor’s or High School.

Further Questions

While many of these factors made intuitive sense, some caught our attention as candidates for further research and analysis. For example, do drivers with a high number of Claims tend to drive less expensive vehicles, or do they simply tend to get into lots of minor fender-benders instead of major accidents? Similarly, what is the relationship between Single Parents and higher crash likelihoods - even if it is not a strong relationship from a statistical significance standpoint?

For future consideration, we might also wish to look at the imbalanced class distribution between “crash” and “non-crash” events in the training data, where “non-crash” occurs about 75% of the time. Although we could not explore this further, imbalances like this can introduce additional bias to Logistic Regression models and might be addressed through re-sampling or ‘bootstrapping’ training data.



Appendix: R code

## ----setup, include=FALSE-------------------------------------------------------------------------------------------------------------------------------------
# chunks
knitr::opts_chunk$set(echo=FALSE, eval=TRUE, include=TRUE, 
message=FALSE, warning=FALSE, fig.align='center', fig.height=5)

# libraries
library(yardstick)
library(tidyverse)
library(janitor)
library(mice)
library(readr)
library(skimr)
library(ggplot2)
library(reshape2)
library(MASS)
library(forecast)
library(kableExtra)
library(ggpubr)
library(fastDummies)
library(jtools)
library(performance)
library(broom)
library(pROC)

library(ggmosaic)
library(patchwork)

library(caTools)
library(corrplot)
library(Hmisc)
library(glmnet)
library(vip)
library(caret)

library(psych)
library(GGally)
library(campfin)

library(summarytools)
library(sjPlot)
library(olsrr) 
library(car)

## To load the below libraries you might have to do the following in the console first:
####  install.packages("devtools")
####  install.packages("see")
####  devtools::install_github("haleyjeppson/ggmosaic")
####  devtools::install_github("thomasp85/patchwork")

# ggplot
theme_set(theme_bw())


## ----common functions-----------------------------------------------------------------------------------------------------------------------------------------
#' nice_table
#'
#' Render a data frame as a styled kable table.
#'
#' @param df data frame to render
#' @param cap optional table caption
#' @param cols optional column name overrides
#' @param dig number of digits to display
#' @param fw whether to render at full page width
nice_table <- function(df, cap=NULL, cols=NULL, dig=3, fw=F){
  if (is.null(cols)) {c <- colnames(df)} else {c <- cols}
  table <- df %>% 
    kable(caption=cap, col.names=c, digits=dig) %>% 
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      html_font = 'monospace',
      full_width = fw)
  return(table)
}

#' glmnet_cv_aicc
#'
#' Approximate AICc and BIC for a cv.glmnet fit at a chosen lambda.
#'
#' @param fit a fitted cv.glmnet object
#' @param lambda which lambda to use: 'lambda.min' (default) or 'lambda.1se'
#'
#' @return a list with elements AICc and BIC
#' @export
glmnet_cv_aicc <- function(fit, lambda = 'lambda.min'){
  whlm <- which(fit$lambda == fit[[lambda]])
  with(fit$glmnet.fit,
       {
         tLL <- nulldev - nulldev * (1 - dev.ratio)[whlm]
         k <- df[whlm]
         n <- nobs
         return(list('AICc' = - tLL + 2 * k + 2 * k * (k + 1) / (n - k - 1),
                     'BIC' = log(n) * k - tLL))
       })
}

#' coeff2dt
#'
#' Extract coefficients from a glmnet fit as a sorted data frame.
#'
#' @param fitobject a fitted glmnet/cv.glmnet object
#' @param s the lambda value at which to extract coefficients
#'
#' @return a data frame of coefficient names and values, in decreasing order
#' @export
coeff2dt <- function(fitobject, s) {
  coeffs <- coef(fitobject, s) 
  coeffs.dt <- data.frame(name = coeffs@Dimnames[[1]][coeffs@i + 1], coefficient = coeffs@x) 

  # reorder the variables in term of coefficients
  return(coeffs.dt[order(coeffs.dt$coefficient, decreasing = T),])
}


## # only required for final knit

## h4, h5, .h4, .h5 {margin-top: 20px !important;}


## ----data-----------------------------------------------------------------------------------------------------------------------------------------------------
# read train and evaluation datasets
train_df <- read.csv("https://raw.githubusercontent.com/cliftonleesps/data621_group1/main/hw4/insurance_training_data.csv")
eval_df <- read.csv("https://raw.githubusercontent.com/cliftonleesps/data621_group1/main/hw4/insurance-evaluation-data.csv")


## ----summary1-------------------------------------------------------------------------------------------------------------------------------------------------
DT::datatable(
      train_df[1:25,],
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE) 


## ----summary-traindf------------------------------------------------------------------------------------------------------------------------------------------
skim_without_charts(train_df) %>%
  dplyr::select(c(skim_variable, skim_type, n_missing, complete_rate,
                  numeric.p0, numeric.p100, numeric.mean)) %>%
  nice_table(cap='Summary: Training Dataset',cols=c('variable','type','missing',
                                    'complete%','min','max','mean'))


## ----summary-evaldf-------------------------------------------------------------------------------------------------------------------------------------------
skim_without_charts(eval_df) %>%
  dplyr::select(c(skim_variable, skim_type, n_missing, complete_rate,
                  numeric.p0, numeric.p100, numeric.mean)) %>%
  nice_table(cap='Summary: Evaluation Dataset',cols=c('variable','type','missing',
                                      'complete%','min','max','mean'))


## ----prep-----------------------------------------------------------------------------------------------------------------------------------------------------
# Drop INDEX
train_df <- train_df %>% dplyr::select(-INDEX)
 eval_df <- eval_df %>% dplyr::select(-c(INDEX, TARGET_FLAG, TARGET_AMT))

## Remove z_ from character class values
z_vars <- c("MSTATUS","SEX","JOB","CAR_TYPE","URBANICITY","EDUCATION")
for (v in z_vars) {
  train_df <- train_df %>% mutate(!!v := str_replace(get(v),"z_",""))
  eval_df <- eval_df %>% mutate(!!v := str_replace(get(v),"z_",""))
}

# Update RED_CAR, replace [no,yes] values with [No, Yes] values
train_df <- train_df %>% mutate( RED_CAR = ifelse(RED_CAR == "no","No","Yes"))
eval_df <- eval_df %>% mutate( RED_CAR = ifelse(RED_CAR == "no","No","Yes"))

# Instead of level "" will be "Unknown"
train_df$JOB[train_df$JOB==""] <- "Unknown"
eval_df$JOB[eval_df$JOB==""] <- "Unknown"

# Strip "$" and "," from currency columns (parse_number below also handles leftovers)
dollar_vars <- c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM")
for (v in dollar_vars) {
  train_df <- train_df %>% mutate(!!v := str_replace_all(get(v),"[\\$,]",""))
  eval_df <- eval_df %>% mutate(!!v := str_replace_all(get(v),"[\\$,]",""))
}

currency_vars <- c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM")

for (v in currency_vars) {
  train_df <- train_df %>% mutate(!!v := parse_number(get(v)))
  eval_df <- eval_df %>% mutate(!!v := parse_number(get(v)))
}


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
dist_df <- train_df %>% dplyr::select(where(is.numeric)) %>%
  pivot_longer(!c(TARGET_FLAG,TARGET_AMT), 
               names_to = 'variable', 
               values_to = 'value') %>% 
  drop_na()

dist_df %>% ggplot(aes(x=value, group=TARGET_FLAG, fill=TARGET_FLAG)) +
  geom_density(color='#023020') + 
  facet_wrap(~variable, scales = 'free',  ncol = 4) + 
  theme_bw()


## ----discrete_plot--------------------------------------------------------------------------------------------------------------------------------------------
train_df[,c("HOMEKIDS","KIDSDRIV", "CLM_FREQ")] %>%
  gather("variable","value") %>%
  group_by(variable) %>%
  count(value) %>%
  mutate(value = factor(value,levels=5:0)) %>%
  mutate(percent = n*100/8161) %>%
  ggplot(.,
  aes(variable,percent)) +
  geom_bar(stat = "identity", aes(fill = value)) +
  xlab("Variable") +
  ylab("Percentage") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_manual(values = rev(c("#003f5c","#58508d", "#bc5090", "#ff6361", "#ffa600","#CC79a7")))


## ----vars_to_factors------------------------------------------------------------------------------------------------------------------------------------------
factor_vars <- c("PARENT1","CAR_TYPE","JOB","CAR_USE",
                 "URBANICITY","RED_CAR","REVOKED","MSTATUS","EDUCATION","SEX")

for (v in factor_vars) {
  train_df <- train_df %>% mutate(!!v := factor(get(v)))
  eval_df <- eval_df %>% mutate(!!v := factor(get(v)))
}

train_df$TARGET_FLAG <- as.factor(train_df$TARGET_FLAG)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
train_df[,factor_vars] %>%
  gather("variable","value") %>%
  group_by(variable) %>%
  count(value) %>%
  mutate(value = factor(value)) %>%
  mutate(percent = n*100/8161) %>%
  ggplot(.,
  aes(variable,percent)) +
  geom_bar(stat = "identity", aes(fill = value)) +
  xlab("Variable") +
  ylab("Percent") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))


## ----boxplot--------------------------------------------------------------------------------------------------------------------------------------------------
dist_df %>% ggplot(aes(x=value, group=TARGET_FLAG, fill=TARGET_FLAG)) + 
  geom_boxplot(color='#023020') + 
  facet_wrap(~variable, scales = 'free',  ncol = 4) + 
  theme_bw()


## ----scatterplot----------------------------------------------------------------------------------------------------------------------------------------------
dist_df %>% drop_na() %>% ggplot(aes(x=value, y=TARGET_AMT)) + 
  geom_smooth(method='glm', se=TRUE, na.rm=TRUE) +
  geom_point(color='#023020', alpha=0.5) + 
  facet_wrap(~variable, scales = 'free',  ncol = 4) + 
  theme_bw()


## ----corr-----------------------------------------------------------------------------------------------------------------------------------------------------
rcore <- rcorr(as.matrix(train_df %>% dplyr::select(where(is.numeric))))
coeff <- rcore$r
corrplot(coeff, tl.cex = .5, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust", number.cex=0.7, diag=FALSE)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
## Check JOB values where YOJ==0 and INCOME is NA
# train_df %>% filter(YOJ == 0 & is.na(INCOME)) %>% dplyr::select(JOB) %>% unique()

# Manual correction of missing/outlier values
train_df <- train_df %>%
  mutate(CAR_AGE = ifelse(CAR_AGE < 0, 0, CAR_AGE)) %>% 
  mutate(INCOME = ifelse(YOJ == 0 & is.na(INCOME), 0, INCOME))

eval_df <- eval_df %>% 
  mutate(INCOME = ifelse(YOJ == 0 & is.na(INCOME), 0, INCOME))


## ----impute---------------------------------------------------------------------------------------------------------------------------------------------------
# MICE imputation of missing values
imputed_Data <- mice(train_df, m=5, maxit = 20, method = 'pmm', seed = 500, printFlag=FALSE)
train_df <- complete(imputed_Data)

eval_imputed_Data <- mice(eval_df, m=5, maxit = 20, method = 'pmm', seed = 500, printFlag=FALSE)
eval_df <- complete(eval_imputed_Data)


## ----dummy_vars-----------------------------------------------------------------------------------------------------------------------------------------------
## Dummy Variables for factors with two levels
dummy_vars <- function(df){
  df %>%
    mutate(
      MALE = factor(ifelse(SEX == "M", 1, 0)), 
      MARRIED = factor(ifelse(MSTATUS == "Yes", 1, 0)),
      LIC_REVOKED = factor(ifelse(REVOKED == "Yes", 1, 0)),
      CAR_RED = factor(ifelse(RED_CAR == "Yes", 1, 0)),
      PRIVATE_USE = factor(ifelse(CAR_USE == "Private", 1, 0)),
      SINGLE_PARENT = factor(ifelse(PARENT1 == "Yes", 1, 0)),
      URBAN = factor(ifelse(URBANICITY == "Highly Urban/ Urban", 1, 0)),
      KIDSDRIV = factor(ifelse(KIDSDRIV > 0, 1, 0)),
      HOMEKIDS = factor(ifelse(HOMEKIDS > 0, 1, 0))
    ) %>% 
    dplyr::select(-c(SEX, MSTATUS, REVOKED, RED_CAR, CAR_USE, PARENT1, URBANICITY))
      }

train_df <- dummy_vars(train_df)
eval_df <- dummy_vars(eval_df)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# keep original datasets for further lasso models, use transformed df for further transformations
train_transformed <- train_df
eval_transformed <- eval_df


## ----bin_vars-------------------------------------------------------------------------------------------------------------------------------------------------
## bin TIF, CAR_AGE, HOME_VAL
# quartiles give a starting point, but we fix the CAR_AGE breaks manually
q <- c(-1,  1,  8, 12, 28)
train_transformed <- train_transformed %>% 
  mutate(CAR_AGE_BIN = cut(CAR_AGE, breaks=q, labels=FALSE))

q <- quantile(train_transformed$HOME_VAL)
q[1] <- -1
train_transformed <- train_transformed%>% 
  mutate(HOME_VAL_BIN = cut(HOME_VAL, breaks=q, labels=FALSE))

q <- quantile(train_transformed$TIF) 
q[1] <- -1
train_transformed <- train_transformed %>% 
  mutate(TIF_BIN = cut(TIF, breaks=q, labels=FALSE))

q <- c(-1,  1,  8, 12, 28)   # same fixed CAR_AGE breaks as for training
eval_transformed <- eval_transformed %>% 
  mutate(CAR_AGE_BIN = cut(CAR_AGE, breaks=q, labels=FALSE))

q <- quantile(eval_transformed$HOME_VAL)
q[1] <- -1
eval_transformed <- eval_transformed%>% 
  mutate(HOME_VAL_BIN = cut(HOME_VAL, breaks=q, labels=FALSE))

q <- quantile(eval_transformed$TIF) 
q[1] <- -1
eval_transformed <- eval_transformed %>% 
  mutate(TIF_BIN = cut(TIF, breaks=q, labels=FALSE))


## ----boxcox, fig.keep="none"----------------------------------------------------------------------------------------------------------------------------------
# use Box-Cox on BLUEBOOK, TRAVTIME, TIF
bluebook_boxcox <- boxcox(lm(train_transformed$BLUEBOOK ~ 1))
bluebook_lambda <- bluebook_boxcox$x[which.max(bluebook_boxcox$y)]
bluebook_trans <- BoxCox(train_transformed$BLUEBOOK, bluebook_lambda)
train_transformed$BLUEBOOK <- bluebook_trans

bluebook_boxcox <- boxcox(lm(eval_transformed$BLUEBOOK ~ 1))
bluebook_lambda <- bluebook_boxcox$x[which.max(bluebook_boxcox$y)]
bluebook_trans <- BoxCox(eval_transformed$BLUEBOOK, bluebook_lambda)
eval_transformed$BLUEBOOK <- bluebook_trans

travtime_boxcox <- boxcox(lm(train_transformed$TRAVTIME ~ 1))
travtime_lambda <- travtime_boxcox$x[which.max(travtime_boxcox$y)]
travtime_trans <- BoxCox(train_transformed$TRAVTIME, travtime_lambda)
train_transformed$TRAVTIME <- travtime_trans

travtime_boxcox <- boxcox(lm(eval_transformed$TRAVTIME ~ 1))
travtime_lambda <- travtime_boxcox$x[which.max(travtime_boxcox$y)]
travtime_trans <- BoxCox(eval_transformed$TRAVTIME, travtime_lambda)
eval_transformed$TRAVTIME <- travtime_trans

tif_boxcox <- boxcox(lm(train_transformed$TIF ~ 1))
tif_lambda <- tif_boxcox$x[which.max(tif_boxcox$y)]
tif_trans <- BoxCox(train_transformed$TIF, tif_lambda)
train_transformed$TIF <- tif_trans

tif_boxcox <- boxcox(lm(eval_transformed$TIF ~ 1))
tif_lambda <- tif_boxcox$x[which.max(tif_boxcox$y)]
tif_trans <- BoxCox(eval_transformed$TIF, tif_lambda)
eval_transformed$TIF <- tif_trans

## Use basic log transformation on TARGET_AMT, OLDCLAIM
train_transformed <- train_transformed %>% 
  mutate(TARGET_AMT = case_when(TARGET_FLAG==1 ~ log(TARGET_AMT) , TARGET_FLAG==0 ~ TARGET_AMT))
train_transformed$OLDCLAIM <- log(train_transformed$OLDCLAIM + 1)
train_transformed$INCOME <- log(train_transformed$INCOME + 1)

eval_transformed$OLDCLAIM <- log(eval_transformed$OLDCLAIM + 1)
eval_transformed$INCOME <- log(eval_transformed$INCOME + 1)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
skim_without_charts(train_transformed) %>%
  dplyr::select(c(skim_variable, skim_type, n_missing, complete_rate,
                  numeric.p0, numeric.p100, numeric.mean)) %>%
  nice_table(cap='Summary: Transformed Training Dataset',cols=c('variable','type','missing',
                                    'complete%','min','max','mean'))


## ----box_hists------------------------------------------------------------------------------------------------------------------------------------------------
## Create temporary CAR_CRASH variable from the target_flag for visualization
vis_df <- train_transformed %>% 
  mutate(CAR_CRASH = ifelse(TARGET_FLAG == 1, 1, 0)) %>% 
  dplyr::select(where(is.numeric)) %>%
  dplyr::select(!c(CAR_AGE_BIN, HOME_VAL_BIN, TIF_BIN)) %>% # exclude binned
  pivot_longer(!c(CAR_CRASH,TARGET_AMT), 
               names_to = 'variable', 
               values_to = 'value') %>% 
  drop_na()

vis_df %>% ggplot(aes(x=value, group=CAR_CRASH, fill=factor(CAR_CRASH))) +
  geom_density(color='#023020', alpha=0.6) +
  facet_wrap(~variable, scales = 'free',  ncol = 4) +
  theme_bw() +
  labs(fill='CAR_CRASH')


## ----split----------------------------------------------------------------------------------------------------------------------------------------------------
set.seed(1233)

# For the lasso logistic model: original data with NAs removed and type fixes
# applied (characters to factors and vice versa)
log_original <- createDataPartition(train_df$TARGET_FLAG, p=0.75, list = FALSE)

train_data <- train_df[log_original, ] %>% dplyr::select(-c(TARGET_AMT))
valid_data <- train_df[-log_original, ] %>% dplyr::select(-c(TARGET_AMT))

# For the other logistic models: transformed data, keeping the binned versions
# and dropping the raw TIF, CAR_AGE, HOME_VAL
log_transformed <- createDataPartition(train_transformed$TARGET_FLAG, p=0.75, list = FALSE)

log_train <- train_transformed[log_transformed,] %>% 
  dplyr::select(-c(TARGET_AMT, TIF, CAR_AGE, HOME_VAL))

log_test <- train_transformed[-log_transformed,] %>% 
  dplyr::select(-c(TARGET_AMT, TIF, CAR_AGE, HOME_VAL))

# For the lasso linear model: original data (NAs removed, type fixes),
# restricted to crash records (TARGET_FLAG == 1)
original_yes_crash <- train_df %>% 
  filter(TARGET_FLAG==1)

transformed_yes_crash <- train_transformed %>% 
  filter(TARGET_FLAG==1) 

linear_original <- createDataPartition(original_yes_crash$TARGET_AMT, p=0.75, list = FALSE)

linear_train_data <- original_yes_crash[linear_original, ] %>% 
  dplyr::select(-c(TARGET_FLAG))

linear_valid_data <- original_yes_crash[-linear_original, ] %>% 
  dplyr::select(-c(TARGET_FLAG))

# For the other linear models: transformed data, keeping the binned versions
# and dropping the raw TIF, CAR_AGE, HOME_VAL
linear_transformed <- createDataPartition(transformed_yes_crash$TARGET_AMT, p=0.75, list = FALSE)

amt_train <- transformed_yes_crash[linear_transformed, ] %>% 
  dplyr::select(-c(TARGET_FLAG, TIF, CAR_AGE, HOME_VAL))

amt_test <- transformed_yes_crash[-linear_transformed, ] %>% 
  dplyr::select(-c(TARGET_FLAG, TIF, CAR_AGE, HOME_VAL))

eval_transformed <- eval_transformed %>% 
  dplyr::select(-c(TIF, CAR_AGE, HOME_VAL))
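# createDataPartition() samples within groups of the outcome, so the crash
# rate should be roughly preserved in both partitions; a quick illustrative
# check (not part of the original pipeline):
prop.table(table(train_df$TARGET_FLAG[log_original]))
prop.table(table(train_df$TARGET_FLAG[-log_original]))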


## ----log_model1, warning=FALSE, message=FALSE-----------------------------------------------------------------------------------------------------------------
log_model_1 <- glm(TARGET_FLAG ~ ., data = train_data, family = binomial(link = "logit"))


## ----log_model1_aic, warning=FALSE, message=FALSE-------------------------------------------------------------------------------------------------------------
log_model_1_aic <- log_model_1 %>% stepAIC(trace = FALSE)
summ(log_model_1_aic)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(summ(log_model_1_aic)$coeftable) %>%
  dplyr::select(Est.,p) %>%
  filter(p < 0.06 & Est. > 0.2) %>%
  arrange(desc(Est.)) %>%
  nice_table(cap='Positive Coefficients', cols=c('Est','p'))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(summ(log_model_1_aic)$coeftable) %>%
  dplyr::select(Est.,p) %>%
  filter(p < 0.06 & Est. < -0.2) %>%
  arrange(Est.) %>%
  nice_table(cap='Negative Coefficients', cols=c('Est','p'))


## ----log_model_1_confusionMatrix, fig.height=4----------------------------------------------------------------------------------------------------------------
log_testing <- valid_data

log1_pred <- predict(log_model_1_aic, newdata = log_testing, type = "response")

log_testing$prob_model_1 <- log1_pred
log_testing$pred_model_1 <- ifelse(log1_pred >= 0.5, 1, 0)

# Confusion Matrix
log_matrix_1 <- confusionMatrix(factor(log_testing$pred_model_1), 
                                factor(log_testing$TARGET_FLAG), "1")

log_matrix_1

# Table
results_log <- tibble(
  Model = "Model #1: Original data", 
  Accuracy=log_matrix_1$overall["Accuracy"],
  "Classification error rate" = 1 - log_matrix_1$overall["Accuracy"],
  F1 = log_matrix_1$byClass[7],
  Deviance= log_model_1_aic$deviance,
  R2 = 1 - log_model_1_aic$deviance / log_model_1_aic$null.deviance,
  Sensitivity = log_matrix_1$byClass["Sensitivity"],
  Specificity = log_matrix_1$byClass["Specificity"],
  Precision = precision(factor(log_testing$pred_model_1), 
                        factor(log_testing$TARGET_FLAG), "1"),
  AUC = auc(log_testing$TARGET_FLAG, log_testing$prob_model_1),
  AIC= log_model_1_aic$aic
  )

# ROC Curve (built from predicted probabilities rather than hard 0/1 classes)
plot(roc(log_testing$TARGET_FLAG, 
         log_testing$prob_model_1, 
         plot = FALSE, print.auc = TRUE), 
     quiet=TRUE, main="Model 1 - ROC Curve")


## ----check_log1-----------------------------------------------------------------------------------------------------------------------------------------------
par(mfrow=c(2,2))
plot(log_model_1_aic)


## ----model_1_residuals, fig.height=4--------------------------------------------------------------------------------------------------------------------------
log_model_1_aic.data <- augment(log_model_1_aic) %>% 
  mutate(index = 1:n())

ggplot(log_model_1_aic.data, aes(index, .std.resid)) + 
  geom_point(aes(color = TARGET_FLAG), alpha = .5) +
  theme_bw()


## ----warning=FALSE, message=FALSE-----------------------------------------------------------------------------------------------------------------------------
car::mmps(log_model_1_aic, span = 3/4, layout = c(2, 2))


## ----log_model_1_vif------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(car::vif(log_model_1_aic)) %>% 
  arrange(desc(GVIF)) %>%
  nice_table(cap='Variance Inflation Factor', cols=c('GVIF','df','adj'))
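# For multi-df terms the usual rule of thumb applies to GVIF^(1/(2*Df)):
# values above sqrt(5), about 2.24, correspond to the VIF > 5 threshold.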


## ----general_box, warning=FALSE, message=FALSE----------------------------------------------------------------------------------------------------------------
log_model_2 <- glm(TARGET_FLAG ~ ., data = log_train, family = binomial(link = "logit"))


## ----log_model2_aic, warning=FALSE, message=FALSE-------------------------------------------------------------------------------------------------------------
log_model_2_aic <- log_model_2 %>% stepAIC(trace = FALSE)
summ(log_model_2_aic)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(summ(log_model_2_aic)$coeftable) %>%
  dplyr::select(Est.,p) %>%
  filter(p < 0.06 & Est. > 0.2) %>%
  arrange(desc(Est.)) %>%
  nice_table(cap='Positive Coefficients')


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(summ(log_model_2_aic)$coeftable) %>%
  dplyr::select(Est.,p) %>%
  filter(p < 0.06 & Est. < -0.2) %>%
  arrange(Est.) %>%
  nice_table(cap='Negative Coefficients')


## ----log_model_2_confusionMatrix, fig.height=4----------------------------------------------------------------------------------------------------------------
testing2 <- log_test

log2_pred <- predict(log_model_2_aic, newdata = testing2, type = "response")

testing2$prob_model_2  <- log2_pred
testing2$pred_model_2 <- ifelse(log2_pred >= 0.5, 1, 0)

log_matrix_2 <- confusionMatrix(factor(testing2$pred_model_2), factor(testing2$TARGET_FLAG), "1")

results_log <- rbind(results_log, tibble(
                  Model = "Model #2: Transformed data",
                  Accuracy = log_matrix_2$overall["Accuracy"], 
                  "Classification error rate" = 1 - log_matrix_2$overall["Accuracy"],
                  F1 = log_matrix_2$byClass[7],
                  Deviance = log_model_2_aic$deviance, 
                  R2 = 1 - log_model_2_aic$deviance / log_model_2_aic$null.deviance,
                  Sensitivity = log_matrix_2$byClass["Sensitivity"],
                  Specificity = log_matrix_2$byClass["Specificity"],
                  Precision = precision(factor(testing2$pred_model_2), factor(testing2$TARGET_FLAG), "1"),
                  AUC = auc(testing2$TARGET_FLAG, testing2$prob_model_2),
                  AIC = log_model_2_aic$aic))

log_matrix_2

plot(roc(testing2$TARGET_FLAG, testing2$prob_model_2, plot = FALSE, print.auc = TRUE), 
     quiet=TRUE, main="Model 2 - ROC Curve")


## ----check_log2-----------------------------------------------------------------------------------------------------------------------------------------------
par(mfrow=c(2,2))
plot(log_model_2_aic)


## ----model_2_residuals, fig.height=4--------------------------------------------------------------------------------------------------------------------------
log_model_2_aic.data <- augment(log_model_2_aic) %>% 
  mutate(index = 1:n())

ggplot(log_model_2_aic.data, aes(index, .std.resid)) + 
  geom_point(aes(color = TARGET_FLAG), alpha = .5) +
  theme_bw()


## ----warning=FALSE, message=FALSE-----------------------------------------------------------------------------------------------------------------------------
car::mmps(log_model_2_aic, span = 3/4, layout = c(2, 2))


## ----log_model_2_vif------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(car::vif(log_model_2_aic)) %>% 
  arrange(desc(GVIF)) %>%
  nice_table(cap='Variance Inflation Factor', cols=c('GVIF','df','adj'))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
t_df <- train_data

# build X matrix and Y vector
X <- model.matrix(TARGET_FLAG ~ ., data=t_df)[,-1]
Y <- t_df[,"TARGET_FLAG"] 


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
lasso.log.model <- cv.glmnet(
  x=X, y=Y,
  family = "binomial",  # the logit link is implied for the binomial family
  standardize = TRUE,
  nfolds = 5,
  alpha=1)              # alpha=1 is lasso

lasso.log.model

l.min <- lasso.log.model$lambda.min
l.1se <- lasso.log.model$lambda.1se
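# Illustrative check (not part of the original pipeline): list the variables
# the lasso shrank exactly to zero at lambda.min.
cf <- as.matrix(coef(lasso.log.model, s = "lambda.min"))
rownames(cf)[cf[, 1] == 0]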


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# validation data (TARGET_AMT was already dropped when valid_data was built)
t0_df <- valid_data 

# build X matrix and Y vector
X_test <- model.matrix(TARGET_FLAG ~ ., data=t0_df)[,-1]
Y_test <- t0_df[,"TARGET_FLAG"] 

# predict using coefficients at lambda.min
lassoPred <- predict(lasso.log.model, newx = X_test, type = "response", s = 'lambda.min')

pred_log_df <- valid_data %>% drop_na()
pred_log_df$TARGET_FLAG_PROB <- lassoPred[,1]
pred_log_df$TARGET_FLAG_PRED <- ifelse(lassoPred > 0.5, 1, 0)[,1]


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
confusion.glmnet(lasso.log.model, newx = X_test, newy = Y_test, s = 'lambda.min')


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
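# in-sample fit statistics, computed on the training design matrix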
assess.glmnet(lasso.log.model,           
                newx = X,              
                newy = Y,
                family = "binomial",
                s = 'lambda.min'
              )    

print(glmnet_cv_aicc(lasso.log.model, 'lambda.min'))
print(glmnet_cv_aicc(lasso.log.model, 'lambda.1se'))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------

conf_matrix_3 = confusionMatrix(factor(pred_log_df$TARGET_FLAG_PRED),factor(pred_log_df$TARGET_FLAG), "1")

# build X matrix and Y vector
t0_df <- valid_data %>% drop_na()
X_test <- model.matrix(TARGET_FLAG ~ ., data=t0_df)[,-1]
Y_test <- t0_df[,"TARGET_FLAG"] 

lasso.r1 <- assess.glmnet(lasso.log.model,           
                                newx = X_test,              
                                newy = Y_test,
                                family = "binomial",
                                s = 'lambda.min'
                          )   
lasso.r2 <- glmnet_cv_aicc(lasso.log.model, 'lambda.min')

results_log <- rbind(results_log,tibble(Model = "Model #3: Lasso", Accuracy=conf_matrix_3$overall[1], 
                  "Classification error rate" = 1 - conf_matrix_3$overall[1],
                  F1 = conf_matrix_3$byClass[7],
                  Deviance=lasso.r1$deviance[[1]], 
                  R2 = NA,
                  Sensitivity = conf_matrix_3$byClass["Sensitivity"],
                  Specificity = conf_matrix_3$byClass["Specificity"],
                  Precision = conf_matrix_3$byClass["Precision"],
                  AUC=lasso.r1$auc[1], 
                  AIC= lasso.r2$AICc))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
par(mfrow=c(2,2))

plot(lasso.log.model)
plot(lasso.log.model$glmnet.fit, xvar="lambda", label=TRUE)
plot(lasso.log.model$glmnet.fit, xvar='dev', label=TRUE)

rocs <- roc.glmnet(lasso.log.model, newx = X, newy = Y )
plot(rocs,type="l")  


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(as.matrix(coef(lasso.log.model, s = "lambda.min"))) %>%
  arrange(desc(s1)) %>%
  nice_table(cap='Model Coefficients', cols=c('Est'))


## ---- fig.height=4--------------------------------------------------------------------------------------------------------------------------------------------
vip(lasso.log.model, num_features = 20, geom = "col", include_type = TRUE, lambda = "lambda.min")

coeffs.table <- coeff2dt(fitobject = lasso.log.model, s = "lambda.min")

coeffs.table %>% mutate(name = fct_reorder(name, desc(coefficient))) %>%
ggplot() +
  geom_col(aes(y = name, x = coefficient, fill = {coefficient > 0})) +
  xlab(label = "") +
  ggtitle(expression(paste("Lasso Coefficients with ", lambda, " = 0.0275"))) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5),legend.position = "none")


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
results_log %>% 
  nice_table(cap='Logistic Model Comparison') %>% 
  scroll_box(width='100%')


## ----model_1--------------------------------------------------------------------------------------------------------------------------------------------------
df_train <- linear_train_data
df_test <- linear_valid_data

# build model
model_1 <- lm(TARGET_AMT ~ ., data = df_train)


## ----model_1_aic----------------------------------------------------------------------------------------------------------------------------------------------
# select features and refit by stepwise selection (AIC)
model_1_aic <- model_1 %>% stepAIC(trace = FALSE)
summ(model_1_aic, digits = getOption("jtools-digits", 4))


## ---- fig.height=4--------------------------------------------------------------------------------------------------------------------------------------------
# validate and calculate RMSE
model_1_aic.valid <- predict(model_1_aic, newdata = df_test)
model_1_aic.eval <- bind_cols(target = df_test$TARGET_AMT, predicted=unname(model_1_aic.valid))
model_1_aic.rmse <- sqrt(mean((model_1_aic.eval$target - model_1_aic.eval$predicted)^2)) 

# plot targets vs predicted
model_1_aic.eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE) +
  labs(title=paste('RMSE:',round(model_1_aic.rmse,1)))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
multi_metric <- metric_set(mape, smape, mase, mpe, yardstick::rmse)
model1_df <- model_1_aic.eval %>% multi_metric(truth=target, estimate=predicted)
a <- summary(model_1_aic)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
results_lm_tbl <- tibble(
                      Model = character(),
                      mape = numeric(), 
                      smape = numeric(), 
                      mase = numeric(), 
                      mpe = numeric(), 
                      "RMSE" = numeric(),
                      AIC = numeric(),
                      "Adjusted R2" = numeric(),
                      "F-statistic" = numeric()
                )

results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Model 1: Original data",
                      mape = model1_df[[1,3]],
                      smape = model1_df[[2,3]],
                      mase = model1_df[[3,3]],
                      mpe = model1_df[[4,3]],
                      "RMSE" = model1_df[[5,3]],
                      AIC = AIC(model_1_aic),
                      "Adjusted R2" = a$adj.r.squared,
                      "F-statistic" = a$fstatistic[1]
                     ))


## ----check_lm1------------------------------------------------------------------------------------------------------------------------------------------------
check_model(model_1_aic, check=c('ncv','qq','homogeneity','outliers'))


## ----vif_lm1, fig.height=4------------------------------------------------------------------------------------------------------------------------------------
# model 1 includes factor terms, so vif() reports generalized VIFs (GVIF)
vif_values <- vif(model_1_aic)
vif_values <- rownames_to_column(as.data.frame(vif_values), var = "var")

vif_values %>%
  ggplot(aes(y=GVIF, x=var)) +
  coord_flip() + 
  geom_hline(yintercept=5, linetype="dashed", color = "red") +
  geom_bar(stat = 'identity', width=0.3, position=position_dodge()) 


## ----model_2--------------------------------------------------------------------------------------------------------------------------------------------------
# build model
model_2 <- lm(TARGET_AMT ~ ., data = amt_train)


## ----model_2_aic----------------------------------------------------------------------------------------------------------------------------------------------
# select features and refit by stepwise selection (AIC)
model_2_aic <- model_2 %>% stepAIC(trace = FALSE)
summ(model_2_aic, digits = getOption("jtools-digits", 4))


## ----fig.height=4---------------------------------------------------------------------------------------------------------------------------------------------
# validate and calculate RMSE
model_2_aic.valid <- predict(model_2_aic, newdata = amt_test)
model_2_aic.eval <- bind_cols(target = amt_test$TARGET_AMT, predicted=model_2_aic.valid)
model_2_aic.rmse <- sqrt(mean((model_2_aic.eval$target - model_2_aic.eval$predicted)^2)) 

# plot targets vs predicted
model_2_aic.eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE) +
  labs(title=paste('RMSE:',round(model_2_aic.rmse,1)))
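# Note: model 2 is fit on log(TARGET_AMT), so this RMSE is on the log scale
# and is not directly comparable to model 1's dollar-scale RMSE.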


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
model2_df <-model_2_aic.eval %>% multi_metric(truth=target, estimate=predicted)

b <- summary(model_2_aic)

results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Model 2: Transforms + Stepwise",
                      mape = model2_df[[1,3]],
                      smape = model2_df[[2,3]],
                      mase = model2_df[[3,3]],
                      mpe = model2_df[[4,3]],
                      "RMSE" = model2_df[[5,3]],
                      AIC = AIC(model_2_aic),
                      "Adjusted R2" = b$adj.r.squared,
                      "F-statistic" = b$fstatistic[1]
                     ))


## ----check_lm2------------------------------------------------------------------------------------------------------------------------------------------------
check_model(model_2_aic, check=c('ncv','qq','homogeneity','outliers'))


## ----vif_lm2, fig.height=4------------------------------------------------------------------------------------------------------------------------------------

vif_values <- vif(model_2_aic)
vif_values <- rownames_to_column(as.data.frame(vif_values), var = "var")

vif_values %>%
  ggplot(aes(y=GVIF, x=var)) +
  coord_flip() + 
  geom_hline(yintercept=5, linetype="dashed", color = "red") +
  geom_bar(stat = 'identity', width=0.3 ,position=position_dodge()) 


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# crash-only training data for the severity (TARGET_AMT) model
t_df <- linear_train_data 

# build X matrix and Y vector
X <- model.matrix(TARGET_AMT ~ ., data=t_df)[,-1]
Y <- t_df[,"TARGET_AMT"] 


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
lasso.lm.model <- cv.glmnet(
  x=X, y=Y,             # TARGET_AMT on the original dollar scale
  family = "gaussian",
  type.measure="mse",
  standardize = TRUE,   # standardize predictors
  nfolds = 10,
  alpha=1)              # alpha=1 is lasso

lasso.lm.model

l.min <- lasso.lm.model$lambda.min
l.1se <- lasso.lm.model$lambda.1se
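# lambda.min minimizes cross-validated MSE; lambda.1se is the largest lambda
# within one standard error of that minimum and gives a sparser model.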


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
par(mfrow=c(2,2))

plot(lasso.lm.model)
plot(lasso.lm.model$glmnet.fit, xvar="lambda", label=TRUE)
plot(lasso.lm.model$glmnet.fit, xvar='dev', label=TRUE)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
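# in-sample fit statistics on the training design matrix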
assess.glmnet(lasso.lm.model,           
              newx = X,              
              newy = Y,
              s = 'lambda.min',
              family = 'gaussian')    

print(glmnet_cv_aicc(lasso.lm.model, 'lambda.min'))
print(glmnet_cv_aicc(lasso.lm.model, 'lambda.1se'))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(as.matrix(coef(lasso.lm.model, s = "lambda.min"))) %>%
  arrange(desc(s1)) %>%
  nice_table(cap='Model Coefficients', cols='Est')


## ----fig.height=3---------------------------------------------------------------------------------------------------------------------------------------------
vip(lasso.lm.model, num_features = 20, geom = "col", include_type = TRUE, lambda = "lambda.min")

coeffs.table <- coeff2dt(fitobject = lasso.lm.model, s = "lambda.min")

coeffs.table %>% mutate(name = fct_reorder(name, desc(coefficient))) %>%
ggplot() +
  geom_col(aes(y = name, x = coefficient, fill = {coefficient > 0})) +
  xlab(label = "") +
  ggtitle(expression(paste("Lasso Coefficients with ", lambda, " = 0.0275"))) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5),legend.position = "none")


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
t0_df <- linear_valid_data 
X_test <- model.matrix(TARGET_AMT ~ ., data=t0_df)[,-1]
Y_test <- t0_df[,"TARGET_AMT"] 

lassoPred <- predict(
  lasso.lm.model, 
  newx = X_test,
  type = "response",
  s = 'lambda.min')

pred_ln_df <- linear_valid_data
pred_ln_df$TARGET_AMT_PRED_RAW <- lassoPred[,1]
pred_ln_df$TARGET_AMT_PRED <- ifelse(lassoPred > 100, lassoPred, 0)[,1]
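# Predicted amounts below $100 are floored to zero on the assumption that
# such small values effectively indicate no claim.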


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
pred_ln_df %>% multi_metric(truth=TARGET_AMT, estimate=TARGET_AMT_PRED)


## ----fig.height=3---------------------------------------------------------------------------------------------------------------------------------------------
n <- nrow(t0_df)
x <- t0_df$TARGET_AMT
e <- lassoPred - t0_df$TARGET_AMT

plot(x, e,  
     xlab = "cost", 
     ylab = "residuals", 
     bg = "steelblue", 
     col = "darkgray", cex = 1.5, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n) 
  lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 1)

# comparative histogram
t0_df <- pred_ln_df %>% dplyr::select(TARGET_AMT,TARGET_AMT_PRED) %>% pivot_longer(c(TARGET_AMT,TARGET_AMT_PRED),
                                 names_to='variable' , values_to = 'value')

t0_df %>% dplyr::filter(value <= 1000) %>% 
  ggplot(aes(x=value, fill=variable)) +geom_histogram(bins=5) 

t0_df %>% dplyr::filter(value > 1000) %>% 
  ggplot(aes(x=value, fill=variable)) +geom_histogram(bins=100) 


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# build X matrix and Y vector
t0_df <- linear_valid_data 
X_test <- model.matrix(TARGET_AMT ~ ., data=t0_df)[,-1]
Y_test <- t0_df[,"TARGET_AMT"] 

lasso.r1 <- assess.glmnet(lasso.lm.model,
                          newx = X_test,
                          newy = Y_test,
                          s = 'lambda.min') # match the lambda used for prediction
lasso.r2 <- glmnet_cv_aicc(lasso.lm.model, 'lambda.min')


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
m_df <- pred_ln_df %>% multi_metric(truth=TARGET_AMT, estimate=TARGET_AMT_PRED)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Model 3: Lasso",
                      mape = m_df[[1,3]],
                      smape = m_df[[2,3]],
                      mase = m_df[[3,3]],
                      mpe = m_df[[4,3]],
                      "RMSE" = m_df[[5,3]],
                      AIC = lasso.r2$AICc,
                      "Adjusted R2" = NA,
                      "F-statistic" = NA
                     ))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# raw prediction error in dollars (predicted minus actual)
pred_ln_df$PRED_ERR <- pred_ln_df$TARGET_AMT_PRED - pred_ln_df$TARGET_AMT


## ----fig.height=4---------------------------------------------------------------------------------------------------------------------------------------------
pred_ln_df %>% ggplot(aes(x=TARGET_AMT, colour=TARGET_AMT_PRED, y=PRED_ERR)) +
  geom_point()


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
results_lm_tbl %>% 
  nice_table(cap='Linear Model Comparison') 


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# build X matrix for the evaluation set; add a placeholder TARGET_FLAG so
# the model.matrix() formula matches the one used in training
eval_df$TARGET_FLAG <- 0
X_eval <- model.matrix(TARGET_FLAG ~ ., data=eval_df)[,-1]


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# predict using coefficients at lambda.min
lassoPred <- predict(lasso.log.model, newx = X_eval, type = "response", s = 'lambda.min')


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
eval_pred_df <- eval_df
eval_pred_df$TARGET_FLAG_PROB <- lassoPred[,1]
eval_pred_df$TARGET_FLAG_PRED <- ifelse(lassoPred > 0.5, 1, 0)[,1]
eval_pred_df <- subset(eval_pred_df, select = -c(TARGET_FLAG))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
eval_transformed$TARGET_FLAG_PRED <- eval_pred_df$TARGET_FLAG_PRED
eval_transformed$TARGET_FLAG_PROB <- round(eval_pred_df$TARGET_FLAG_PROB, digits=4)

# model_2_aic predicts ln(TARGET_AMT); exponentiate to recover dollars, and
# assign zero cost to customers not predicted to crash
eval_transformed <- eval_transformed %>% 
  mutate(TARGET_AMT_PRED = case_when(
    TARGET_FLAG_PRED==1 ~ exp(predict(object = model_2_aic, newdata=eval_transformed)),
    TARGET_FLAG_PRED==0 ~ 0))

eval_transformed$TARGET_AMT_PRED <- round(eval_transformed$TARGET_AMT_PRED, digits=4)

# reorder predictions to front
eval_transformed <- eval_transformed %>% 
  relocate(TARGET_FLAG_PRED, TARGET_AMT_PRED, TARGET_FLAG_PROB)
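# The final output follows the two-stage structure of the problem: the lasso
# logistic model flags predicted crashes, and the back-transformed linear
# model supplies a cost only for those records; all others get zero.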


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
predictions <- eval_transformed %>% 
  dplyr::select(c(TARGET_FLAG_PRED, TARGET_AMT_PRED, TARGET_FLAG_PROB))


## ----eval=FALSE-----------------------------------------------------------------------------------------------------------------------------------------------
## write.csv(eval_transformed, 'eval_predictions.csv', row.names=F)
## write.csv(predictions, 'predictions_only.csv', row.names=F)


## ----results_table--------------------------------------------------------------------------------------------------------------------------------------------
DT::datatable(
      eval_transformed,
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE)