
Overview

In this homework assignment, we will explore, analyze and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company. We will build multiple linear regression models on the continuous variable TARGET_AMT and binary logistic regression models on TARGET_FLAG, using the training data to predict the probability that a person will crash their car and the amount it will cost if they do.

We are going to build several models using different techniques and variable selection methods. In order to best assess our predictive models, we created a test set within our training data by splitting it into an 80/20 training/testing proportion, before applying the finalized models to a separate evaluation dataset that does not contain the target variables.

1. Data Exploration

The insurance training dataset contains 8,161 observations of 26 variables; each record represents a customer at an auto insurance company. The evaluation dataset contains 2,141 observations of 26 variables. The descriptions of each column are below.

Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

1.1 Summary Statistics

The training data can be previewed below. The TARGET_FLAG column is the binary dependent variable denoting whether a car was in a crash (target = 1) or not (target = 0). TARGET_AMT is a numeric dependent variable representing the cost of the crash (the claim amount). The minimum is 0 (the car wasn’t in a crash, so there was no cost); the maximum is 107,586.1.

The table below provides valuable descriptive statistics about the training data. In the raw file, 14 variables are read as categorical (character) and 12 as numeric. There is no missing data in the categorical variables, while among the numeric variables YOJ (years on job) has about 5% missing values, CAR_AGE (vehicle age) about 6%, INCOME and HOME_VAL about 5–6% each, and AGE (age of driver) less than 1%. Most of the numeric variables have a minimum of zero. Some values look suspect and will need to be addressed later; for example, CAR_AGE has a minimum value of -3. Some variables are stored as character although they should be numeric, and vice versa. OLDCLAIM, BLUEBOOK, HOME_VAL and INCOME have a $ sign in front of the number, so we will remove the sign and convert these variables to numeric. The values of MSTATUS (marital status), EDUCATION, JOB, CAR_TYPE, SEX and URBANICITY contain a z_ prefix that should be removed as well.
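The summary below was produced with the skimr package. A minimal sketch of the call, assuming the training data has already been loaded into train_df, is:

```r
library(skimr)

# High-level summary: variable types, missingness, and basic statistics
skim(train_df)
```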

Data summary
Name train_df
Number of rows 8161
Number of columns 25
_______________________
Column type frequency:
character 10
numeric 15
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
PARENT1 0 1 2 3 0 2 0
MSTATUS 0 1 2 3 0 2 0
SEX 0 1 1 1 0 2 0
EDUCATION 0 1 3 12 0 5 0
JOB 0 1 6 12 0 9 0
CAR_USE 0 1 7 10 0 2 0
CAR_TYPE 0 1 3 11 0 6 0
RED_CAR 0 1 2 3 0 2 0
REVOKED 0 1 2 3 0 2 0
URBANICITY 0 1 19 19 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
TARGET_FLAG 0 1.00 0.26 0.44 0 0 0 1 1.0
TARGET_AMT 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1
KIDSDRIV 0 1.00 0.17 0.51 0 0 0 0 4.0
AGE 6 1.00 44.79 8.63 16 39 45 51 81.0
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5.0
YOJ 454 0.94 10.50 4.09 0 9 11 13 23.0
INCOME 445 0.95 61898.09 47572.68 0 28097 54028 85986 367030.0
HOME_VAL 464 0.94 154867.29 129123.77 0 0 161160 238724 885282.0
TRAVTIME 0 1.00 33.49 15.91 5 22 33 44 142.0
BLUEBOOK 0 1.00 15709.90 8419.73 1500 9280 14440 20850 69740.0
TIF 0 1.00 5.35 4.15 1 1 4 7 25.0
OLDCLAIM 0 1.00 4037.08 8777.14 0 0 0 4636 57037.0
CLM_FREQ 0 1.00 0.80 1.16 0 0 0 2 5.0
MVR_PTS 0 1.00 1.70 2.15 0 0 1 3 13.0
CAR_AGE 510 0.94 8.33 5.70 -3 1 8 12 28.0

1.2 Distributions

Before building a model, we need to check how the classes are represented in our TARGET_FLAG variable. Class 1 accounts for about 26% of observations and class 0 for about 74%. As a result, we have an unbalanced class distribution for our target variable, so we will need to take additional steps (class weights, bootstrapping, etc.) before using logistic regression.
Distribution of TARGET_FLAG

Value  Proportion
0      0.74
1      0.26

The distribution of the second target variable, TARGET_AMT, is right-skewed; we will transform this variable (log/Box-Cox) to bring it closer to a normal distribution.

1.3 Box Plots

1.4 Scatter Plot


1.5 Correlation Matrix

2. Data preparation

2.1 Data types

First, we will remove the z_ and $ prefixes, together with the INDEX variable (an identification variable).

We transform TARGET_FLAG, CAR_TYPE, CAR_USE, EDUCATION, JOB, MSTATUS, PARENT1, RED_CAR, REVOKED, SEX and URBANICITY to factors.

We transform INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM to numeric.
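A minimal sketch of these cleanup steps, assuming the raw data is in train_df and using dplyr/stringr/forcats (the actual code may differ), is shown below.

```r
library(dplyr)
library(stringr)

train_df <- train_df %>%
  select(-INDEX) %>%                                       # drop the identification variable
  mutate(
    # strip the "z_" prefix from character values (e.g. "z_Blue Collar")
    across(where(is.character), ~ str_remove(.x, "^z_")),
    # remove "$" and "," and convert the money fields to numeric
    across(c(INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM),
           ~ as.numeric(str_remove_all(.x, "[$,]"))),
    # harmonize RED_CAR to "Yes"/"No"
    RED_CAR = str_to_title(RED_CAR),
    # convert the categorical fields (and the binary target) to factors
    across(c(TARGET_FLAG, CAR_TYPE, CAR_USE, EDUCATION, JOB, MSTATUS,
             PARENT1, RED_CAR, REVOKED, SEX, URBANICITY), as.factor),
    # recode the empty JOB level as "Unknown"
    JOB = forcats::fct_recode(JOB, Unknown = "")
  )
```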

As we see below, there are no unwanted characters left in the factor variables. The JOB variable contained an empty level "", which was replaced with "Unknown". The RED_CAR levels are now “Yes/No” instead of “yes/no”.

## $PARENT1
## [1] "No"  "Yes"
## 
## $CAR_TYPE
## [1] "Minivan"     "Panel Truck" "Pickup"      "Sports Car"  "SUV"        
## [6] "Van"        
## 
## $JOB
## [1] "Blue Collar"  "Clerical"     "Doctor"       "Home Maker"   "Lawyer"      
## [6] "Manager"      "Professional" "Student"      "Unknown"     
## 
## $CAR_USE
## [1] "Commercial" "Private"   
## 
## $URBANICITY
## [1] "Highly Rural/ Rural" "Highly Urban/ Urban"
## 
## $RED_CAR
## [1] "No"  "Yes"
## 
## $REVOKED
## [1] "No"  "Yes"
## 
## $MSTATUS
## [1] "No"  "Yes"
## 
## $EDUCATION
## [1] "<High School" "Bachelors"    "High School"  "Masters"      "PhD"         
## 
## $SEX
## [1] "F" "M"

2.2 Transformations and Missing Values

First, we’ll manually adjust two special cases of missing or outlier values.

In cases where YOJ is zero and INCOME is NA, we’ll set INCOME to zero to avoid imputing new values over legitimate instances of non-employment.

##           JOB
## 1  Home Maker
## 2     Student
## 36    Unknown

There is also at least one value of CAR_AGE that is less than zero - we’ll assume this is a data collection error and set it to zero (representing a brand-new car).

We’ll use MICE to impute values for our remaining variables with missing values - AGE, YOJ, CAR_AGE, INCOME and HOME_VAL. We might reasonably assume there are relationships among them (being older, with more years on the job, may correlate with higher income and home value). Taking simple means or medians would suppress those relationships, whereas MICE should provide a better imputation.
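A minimal sketch of the imputation, using mice defaults and an arbitrary seed (neither confirmed by the report), is below; train_df at this point already has the factor conversions from the previous section.

```r
library(mice)

# Impute AGE, YOJ, CAR_AGE, INCOME and HOME_VAL jointly, letting the other
# variables inform the imputations (number of imputations and seed assumed)
imp <- mice(train_df, m = 5, seed = 621, printFlag = FALSE)
train_df <- complete(imp)
```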


A log transformation will be applied to INCOME, TARGET_AMT and OLDCLAIM to bring their right-skewed distributions closer to normal.

A Box-Cox transformation will be applied to BLUEBOOK, TRAVTIME and TIF, so that they more closely follow a normal distribution.
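A sketch of these transformations, using log1p for the log transforms (to handle zeros) and caret's preProcess for the Box-Cox step; the new column names are assumptions, not the report's exact code.

```r
library(dplyr)
library(caret)

# Log-transform the right-skewed variables (log1p handles the zeros)
train_df <- train_df %>%
  mutate(
    INCOME_LOG     = log1p(INCOME),
    OLDCLAIM_LOG   = log1p(OLDCLAIM),
    TARGET_AMT_LOG = log1p(TARGET_AMT)
  )

# Estimate Box-Cox lambdas for the strictly positive variables and apply them
bc_vars <- c("BLUEBOOK", "TRAVTIME", "TIF")
bc_fit  <- preProcess(train_df[, bc_vars], method = "BoxCox")
train_df[, paste0(bc_vars, "_BC")] <- predict(bc_fit, train_df[, bc_vars])
```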

Binning values for CAR_AGE, HOME_VAL and TIF:
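A sketch of the binning, with breakpoints loosely based on the quartiles from the summary table (the actual cut points are assumptions):

```r
library(dplyr)

train_df <- train_df %>%
  mutate(
    CAR_AGE_BIN  = cut(CAR_AGE,  breaks = c(-Inf, 1, 8, 12, Inf)),
    HOME_VAL_BIN = cut(HOME_VAL, breaks = c(-Inf, 0, 161160, 238724, Inf)),
    TIF_BIN      = cut(TIF,      breaks = c(-Inf, 1, 4, 7, Inf))
  )
```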

Creating dummy variables for factors with two levels:
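The two-level factors are recoded as 0/1 indicators; a sketch using the dummy names that appear in the variable list below:

```r
library(dplyr)

train_df <- train_df %>%
  mutate(
    MALE          = as.integer(SEX == "M"),
    MARRIED       = as.integer(MSTATUS == "Yes"),
    LIC_REVOKED   = as.integer(REVOKED == "Yes"),
    CAR_RED       = as.integer(RED_CAR == "Yes"),
    PRIVATE_USE   = as.integer(CAR_USE == "Private"),
    SINGLE_PARENT = as.integer(PARENT1 == "Yes"),
    URBAN         = as.integer(URBANICITY == "Highly Urban/ Urban")
  )
```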

2.3 Visualizations

##  [1] "TARGET_FLAG"   "TARGET_AMT"    "KIDSDRIV"      "AGE"          
##  [5] "HOMEKIDS"      "YOJ"           "INCOME"        "HOME_VAL"     
##  [9] "EDUCATION"     "JOB"           "TRAVTIME"      "BLUEBOOK"     
## [13] "TIF"           "CAR_TYPE"      "OLDCLAIM"      "CLM_FREQ"     
## [17] "MVR_PTS"       "CAR_AGE"       "CAR_AGE_BIN"   "HOME_VAL_BIN" 
## [21] "TIF_BIN"       "MALE"          "MARRIED"       "LIC_REVOKED"  
## [25] "CAR_RED"       "PRIVATE_USE"   "SINGLE_PARENT" "URBAN"        
## [29] "CAR_CRASH"

2.4 Training and Validation Sets

We’ll split our training data into train (80%) and validation (20%) datasets.
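A sketch of the split using caret's createDataPartition to preserve the TARGET_FLAG class proportions (the seed and object names are assumptions):

```r
library(caret)

set.seed(621)  # arbitrary seed for reproducibility
train_idx <- createDataPartition(train_df$TARGET_FLAG, p = 0.8, list = FALSE)
train_set <- train_df[train_idx, ]
valid_set <- train_df[-train_idx, ]
```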

3. Multiple linear regression

3.1 Model 1 - Lasso

The cv.glmnet() function was used to perform k-fold cross-validation with variable selection using lasso regularization. The following attribute settings were selected for the model:

  • type.measure = “mse” - The type.measure is set to minimize the mean squared error of the model.
  • nfolds = 10 - Given the size of the dataset we defaulted to 10-fold cross-validation.
  • family = “gaussian” - For linear regression.
  • alpha = 1 - An alpha value of 1 sets the variable shrinkage method to lasso.
  • weights - A weight of 0.2638 / n for observations with a TARGET_AMT of 0 and 0.7362 / n for all other observations, which up-weights the crash records (see the sketch after this list).
  • standardize = TRUE - Finally, we explicitly set the standardization attribute to TRUE; this will normalize the predictor variables to a mean of zero and a standard deviation of one before modeling.
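A sketch of this setup is shown below; the object names (train_set, X, Y, weights_lm) and the formula used to build the model matrix are assumptions based on the call printed further down.

```r
library(glmnet)

# Model matrix of predictors (drop the intercept column; glmnet adds its own)
X <- model.matrix(TARGET_AMT ~ . - TARGET_FLAG, data = train_set)[, -1]
Y <- train_set$TARGET_AMT
n <- length(Y)

# Up-weight the crash records to offset the zero-inflated response
weights_lm <- ifelse(Y == 0, 0.2638 / n, 0.7362 / n)

cv_lasso_lm <- cv.glmnet(
  x = X, y = log(Y + 0.001), weights = weights_lm,
  type.measure = "mse", nfolds = 10,
  family = "gaussian", standardize = TRUE, alpha = 1
)

coef(cv_lasso_lm, s = "lambda.min")  # error-minimizing lambda
coef(cv_lasso_lm, s = "lambda.1se")  # most regularized lambda within 1 SE
```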

The resulting model is explored by extracting coefficients at two different values for lambda, lambda.min and lambda.1se respectively.

The coefficients extracted using lambda.min minimize the mean cross-validated error. The resulting model includes 33 non-zero coefficients and has an AICc of 60.07. The coefficients extracted using lambda.1se produce the most regularized model (its cross-validated error is within one standard error of the minimum). That model has 25 non-zero coefficients and an AICc of 44.27.

The coefficients extracted using lambda.1se therefore result in the lowest AICc (best model performance) with fewer predictor variables.

## 
## Call:  cv.glmnet(x = X, y = log(Y + 0.001), weights = weights_lm, type.measure = "mse",      nfolds = 10, family = "gaussian", standardize = TRUE, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure     SE Nonzero
## min 0.01852    52   42.23 0.7938      33
## 1se 0.18960    27   42.93 0.6768      25

## $mse
## lambda.min 
##   24335778 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.min 
##   1482.862 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] 60.07401
## 
## $BIC
## [1] 283.5957
## $AICc
## [1] 44.26779
## 
## $BIC
## [1] 213.6643

A closer look at the remaining non-zero coefficients for the selected lambda value (lambda.1se) shows that the URBANICITYHighly Urban/ Urban predictor has the largest impact on the response variable TARGET_AMT: a highly urban home/work area contributes the most to the likelihood and amount of an insurance claim.

Reviewing the top five predictor variables that impact the likelihood and cost associated with an accident:

  • URBANICITYHighly Urban/ Urban - working or living in an urban neighborhood increases the expected cost of damages
  • CAR_USEPrivate - drivers tend to drive their personal cars more carefully, lowering the expected cost (negative coefficient)
  • CAR_TYPESports Car - drivers of sports cars have a higher expected cost of an accident
  • REVOKEDYes - a previously revoked license is associated with a higher expected claim cost
  • JOBManager - individuals who report their job as manager tend to have lower expected claim costs (negative coefficient)

## 38 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                   -1.916973e+00
## KIDSDRIV                       8.730505e-01
## AGE                           -4.234171e-03
## HOMEKIDS                       8.199127e-02
## YOJ                            .           
## INCOME                        -1.294467e-05
## PARENT1Yes                     9.512996e-01
## HOME_VAL                      -2.607327e-06
## MSTATUSYes                    -1.014881e+00
## SEXM                           .           
## EDUCATIONBachelors            -2.225378e-01
## EDUCATIONHigh School           3.185774e-01
## EDUCATIONMasters              -3.543408e-01
## EDUCATIONPhD                   .           
## JOBClerical                    7.358522e-02
## JOBDoctor                     -1.519643e+00
## JOBHome Maker                  .           
## JOBLawyer                     -3.220859e-02
## JOBManager                    -1.989085e+00
## JOBProfessional                .           
## JOBStudent                     .           
## JOBUnknown                     .           
## TRAVTIME                       2.379521e-02
## CAR_USEPrivate                -1.894053e+00
## BLUEBOOK                      -5.140567e-05
## TIF                           -1.223319e-01
## CAR_TYPEPanel Truck            .           
## CAR_TYPEPickup                 .           
## CAR_TYPESports Car             1.012508e+00
## CAR_TYPESUV                    3.266771e-01
## CAR_TYPEVan                    .           
## RED_CARYes                     .           
## OLDCLAIM                       .           
## CLM_FREQ                       4.871507e-01
## REVOKEDYes                     1.705182e+00
## MVR_PTS                        2.945666e-01
## CAR_AGE                       -1.106691e-02
## URBANICITYHighly Urban/ Urban  5.493494e+00

As mentioned earlier, the dataset has a high degree of correlation among the predictor variables. Lasso regression handles this by keeping one variable from a correlated group and shrinking the remaining ones toward zero (as can be seen in the plot of coefficients).

Model Performance

The lasso model using coefficients extracted at lambda.min was used to predict TARGET_AMT for the held-out validation cases, comparing the predicted claim amounts to the actual cost of a car crash.
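A sketch of how these metrics can be computed with yardstick (the object names and the back-transformation of the log-scale predictions are assumptions):

```r
library(yardstick)

# Validation-set model matrix built the same way as the training matrix
X_valid <- model.matrix(TARGET_AMT ~ . - TARGET_FLAG, data = valid_set)[, -1]

# Predictions are on the log scale, so undo the log(Y + 0.001) transform
pred_amt <- exp(predict(cv_lasso_lm, newx = X_valid, s = "lambda.min")[, 1]) - 0.001

results <- tibble::tibble(truth = valid_set$TARGET_AMT, estimate = pred_amt)
reg_metrics <- metric_set(mape, smape, mase, mpe, rmse)
reg_metrics(results, truth = truth, estimate = estimate)
```

The NaN values for the percentage-based metrics (mape, smape, mpe) in the output below arise largely because many actual TARGET_AMT values are zero, which makes percentage errors undefined.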

## # A tibble: 5 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mape    standard     NaN    
## 2 smape   standard     NaN    
## 3 mase    standard       0.577
## 4 mpe     standard     NaN    
## 5 rmse    standard    4963.

Model Assumptions

To reduce multicollinearity we can use regularization, which keeps all the features but reduces the magnitude of the model coefficients. This is a good solution when each predictor contributes to predicting the dependent variable.

The standardized residuals plot appears to show roughly constant variance, though there are some outliers.

Lasso regression addresses the multicollinearity issue by retaining the variable with the largest coefficient in a correlated group while shrinking the rest to (nearly) zero.

3.2 Model 2 -

3.3 Model 3 -

3.4 Model selection

4. Binary logistic regression

4.1 Model 1 - Lasso

The cv.glmnet() function was used to perform k-fold cross-validation with variable selection using lasso regularization. The following attribute settings were selected for the model:

  • type.measure = “class” - The type.measure is set to class to minimize the misclassification error of the model, since accurate classification of the validation data set is the desired outcome.
  • nfolds = 10 - Through our exploration, we did not uncover any hard and fast rules governing the optimal number of folds for k-fold cross-validation. Given the size of the training dataset, we opted for 10-fold cross-validation.
  • family = binomial - For logistic regression, the family attribute of the function is set to binomial.
  • link = logit - For this model, we choose the default link function for a logistic model.
  • alpha = 1 - An alpha value of 1 sets the variable shrinkage method to lasso.
  • standardize = TRUE - Finally, we explicitly set the standardization attribute to TRUE; this will normalize the predictor variables to a mean of zero and a standard deviation of one before modeling. (A sketch of the full call follows this list.)
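The sketch below mirrors the call printed further down; the object names and the class-weight values (which reuse the weighting described for the linear model) are assumptions.

```r
library(glmnet)

X <- model.matrix(TARGET_FLAG ~ . - TARGET_AMT, data = train_set)[, -1]
Y <- train_set$TARGET_FLAG

# Class weights to offset the roughly 74/26 imbalance (values assumed)
weights_log <- ifelse(Y == "0", 0.2638, 0.7362)

cv_lasso_log <- cv.glmnet(
  x = X, y = Y, weights = weights_log,
  type.measure = "class", nfolds = 10,
  family = "binomial", link = "logit",
  standardize = TRUE, alpha = 1
)

coef(cv_lasso_log, s = "lambda.min")  # least cross-validated misclassification error
coef(cv_lasso_log, s = "lambda.1se")  # most regularized model within 1 SE
```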

The resulting model is explored by extracting coefficients at two different values for lambda, lambda.min and lambda.1se respectively.

  • The coefficients extracted using lambda.min minimize the mean cross-validated misclassification error. The resulting model retains 35 non-zero coefficients and has an AICc of -753.60.
  • The coefficients extracted using lambda.1se produce the most regularized model (its cross-validated error is within one standard error of the minimum). That model retains 21 non-zero coefficients and has an AICc of -681.83.

The coefficients extracted using lambda.min result in the lowest AICc and the best-performing model.

## 38 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                   -1.312526e+00
## KIDSDRIV                       4.298166e-01
## AGE                           -2.477467e-03
## HOMEKIDS                       3.746407e-02
## YOJ                           -5.842713e-03
## INCOME                        -3.607228e-06
## PARENT1Yes                     3.622285e-01
## HOME_VAL                      -8.137271e-07
## MSTATUSYes                    -4.934565e-01
## SEXM                           8.151839e-02
## EDUCATIONBachelors            -3.770719e-01
## EDUCATIONHigh School           5.021857e-03
## EDUCATIONMasters              -3.595104e-01
## EDUCATIONPhD                  -1.475271e-01
## JOBClerical                    1.144408e-01
## JOBDoctor                     -9.877535e-01
## JOBHome Maker                 -3.086579e-02
## JOBLawyer                     -1.462496e-01
## JOBManager                    -8.007724e-01
## JOBProfessional               -2.754665e-02
## JOBStudent                     .           
## JOBUnknown                    -2.020300e-01
## TRAVTIME                       1.368557e-02
## CAR_USEPrivate                -7.456282e-01
## BLUEBOOK                      -2.472417e-05
## TIF                           -5.847298e-02
## CAR_TYPEPanel Truck            3.303235e-01
## CAR_TYPEPickup                 3.768649e-01
## CAR_TYPESports Car             9.219403e-01
## CAR_TYPESUV                    5.682755e-01
## CAR_TYPEVan                    4.696603e-01
## RED_CARYes                     .           
## OLDCLAIM                      -6.861829e-06
## CLM_FREQ                       1.934167e-01
## REVOKEDYes                     8.004127e-01
## MVR_PTS                        1.086814e-01
## CAR_AGE                        6.507721e-04
## URBANICITYHighly Urban/ Urban  2.240063e+00
## 38 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                   -8.466527e-01
## KIDSDRIV                       2.403583e-01
## AGE                            .           
## HOMEKIDS                       2.336057e-02
## YOJ                            .           
## INCOME                        -3.818541e-06
## PARENT1Yes                     3.137013e-01
## HOME_VAL                      -9.085407e-07
## MSTATUSYes                    -2.417553e-01
## SEXM                           .           
## EDUCATIONBachelors             .           
## EDUCATIONHigh School           1.219993e-01
## EDUCATIONMasters              -3.508506e-02
## EDUCATIONPhD                   .           
## JOBClerical                    .           
## JOBDoctor                     -2.314830e-01
## JOBHome Maker                  .           
## JOBLawyer                      .           
## JOBManager                    -5.122492e-01
## JOBProfessional                .           
## JOBStudent                     .           
## JOBUnknown                     .           
## TRAVTIME                       5.776408e-03
## CAR_USEPrivate                -5.179956e-01
## BLUEBOOK                      -1.709601e-05
## TIF                           -3.196130e-02
## CAR_TYPEPanel Truck            .           
## CAR_TYPEPickup                 .           
## CAR_TYPESports Car             2.022568e-01
## CAR_TYPESUV                    3.287222e-03
## CAR_TYPEVan                    .           
## RED_CARYes                     .           
## OLDCLAIM                       .           
## CLM_FREQ                       1.450079e-01
## REVOKEDYes                     5.005163e-01
## MVR_PTS                        8.694194e-02
## CAR_AGE                       -3.991969e-03
## URBANICITYHighly Urban/ Urban  1.642656e+00
## 
## Call:  cv.glmnet(x = X, y = Y, weights = weights_log, type.measure = "class",      nfolds = 10, family = "binomial", link = "logit", standardize = TRUE,      alpha = 1) 
## 
## Measure: Misclassification Error 
## 
##      Lambda Index Measure       SE Nonzero
## min 0.00092    55  0.2690 0.007814      35
## 1se 0.01807    23  0.2764 0.008373      21

## $deviance
## lambda.min 
##   1.063873 
## attr(,"measure")
## [1] "Binomial Deviance"
## 
## $class
## lambda.min 
##  0.2766544 
## attr(,"measure")
## [1] "Misclassification Error"
## 
## $auc
## [1] 0.8115942
## attr(,"measure")
## [1] "AUC"
## 
## $mse
## lambda.min 
##  0.3575389 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.min 
##  0.7139408 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -753.6031
## 
## $BIC
## [1] -516.5564
## $AICc
## [1] -681.8251
## 
## $BIC
## [1] -539.5061

Looking closer at the remaining coefficients for the selected lambda value, the URBANICITYHighly Urban/ Urban predictor has the largest coefficient, while several variables (for example CAR_AGE, EDUCATIONHigh School, HOME_VAL, INCOME and OLDCLAIM) have coefficients at or very close to zero. As mentioned earlier, the dataset has a high degree of correlation among the predictor variables. Lasso regression handles this by keeping one variable from a correlated group and shrinking the remaining ones toward zero (as can be seen in the plot of coefficients).

## 38 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                   -1.312526e+00
## KIDSDRIV                       4.298166e-01
## AGE                           -2.477467e-03
## HOMEKIDS                       3.746407e-02
## YOJ                           -5.842713e-03
## INCOME                        -3.607228e-06
## PARENT1Yes                     3.622285e-01
## HOME_VAL                      -8.137271e-07
## MSTATUSYes                    -4.934565e-01
## SEXM                           8.151839e-02
## EDUCATIONBachelors            -3.770719e-01
## EDUCATIONHigh School           5.021857e-03
## EDUCATIONMasters              -3.595104e-01
## EDUCATIONPhD                  -1.475271e-01
## JOBClerical                    1.144408e-01
## JOBDoctor                     -9.877535e-01
## JOBHome Maker                 -3.086579e-02
## JOBLawyer                     -1.462496e-01
## JOBManager                    -8.007724e-01
## JOBProfessional               -2.754665e-02
## JOBStudent                     .           
## JOBUnknown                    -2.020300e-01
## TRAVTIME                       1.368557e-02
## CAR_USEPrivate                -7.456282e-01
## BLUEBOOK                      -2.472417e-05
## TIF                           -5.847298e-02
## CAR_TYPEPanel Truck            3.303235e-01
## CAR_TYPEPickup                 3.768649e-01
## CAR_TYPESports Car             9.219403e-01
## CAR_TYPESUV                    5.682755e-01
## CAR_TYPEVan                    4.696603e-01
## RED_CARYes                     .           
## OLDCLAIM                      -6.861829e-06
## CLM_FREQ                       1.934167e-01
## REVOKEDYes                     8.004127e-01
## MVR_PTS                        1.086814e-01
## CAR_AGE                        6.507721e-04
## URBANICITYHighly Urban/ Urban  2.240063e+00

Model Performance

The coefficients extracted at the lambda.1se value are used to predict the probability of a crash (TARGET_FLAG) for the validation data set. The confusion matrix below corresponds to an overall accuracy of about 73.5%.

##          True
## Predicted    0   1 Total
##     0      876 107   983
##     1      326 324   650
##     Total 1202 431  1633
## 
##  Percent Correct:  0.7348
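A sketch of how these validation-set predictions and the confusion matrix can be produced (object names assumed):

```r
# Validation-set model matrix built the same way as the training matrix
X_valid <- model.matrix(TARGET_FLAG ~ . - TARGET_AMT, data = valid_set)[, -1]

# Predicted classes at the more regularized lambda
pred_class <- predict(cv_lasso_log, newx = X_valid, s = "lambda.1se", type = "class")

# Rows = predicted class, columns = true class, as in the table above
table(Predicted = as.vector(pred_class), True = valid_set$TARGET_FLAG)
```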

Checking Model Assumptions

Again we check for a linear relationship between the continuous independent variables and the logit of the target variable. Visually inspecting the results, there is a broadly linear trend, but several variables show deviations from a straight line.


Lasso regression addresses the multicollinearity issue by retaining the variable with the largest coefficient in a correlated group while shrinking the rest to (nearly) zero.

4.2 Model 2 -

4.3 Model 3 -

4.4 Model selection

**Model 3.1: Lasso Linear Regression**

**Model 4.1: Lasso Logistic Regression**

##       F1 
## 0.599445

Regression

Model                mape  smape  mase    mpe  rmse      AIC
M3.1: Lasso Linear   NaN   NaN    0.5767  NaN  4963.406  60.074

Logistic

Model                 Accuracy  Classification error rate  F1      Deviance  R2  Sensitivity  Specificity  Precision  AIC
M4.1: Lasso Logistic  0.8147    0.1853                     0.5994  1.0468    NA  0.7517       0.7288       0.4985     -753.6031

5. Predictions

6. Conclusion

7. References

Appendix: R code