Clean up data
In this homework assignment, we will explore, analyze, and model a
data set containing approximately 8,000 records, each representing a customer
at an auto insurance company. Using the training data, we will build a multiple
linear regression model on the continuous variable TARGET_AMT and a binary
logistic regression model on TARGET_FLAG to predict the probability that a
person will crash their car, and the amount of money it will cost if they do.
We are going to build several models using different techniques and variable-selection strategies. To best assess our predictive models, we created a test set within our training data, splitting it along an 80/20 training/testing proportion, before applying the finalized models to a separate evaluation dataset that does not contain the target.
The insurance training dataset contains 8,161 observations of 26 variables; each record represents a customer at an auto insurance company. The evaluation dataset contains 2,141 observations of 26 variables. The descriptions of each column are below.
Each record has two response variables. The first response variable,
TARGET_FLAG, is a 1 or a 0: a 1 means the person was in a car crash,
and a 0 means the person was not. The second response variable is
TARGET_AMT. This value is zero if the person did not crash their car;
if they did, it is a value greater than zero.
The training data can be previewed below. The
TARGET_FLAG column is the binary dependent variable
denoting whether a car was in a crash (target = 1) or not (target = 0).
TARGET_AMT is a numeric dependent variable representing
the cost of the crash: the minimum is 0 (the car was not in a crash,
so there was no cost) and the maximum is 107,586.1.
The table below provides valuable descriptive statistics about the
training data. In the raw file, 14 variables are categorical and 12 are
numeric; the summary below reflects the data after the four dollar-valued
columns were parsed to numeric and INDEX was dropped (10 character and
15 numeric columns). There is no missing data in the categorical variables,
while among the numeric variables YOJ (years on job) has about 5% missing
data, CAR_AGE (vehicle age) about 6%, AGE (age of driver) less than 1%, and
INCOME and HOME_VAL roughly 5-6% each. Most of the numeric variables have a
minimum of zero. Some values look suspect and will be dealt with later; for
example, CAR_AGE has a minimum value of -3. Some variables are stored as
character although they should be numeric, and vice versa. The variables
OLDCLAIM, BLUEBOOK, HOME_VAL, and INCOME have a $ sign in front of the
number; we will remove the sign and convert these variables to numeric.
The variables MSTATUS (marital status), EDUCATION, JOB, CAR_TYPE, SEX, and
URBANICITY contain the prefix z_, which should be removed as well.
| | Value |
|---|---|
| Name | train_df |
| Number of rows | 8161 |
| Number of columns | 25 |
| Column type frequency: character | 10 |
| Column type frequency: numeric | 15 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| PARENT1 | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| MSTATUS | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| SEX | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| EDUCATION | 0 | 1 | 3 | 12 | 0 | 5 | 0 |
| JOB | 0 | 1 | 6 | 12 | 0 | 9 | 0 |
| CAR_USE | 0 | 1 | 7 | 10 | 0 | 2 | 0 |
| CAR_TYPE | 0 | 1 | 3 | 11 | 0 | 6 | 0 |
| RED_CAR | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| REVOKED | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| URBANICITY | 0 | 1 | 19 | 19 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| TARGET_FLAG | 0 | 1.00 | 0.26 | 0.44 | 0 | 0 | 0 | 1 | 1.0 |
| TARGET_AMT | 0 | 1.00 | 1504.32 | 4704.03 | 0 | 0 | 0 | 1036 | 107586.1 |
| KIDSDRIV | 0 | 1.00 | 0.17 | 0.51 | 0 | 0 | 0 | 0 | 4.0 |
| AGE | 6 | 1.00 | 44.79 | 8.63 | 16 | 39 | 45 | 51 | 81.0 |
| HOMEKIDS | 0 | 1.00 | 0.72 | 1.12 | 0 | 0 | 0 | 1 | 5.0 |
| YOJ | 454 | 0.94 | 10.50 | 4.09 | 0 | 9 | 11 | 13 | 23.0 |
| INCOME | 445 | 0.95 | 61898.09 | 47572.68 | 0 | 28097 | 54028 | 85986 | 367030.0 |
| HOME_VAL | 464 | 0.94 | 154867.29 | 129123.77 | 0 | 0 | 161160 | 238724 | 885282.0 |
| TRAVTIME | 0 | 1.00 | 33.49 | 15.91 | 5 | 22 | 33 | 44 | 142.0 |
| BLUEBOOK | 0 | 1.00 | 15709.90 | 8419.73 | 1500 | 9280 | 14440 | 20850 | 69740.0 |
| TIF | 0 | 1.00 | 5.35 | 4.15 | 1 | 1 | 4 | 7 | 25.0 |
| OLDCLAIM | 0 | 1.00 | 4037.08 | 8777.14 | 0 | 0 | 0 | 4636 | 57037.0 |
| CLM_FREQ | 0 | 1.00 | 0.80 | 1.16 | 0 | 0 | 0 | 2 | 5.0 |
| MVR_PTS | 0 | 1.00 | 1.70 | 2.15 | 0 | 0 | 1 | 3 | 13.0 |
| CAR_AGE | 510 | 0.94 | 8.33 | 5.70 | -3 | 1 | 8 | 12 | 28.0 |
The table below shows the distribution of the TARGET_FLAG variable.
Class 1 makes up 26% and class 0 makes up 74% of the
target variable. As a result, we have an unbalanced class distribution
that we have to deal with: some additional steps (observation weighting,
bootstrapping, etc.) are needed before using logistic regression. A
weighting sketch follows the table below.
| Value | Proportion |
|---|---|
| 0 | 0.74 |
| 1 | 0.26 |
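One common mitigation is inverse-frequency observation weighting. Below is a minimal sketch of how such weights could be built; the exact construction of the weights used later in the report (weights_log) may differ.

```r
# Hypothetical sketch: inverse-frequency weights for the imbalanced target.
# In practice these would be computed on the 80% training split only.
p1 <- mean(train_df$TARGET_FLAG == 1)                         # ~0.26, share of crashes
weights_log <- ifelse(train_df$TARGET_FLAG == 1, 1 - p1, p1)  # up-weight the minority class
```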
The distribution of the second target variable
TARGET_AMT is right-skewed; we will transform this variable
(log/Box-Cox) to bring it closer to a normal distribution.
First, we will remove the z_ and $ prefixes
together with the INDEX variable (an identification
variable).
We transform TARGET_FLAG, CAR_TYPE, CAR_USE, EDUCATION,
JOB, MSTATUS, PARENT1, RED_CAR, REVOKED, SEX, and
URBANICITY to factors, and convert INCOME, HOME_VAL,
BLUEBOOK, and OLDCLAIM to numeric. A sketch of these steps follows.
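A minimal sketch of these cleaning steps, assuming the raw data sit in a data frame named train_df and that dplyr, stringr, and readr are available:

```r
library(dplyr)
library(stringr)
library(readr)

train_df <- train_df %>%
  select(-INDEX) %>%                                    # drop the identification variable
  mutate(across(c(INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM),
                ~ parse_number(as.character(.x)))) %>%  # strip "$" and "," and convert to numeric
  mutate(across(where(is.character),
                ~ str_remove(.x, "z_"))) %>%            # remove the z_ prefix from values
  mutate(across(c(TARGET_FLAG, CAR_TYPE, CAR_USE, EDUCATION, JOB, MSTATUS,
                  PARENT1, RED_CAR, REVOKED, SEX, URBANICITY),
                as.factor))
```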
As we see below, there are no unwanted characters left in the factor
variables. The JOB variable contained an empty level
"", which was substituted with "Unknown". The
RED_CAR levels become "Yes/No" instead of "yes/no".
## $PARENT1
## [1] "No" "Yes"
##
## $CAR_TYPE
## [1] "Minivan" "Panel Truck" "Pickup" "Sports Car" "SUV"
## [6] "Van"
##
## $JOB
## [1] "Blue Collar" "Clerical" "Doctor" "Home Maker" "Lawyer"
## [6] "Manager" "Professional" "Student" "Unknown"
##
## $CAR_USE
## [1] "Commercial" "Private"
##
## $URBANICITY
## [1] "Highly Rural/ Rural" "Highly Urban/ Urban"
##
## $RED_CAR
## [1] "No" "Yes"
##
## $REVOKED
## [1] "No" "Yes"
##
## $MSTATUS
## [1] "No" "Yes"
##
## $EDUCATION
## [1] "<High School" "Bachelors" "High School" "Masters" "PhD"
##
## $SEX
## [1] "F" "M"
Next, we'll manually adjust two special cases of missing or outlier values.
In cases where YOJ is zero and INCOME is
NA, we'll set INCOME to zero to avoid imputing new values
over legitimate instances of non-employment.
## JOB
## 1 Home Maker
## 2 Student
## 36 Unknown
There is also at least one value of CAR_AGE that is less
than zero; we'll assume this is a data collection error and set it to
zero (representing a brand-new car). A sketch of both fixes follows.
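A sketch of the two manual adjustments described above, assuming the dplyr pipeline from earlier:

```r
train_df <- train_df %>%
  mutate(
    # YOJ == 0 with missing INCOME: treat as legitimate non-employment, not missingness
    INCOME = replace(INCOME, !is.na(YOJ) & YOJ == 0 & is.na(INCOME), 0),
    # negative CAR_AGE is assumed to be a data entry error; clamp to 0 (brand-new car)
    CAR_AGE = pmax(CAR_AGE, 0)
  )
```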
We'll use MICE to impute values for our remaining variables with
missing values: AGE, YOJ,
CAR_AGE, INCOME, and HOME_VAL.
We might reasonably assume there are relationships between them (being
older, with more years on the job, may correlate with higher income and
home value). Taking simple means or medians might suppress those features,
but MICE should provide a better imputation.
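A minimal sketch of the imputation; the number of imputations and the seed are assumptions, and mice's default methods (predictive mean matching for numeric columns) are relied on:

```r
library(mice)

imp <- mice(train_df, m = 5, seed = 42, printFlag = FALSE)  # multivariate imputation
train_df <- mice::complete(imp, 1)                          # take the first completed dataset
```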
We'll split our training data into train (80%) and validation (20%) datasets.
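A sketch of the split, with an assumed seed:

```r
set.seed(42)                                                           # assumed seed
idx   <- sample(seq_len(nrow(train_df)), floor(0.8 * nrow(train_df)))  # 80% of row indices
train <- train_df[idx, ]
valid <- train_df[-idx, ]
```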
A log transformation will be applied to the variables INCOME,
TARGET_AMT, and OLDCLAIM to shift their
distributions from right-skewed toward normal.
A Box-Cox transformation will be applied to the variables
BLUEBOOK, TRAVTIME, and TIF so they
more closely follow a normal distribution.
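A sketch of these transformations, fitted on the training split and applied to both splits (caret's preProcess is assumed for the Box-Cox step; TARGET_AMT is left on the dollar scale here because the model call further below applies log(Y + 0.001) at fit time):

```r
library(caret)

# log-transform the right-skewed predictors (log1p handles the zeros)
train <- train %>% mutate(across(c(INCOME, OLDCLAIM), log1p))
valid <- valid %>% mutate(across(c(INCOME, OLDCLAIM), log1p))

# estimate Box-Cox parameters on the training split only, then apply to both
vars <- c("BLUEBOOK", "TRAVTIME", "TIF")
bc   <- preProcess(train[, vars], method = "BoxCox")
train[, vars] <- predict(bc, train[, vars])
valid[, vars] <- predict(bc, valid[, vars])
```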
We bin the values of CAR_AGE, HOME_VAL, and
TIF, and create dummy variables for factors with two levels,
producing the columns listed below (a sketch of this step follows the listing):
## [1] "TARGET_FLAG" "TARGET_AMT" "KIDSDRIV" "AGE"
## [5] "HOMEKIDS" "YOJ" "INCOME" "HOME_VAL"
## [9] "EDUCATION" "JOB" "TRAVTIME" "BLUEBOOK"
## [13] "TIF" "CAR_TYPE" "OLDCLAIM" "CLM_FREQ"
## [17] "MVR_PTS" "CAR_AGE" "CAR_AGE_BIN" "HOME_VAL_BIN"
## [21] "TIF_BIN" "MALE" "MARRIED" "LIC_REVOKED"
## [25] "CAR_RED" "PRIVATE_USE" "SINGLE_PARENT" "URBAN"
## [29] "CAR_CRASH"
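A hypothetical sketch of the binning and two-level dummy coding that would produce the columns listed above. The cut points are illustrative (quartiles from the summary table), not necessarily the report's, and refer to the untransformed scale, so in practice the TIF binning would happen before the Box-Cox step:

```r
bin_and_dummy <- function(df) {
  df %>%
    mutate(
      CAR_AGE_BIN  = cut(CAR_AGE,  breaks = c(-Inf, 1, 8, 12, Inf)),   # quartile bins
      HOME_VAL_BIN = cut(HOME_VAL, breaks = c(-Inf, 0, 161160, Inf)),  # zero / below / above median
      TIF_BIN      = cut(TIF,      breaks = c(-Inf, 1, 4, 7, Inf)),    # quartile bins
      MALE          = as.integer(SEX == "M"),
      MARRIED       = as.integer(MSTATUS == "Yes"),
      LIC_REVOKED   = as.integer(REVOKED == "Yes"),
      CAR_RED       = as.integer(RED_CAR == "Yes"),
      PRIVATE_USE   = as.integer(CAR_USE == "Private"),
      SINGLE_PARENT = as.integer(PARENT1 == "Yes"),
      URBAN         = as.integer(URBANICITY == "Highly Urban/ Urban"),
      CAR_CRASH     = as.integer(TARGET_FLAG == "1")
    )
}
train <- bin_and_dummy(train)
valid <- bin_and_dummy(valid)
```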
The cv.glmnet() function was used to perform k-fold cross-validation with variable selection via lasso regularization. The following settings were selected for the model: a log-transformed response (log(TARGET_AMT + 0.001)), 10-fold cross-validation, and observation weights (weights_lm) of 0.7362/n for all observations with nonzero TARGET_AMT. The resulting model is explored by extracting coefficients at two different values of lambda, lambda.min and lambda.1se respectively.
The coefficients extracted using lambda.min minimize the mean cross-validated error; the resulting model includes 33 nonzero coefficients and has an AICc of 60.07. The coefficients extracted using lambda.1se produce the most regularized model whose cross-validated error is within one standard error of the minimum; this model has 25 nonzero coefficients and an AICc of 44.27.
The coefficients extracted using lambda.1se therefore give the lowest AICc (the better of the two by this criterion) with fewer predictor variables. A reconstruction of the fit is sketched below, followed by the cross-validation summary.
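A reconstruction of the fit from the cross-validation call shown below; X (the predictor model matrix), Y (raw TARGET_AMT), and weights_lm are assumed to have been built beforehand:

```r
library(glmnet)

cvfit <- cv.glmnet(x = X, y = log(Y + 0.001), weights = weights_lm,
                   type.measure = "mse", nfolds = 10,
                   family = "gaussian", standardize = TRUE, alpha = 1)

coef_min <- coef(cvfit, s = "lambda.min")  # minimizes mean cross-validated error
coef_1se <- coef(cvfit, s = "lambda.1se")  # most regularized within one SE of the minimum
```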
##
## Call: cv.glmnet(x = X, y = log(Y + 0.001), weights = weights_lm, type.measure = "mse", nfolds = 10, family = "gaussian", standardize = TRUE, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.01852 52 42.23 0.7938 33
## 1se 0.18960 27 42.93 0.6768 25
## $mse
## lambda.min
## 24335778
## attr(,"measure")
## [1] "Mean-Squared Error"
##
## $mae
## lambda.min
## 1482.862
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] 60.07401
##
## $BIC
## [1] 283.5957
## $AICc
## [1] 44.26779
##
## $BIC
## [1] 213.6643
Taking a closer look at the remaining nonzero coefficients at the selected
lambda value of lambda.min, the URBANICITYHighly Urban/ Urban predictor has
the largest impact on the response variable TARGET_AMT: in the lasso model,
a highly urban home/work area contributes most to the likelihood and amount
of an insurance claim.

Reviewing the top five predictor variables that affect the likelihood and
cost associated with an accident:

- URBANICITYHighly Urban/ Urban: working or living in an urban neighborhood increases the expected cost of damages.
- CAR_USEPrivate: drivers tend to drive their personal cars more carefully, reducing expected cost.
- CAR_TYPESports Car: drivers of sports cars have a higher expected accident cost.
- REVOKEDYes: a previously revoked license is associated with a higher expected cost.
- JOBManager: individuals who report their job as manager tend to have lower expected costs.
## 38 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.916973e+00
## KIDSDRIV 8.730505e-01
## AGE -4.234171e-03
## HOMEKIDS 8.199127e-02
## YOJ .
## INCOME -1.294467e-05
## PARENT1Yes 9.512996e-01
## HOME_VAL -2.607327e-06
## MSTATUSYes -1.014881e+00
## SEXM .
## EDUCATIONBachelors -2.225378e-01
## EDUCATIONHigh School 3.185774e-01
## EDUCATIONMasters -3.543408e-01
## EDUCATIONPhD .
## JOBClerical 7.358522e-02
## JOBDoctor -1.519643e+00
## JOBHome Maker .
## JOBLawyer -3.220859e-02
## JOBManager -1.989085e+00
## JOBProfessional .
## JOBStudent .
## JOBUnknown .
## TRAVTIME 2.379521e-02
## CAR_USEPrivate -1.894053e+00
## BLUEBOOK -5.140567e-05
## TIF -1.223319e-01
## CAR_TYPEPanel Truck .
## CAR_TYPEPickup .
## CAR_TYPESports Car 1.012508e+00
## CAR_TYPESUV 3.266771e-01
## CAR_TYPEVan .
## RED_CARYes .
## OLDCLAIM .
## CLM_FREQ 4.871507e-01
## REVOKEDYes 1.705182e+00
## MVR_PTS 2.945666e-01
## CAR_AGE -1.106691e-02
## URBANICITYHighly Urban/ Urban 5.493494e+00
As mentioned earlier, the dataset has high correlation between some predictor variables. The lasso regression approaches this issue by keeping one variable from each correlated group and shrinking the remaining variables (as can be seen in the plot of coefficients).
The lasso model using coefficients extracted at lambda.min was used to predict the held-out validation cases, comparing the predicted TARGET_AMT to the actual cost of a car crash (a scoring sketch follows). Note that in the metrics below, MAPE, SMAPE, and MPE are NaN because many actual TARGET_AMT values are zero, making percentage errors undefined.
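A hypothetical scoring sketch with yardstick, assuming X_valid is the validation model matrix and valid$TARGET_AMT holds the original dollar amounts:

```r
library(yardstick)
library(tibble)

# back-transform the log-scale predictions to dollars
pred <- exp(predict(cvfit, newx = X_valid, s = "lambda.min")) - 0.001

res <- tibble(truth = valid$TARGET_AMT, estimate = as.numeric(pred))
metric_set(mape, smape, mase, mpe, rmse)(res, truth = truth, estimate = estimate)
```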
## # A tibble: 5 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 mape standard NaN
## 2 smape standard NaN
## 3 mase standard 0.577
## 4 mpe standard NaN
## 5 rmse standard 4963.
To reduce multicollinearity we can use regularization, which keeps all the features but shrinks the magnitudes of the model coefficients. This is a good solution when each predictor contributes to predicting the dependent variable.
The standardized residuals plot appears to show constant variance, though there are some outliers.
The lasso regression addresses the multicollinearity issue by keeping the variable with the largest coefficient while shrinking the rest to (nearly) zero.
The cv.glmnet() function was again used to perform k-fold cross-validation with lasso variable selection, this time for the binary logistic regression on TARGET_FLAG.
The resulting model is explored by extracting coefficients at two different values of lambda, lambda.min and lambda.1se respectively.
The coefficients extracted using lambda.min result in the lowest AICc and the better-performing model. A reconstruction of the fit is sketched below, followed by the coefficients at lambda.min and lambda.1se.
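A reconstruction of the logistic fit from the call shown further below; X, Y (here TARGET_FLAG), and weights_log are assumed to exist, and the logit link is the binomial default:

```r
cvfit_log <- cv.glmnet(x = X, y = Y, weights = weights_log,
                       type.measure = "class", nfolds = 10,
                       family = "binomial", standardize = TRUE, alpha = 1)

coef(cvfit_log, s = "lambda.min")  # followed by coef(cvfit_log, s = "lambda.1se")
```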
## 38 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.312526e+00
## KIDSDRIV 4.298166e-01
## AGE -2.477467e-03
## HOMEKIDS 3.746407e-02
## YOJ -5.842713e-03
## INCOME -3.607228e-06
## PARENT1Yes 3.622285e-01
## HOME_VAL -8.137271e-07
## MSTATUSYes -4.934565e-01
## SEXM 8.151839e-02
## EDUCATIONBachelors -3.770719e-01
## EDUCATIONHigh School 5.021857e-03
## EDUCATIONMasters -3.595104e-01
## EDUCATIONPhD -1.475271e-01
## JOBClerical 1.144408e-01
## JOBDoctor -9.877535e-01
## JOBHome Maker -3.086579e-02
## JOBLawyer -1.462496e-01
## JOBManager -8.007724e-01
## JOBProfessional -2.754665e-02
## JOBStudent .
## JOBUnknown -2.020300e-01
## TRAVTIME 1.368557e-02
## CAR_USEPrivate -7.456282e-01
## BLUEBOOK -2.472417e-05
## TIF -5.847298e-02
## CAR_TYPEPanel Truck 3.303235e-01
## CAR_TYPEPickup 3.768649e-01
## CAR_TYPESports Car 9.219403e-01
## CAR_TYPESUV 5.682755e-01
## CAR_TYPEVan 4.696603e-01
## RED_CARYes .
## OLDCLAIM -6.861829e-06
## CLM_FREQ 1.934167e-01
## REVOKEDYes 8.004127e-01
## MVR_PTS 1.086814e-01
## CAR_AGE 6.507721e-04
## URBANICITYHighly Urban/ Urban 2.240063e+00
## 38 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -8.466527e-01
## KIDSDRIV 2.403583e-01
## AGE .
## HOMEKIDS 2.336057e-02
## YOJ .
## INCOME -3.818541e-06
## PARENT1Yes 3.137013e-01
## HOME_VAL -9.085407e-07
## MSTATUSYes -2.417553e-01
## SEXM .
## EDUCATIONBachelors .
## EDUCATIONHigh School 1.219993e-01
## EDUCATIONMasters -3.508506e-02
## EDUCATIONPhD .
## JOBClerical .
## JOBDoctor -2.314830e-01
## JOBHome Maker .
## JOBLawyer .
## JOBManager -5.122492e-01
## JOBProfessional .
## JOBStudent .
## JOBUnknown .
## TRAVTIME 5.776408e-03
## CAR_USEPrivate -5.179956e-01
## BLUEBOOK -1.709601e-05
## TIF -3.196130e-02
## CAR_TYPEPanel Truck .
## CAR_TYPEPickup .
## CAR_TYPESports Car 2.022568e-01
## CAR_TYPESUV 3.287222e-03
## CAR_TYPEVan .
## RED_CARYes .
## OLDCLAIM .
## CLM_FREQ 1.450079e-01
## REVOKEDYes 5.005163e-01
## MVR_PTS 8.694194e-02
## CAR_AGE -3.991969e-03
## URBANICITYHighly Urban/ Urban 1.642656e+00
##
## Call: cv.glmnet(x = X, y = Y, weights = weights_log, type.measure = "class", nfolds = 10, family = "binomial", link = "logit", standardize = TRUE, alpha = 1)
##
## Measure: Misclassification Error
##
## Lambda Index Measure SE Nonzero
## min 0.00092 55 0.2690 0.007814 35
## 1se 0.01807 23 0.2764 0.008373 21
## $deviance
## lambda.min
## 1.063873
## attr(,"measure")
## [1] "Binomial Deviance"
##
## $class
## lambda.min
## 0.2766544
## attr(,"measure")
## [1] "Misclassification Error"
##
## $auc
## [1] 0.8115942
## attr(,"measure")
## [1] "AUC"
##
## $mse
## lambda.min
## 0.3575389
## attr(,"measure")
## [1] "Mean-Squared Error"
##
## $mae
## lambda.min
## 0.7139408
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -753.6031
##
## $BIC
## [1] -516.5564
## $AICc
## [1] -681.8251
##
## $BIC
## [1] -539.5061
Looking closer at the remaining coefficients at the selected lambda
values, the URBANICITYHighly Urban/ Urban predictor has the largest
coefficient, while several variables (including AGE,
YOJ, SEXM, RED_CARYes, and
OLDCLAIM) are shrunk to zero at lambda.1se. As mentioned
earlier, the dataset has high correlation between some predictor variables.
The lasso regression approaches this issue by keeping one variable
from each correlated group and shrinking the remaining variables
(as can be seen in the plot of coefficients).
The coefficients extracted at the lambda.1se value are used to predict the crash probability for the validation data set (a sketch follows). The confusion matrix below shows an accuracy of 73.5%.
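A sketch of that prediction step, assuming a 0.5 classification threshold:

```r
prob <- predict(cvfit_log, newx = X_valid, s = "lambda.1se", type = "response")
pred_class <- ifelse(prob > 0.5, 1, 0)

table(Predicted = pred_class, True = valid$TARGET_FLAG)  # confusion matrix
```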
## True
## Predicted 0 1 Total
## 0 876 107 983
## 1 326 324 650
## Total 1202 431 1633
##
## Percent Correct: 0.7348
Again we check for a linear relationship between the continuous
independent variables and the logit of the target variable. Visually
inspecting the results, there is a linear trend in the relationship,
but most variables show some deviation from the straight line.
**Model 3.1: Lasso Linear Regression**
**Model 4.1: Lasso Logistic Regression**
## F1
## 0.599445
Regression
| Model | MAPE | SMAPE | MASE | MPE | RMSE | AICc |
|---|---|---|---|---|---|---|
| M3.1: Lasso Linear | NaN | NaN | 0.5767 | NaN | 4963.406 | 60.074 |
Logistic
| Model | Accuracy | Classification error rate | F1 | Deviance | R2 | Sensitivity | Specificity | Precision | AICc |
|---|---|---|---|---|---|---|---|---|---|
| M4.1: Lasso Logistic | 0.8147 | 0.1853 | 0.5994 | 1.0468 | NA | 0.7517 | 0.7288 | 0.4985 | -753.6031 |