This project is based on a dataset of an auto insurance company's customers. It builds two predictive models that estimate (a) the probability that a customer will have a car accident and (b) the monetary amount of the insurance claim if an accident occurs.
After an initial variable inspection, three logistic regression models and two multiple linear regression models were prepared and compared on test data.
Based on classification performance metrics, the best model is suggested and applied to the evaluation dataset.
The training dataset contains 8161 observations of 26 variables (one index, two response, and 23 predictor variables).
Each record (row) represents a set of attributes of an individual customer of the insurance company, related to their socio-demographic profile and the insured vehicle. The binary response variable TARGET_FLAG is 1 if the customer's car was in a crash and 0 if not. The continuous response variable TARGET_AMT gives the cost related to the car crash if one happened.
The variables are:
Summaries for the individual variables (after some cleaning) are provided below.
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 0:6008 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1:2153 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median : 0 Median :0.0000
## Mean : 5152 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Min. : 0
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28097
## Median :45.00 Median :0.0000 Median :11.0 Median : 54028
## Mean :44.79 Mean :0.7212 Mean :10.5 Mean : 61898
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85986
## Max. :81.00 Max. :5.0000 Max. :23.0 Max. :367030
## NA's :6 NA's :454 NA's :445
## PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## No :7084 Min. : 0 Yes :4894 M :3786 <High_School :1203
## Yes:1077 1st Qu.: 0 z_No:3267 z_F:4375 Bachelors :2242
## Median :161160 Masters :1658
## Mean :154867 PhD : 728
## 3rd Qu.:238724 z_High_School:2330
## Max. :885282
## NA's :464
## JOB TRAVTIME CAR_USE BLUEBOOK
## z_Blue_Collar:1825 Min. : 5.00 Commercial:3029 Min. : 1500
## Clerical :1271 1st Qu.: 22.00 Private :5132 1st Qu.: 9280
## Professional :1117 Median : 33.00 Median :14440
## Manager : 988 Mean : 33.49 Mean :15710
## Lawyer : 835 3rd Qu.: 44.00 3rd Qu.:20850
## Student : 712 Max. :142.00 Max. :69740
## (Other) :1413
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 no :5783 Min. : 0
## 1st Qu.: 1.000 Panel_Truck: 676 yes:2378 1st Qu.: 0
## Median : 4.000 Pickup :1389 Median : 0
## Mean : 5.351 Sports_Car : 907 Mean : 4037
## 3rd Qu.: 7.000 Van : 750 3rd Qu.: 4636
## Max. :25.000 z_SUV :2294 Max. :57037
##
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 No :7161 Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.328
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly_Urban/ Urban :6492
## z_Highly_Rural/ Rural:1669
##
From the summaries and the chart above we can see that multiple variables have missing data, but the number of NAs is not very high.
Frequency counts of the class occurrences for the discrete variables are provided below.
Histograms of the distributions of the remaining continuous variables are provided below.
We can see that, as expected based on the description above, the values of the variables are non-negative. This suggests a gamma distribution as the generating function, which impacts the choice of regression models later on. In addition, the variables KIDSDRIV, HOMEKIDS, OLDCLAIM, and HOME_VAL have a significant share of observations equal to zero that do not match the rest of the distribution of the data.
A check for near-zero variance did not show a positive result for any variable.
The pairwise correlations between the continuous variables are displayed below.
This analysis shows generally weak correlations between the continuous predictors and the continuous response, as well as between individual predictors. However, several predictors do show moderately strong relationships:

- INCOME, HOME_VAL (home value), BLUEBOOK (car value), AGE, and YOJ (years in the same job) are positively linked with one another.
- HOMEKIDS (the number of children at home) is negatively linked with age, income, and the age of the car.
- CLM_FREQ (claim frequency in the past 5 years), as well as OLDCLAIM (the total claimed amount) and MVR_PTS (the number of Motor Vehicle Record points), are weakly positively linked with TARGET_AMT (the payout in case the car was in a crash).

These relationships and their connection to the target class are inspected in scatter plots provided in the appendix.
Pairwise relationship with the binary target variable
First we inspect the relationship between the binary outcome TARGET_FLAG and the continuous predictors using boxplots.
Analyzing the boxplots, we can see that while there is some variance in the location of the medians per level of the target variable, none of the predictors by itself appears to be particularly informative for the target. This is consistent with the weak correlations with the continuous target variable discussed above.
Moving to discrete predictors, we can inspect the relative frequencies of occurrence of each factor level in conjunction with the target level (0 or 1) using mosaic plots.
From the inspection of the mosaic plots we can conclude that most of the discrete predictors carry information related to the level of the outcome variable, with the exception of SEX and RED_CAR. Based on the distribution of the data, neither being male or female nor having a red car plays a role in the probability of being in a car crash.
Summary of the findings

With the exception of SEX and RED_CAR, the discrete predictors appear highly relevant for predicting the binary outcome.

The variables encoded as strings representing dollar values in the input data, e.g. "$21,100", were converted to numeric variables. In addition, spaces in the factor variables' levels were replaced by underscores in order to comply with the requirements of the models that generate dummy variables from these factors.
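A minimal sketch of these two cleaning steps is shown below; the data frame name train_raw and the assumption that INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM are the dollar-valued columns are illustrative, not taken from the original script.

```r
library(dplyr)

# strip "$" and "," from currency strings and convert to numeric
clean_dollars <- function(x) as.numeric(gsub("[$,]", "", x))

train <- train_raw %>%
  mutate(
    across(c(INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM), clean_dollars),
    # replace spaces in the character columns and turn them into factors,
    # so the generated dummy-variable names contain no spaces
    across(where(is.character), ~ factor(gsub(" ", "_", .x)))
  )
```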
As discussed above, several variables in the dataset have missing observations. However, none of the predictors shows near-zero variance.
Instead of removing all rows with incomplete observations, missing data are imputed using the predictive mean matching approach implemented in the mice package.
In addition, the continuous variables were centered and scaled in several of the models that were built.
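A possible form of the imputation call is sketched below; the data frame name train is a placeholder, the 50 iterations match the log printed below, and predictive mean matching is mice's default method for numeric columns with missing values. The exact call used in the analysis may differ.

```r
library(mice)

set.seed(123)
# single imputation, 50 iterations; pmm is applied to the numeric columns with NAs
imp <- mice(train, m = 1, maxit = 50, printFlag = TRUE)
train_imp <- complete(imp)  # data frame with the missing values filled in
```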
##
## iter imp variable
## 1 1 AGE INCOME YOJ HOME_VAL CAR_AGE
## 2 1 AGE INCOME YOJ HOME_VAL CAR_AGE
##   ...
## 50 1 AGE INCOME YOJ HOME_VAL CAR_AGE
In this step, several predictive models are built separately for the binary response and for the continuous response. In order to measure model performance and select the best one, 20% of the training data is held out and used for out-of-sample testing of the models. In order to reduce overfitting, the parameters of each model are estimated using 5-fold cross-validation repeated 5 times, using functions from the caret package.
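A minimal sketch of this setup is shown below; the object names (train_imp, model_train, model_test, ctrl) are placeholders rather than the names used in the original script.

```r
library(caret)

set.seed(42)
# 80/20 split stratified on the binary response
in_train    <- createDataPartition(train_imp$TARGET_FLAG, p = 0.8, list = FALSE)
model_train <- train_imp[in_train, ]
model_test  <- train_imp[-in_train, ]

# 5-fold cross-validation repeated 5 times, reused by all models below
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
```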
In this step, several logistic regression models are built to predict the TARGET_FLAG class assignment.
Regarding in-sample performance, model accuracy will be compared to the baseline of 73.6%, which would be achieved if the model assigned every observation to the most frequent class in the training data (TARGET_FLAG equal to 0).
The first model considered includes all of the predictors. While this model may overfit the data due to the issues highlighted in the data exploration step, it serves as a reference for the simpler models that follow, since the accuracy of a better model should not be significantly worse than that of the full model.
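A possible form of the full model fit with caret is sketched below; the exact call (in particular how the index and the other response are excluded) is an assumption.

```r
m1 <- train(
  TARGET_FLAG ~ . - TARGET_AMT - INDEX,  # all predictors, dropping index and the other response
  data      = model_train,
  method    = "glm",
  family    = "binomial",
  trControl = ctrl
)
summary(m1)  # coefficient table shown below
varImp(m1)   # absolute z-statistics, shown further below
```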
Model summary
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5547 -0.7167 -0.4086 0.6327 3.1330
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.640e-01 3.569e-01 -2.701 0.006909
## KIDSDRIV 4.197e-01 6.754e-02 6.214 5.16e-10
## HOMEKIDS 4.094e-02 4.064e-02 1.007 0.313712
## PARENT1Yes 3.791e-01 1.220e-01 3.107 0.001889
## MSTATUSz_No 4.745e-01 9.235e-02 5.138 2.78e-07
## SEXz_F -1.067e-01 1.244e-01 -0.858 0.391045
## EDUCATIONBachelors -3.252e-01 1.279e-01 -2.543 0.010984
## EDUCATIONMasters -1.317e-01 1.954e-01 -0.674 0.500373
## EDUCATIONPhD -1.127e-01 2.354e-01 -0.479 0.632142
## EDUCATIONz_High_School 1.003e-02 1.059e-01 0.095 0.924569
## JOBClerical 5.655e-01 2.185e-01 2.588 0.009657
## JOBDoctor -2.693e-01 2.937e-01 -0.917 0.359140
## JOBHome_Maker 3.107e-01 2.341e-01 1.327 0.184510
## JOBLawyer 1.411e-01 1.886e-01 0.748 0.454219
## JOBManager -4.364e-01 1.888e-01 -2.312 0.020767
## JOBProfessional 2.115e-01 1.984e-01 1.066 0.286463
## JOBStudent 2.864e-01 2.389e-01 1.199 0.230610
## JOBz_Blue_Collar 3.502e-01 2.064e-01 1.697 0.089662
## TRAVTIME 1.187e-02 2.078e-03 5.712 1.12e-08
## CAR_USEPrivate -7.872e-01 1.016e-01 -7.746 9.50e-15
## BLUEBOOK -2.284e-05 5.888e-06 -3.879 0.000105
## TIF -5.394e-02 8.122e-03 -6.640 3.13e-11
## CAR_TYPEPanel_Truck 5.596e-01 1.816e-01 3.082 0.002058
## CAR_TYPEPickup 6.499e-01 1.119e-01 5.806 6.39e-09
## CAR_TYPESports_Car 1.085e+00 1.436e-01 7.557 4.14e-14
## CAR_TYPEVan 5.900e-01 1.435e-01 4.110 3.95e-05
## CAR_TYPEz_SUV 8.606e-01 1.231e-01 6.991 2.73e-12
## RED_CARyes -5.820e-03 9.654e-02 -0.060 0.951933
## OLDCLAIM -1.470e-05 4.358e-06 -3.372 0.000745
## CLM_FREQ 2.034e-01 3.177e-02 6.401 1.54e-10
## REVOKEDYes 8.978e-01 1.003e-01 8.954 < 2e-16
## MVR_PTS 1.126e-01 1.530e-02 7.359 1.86e-13
## `URBANICITYz_Highly_Rural/ Rural` -2.249e+00 1.226e-01 -18.340 < 2e-16
## AGE -1.924e-03 4.472e-03 -0.430 0.667025
## INCOME -3.173e-06 1.202e-06 -2.639 0.008306
## YOJ -9.567e-03 9.386e-03 -1.019 0.308112
## HOME_VAL -1.117e-06 3.753e-07 -2.975 0.002929
## CAR_AGE -2.438e-03 8.050e-03 -0.303 0.761946
##
## (Intercept) **
## KIDSDRIV ***
## HOMEKIDS
## PARENT1Yes **
## MSTATUSz_No ***
## SEXz_F
## EDUCATIONBachelors *
## EDUCATIONMasters
## EDUCATIONPhD
## EDUCATIONz_High_School
## JOBClerical **
## JOBDoctor
## JOBHome_Maker
## JOBLawyer
## JOBManager *
## JOBProfessional
## JOBStudent
## JOBz_Blue_Collar .
## TRAVTIME ***
## CAR_USEPrivate ***
## BLUEBOOK ***
## TIF ***
## CAR_TYPEPanel_Truck **
## CAR_TYPEPickup ***
## CAR_TYPESports_Car ***
## CAR_TYPEVan ***
## CAR_TYPEz_SUV ***
## RED_CARyes
## OLDCLAIM ***
## CLM_FREQ ***
## REVOKEDYes ***
## MVR_PTS ***
## `URBANICITYz_Highly_Rural/ Rural` ***
## AGE
## INCOME **
## YOJ
## HOME_VAL **
## CAR_AGE
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7536.3 on 6529 degrees of freedom
## Residual deviance: 5888.9 on 6492 degrees of freedom
## AIC: 5964.9
##
## Number of Fisher Scoring iterations: 5
parameter | Accuracy | Kappa | AccuracySD | KappaSD |
---|---|---|---|---|
none | 0.7848392 | 0.3710277 | 0.0095793 | 0.0293947 |
## Overall
## KIDSDRIV 6.21419200
## HOMEKIDS 1.00746441
## PARENT1Yes 3.10719143
## MSTATUSz_No 5.13767520
## SEXz_F 0.85772311
## EDUCATIONBachelors 2.54320957
## EDUCATIONMasters 0.67390352
## EDUCATIONPhD 0.47871352
## EDUCATIONz_High_School 0.09467951
## JOBClerical 2.58786296
## JOBDoctor 0.91700582
## JOBHome_Maker 1.32699673
## JOBLawyer 0.74840014
## JOBManager 2.31219660
## JOBProfessional 1.06591262
## JOBStudent 1.19878826
## JOBz_Blue_Collar 1.69718443
## TRAVTIME 5.71210541
## CAR_USEPrivate 7.74582185
## BLUEBOOK 3.87861305
## TIF 6.64025253
## CAR_TYPEPanel_Truck 3.08170932
## CAR_TYPEPickup 5.80630969
## CAR_TYPESports_Car 7.55655253
## CAR_TYPEVan 4.11024849
## CAR_TYPEz_SUV 6.99116971
## RED_CARyes 0.06027962
## OLDCLAIM 3.37232971
## CLM_FREQ 6.40122610
## REVOKEDYes 8.95392742
## MVR_PTS 7.35853989
## `URBANICITYz_Highly_Rural/ Rural` 18.33972124
## AGE 0.43023411
## INCOME 2.63938146
## YOJ 1.01919148
## HOME_VAL 2.97510101
## CAR_AGE 0.30292573
From the model summary we can see that the most influential predictors are URBANICITY: z_Highly_Rural/ Rural (-), CAR_TYPE (effect strength highest for the Sports_Car type), REVOKED: Yes (+), CAR_USE: Private (-), MVR_PTS (+), and TRAVTIME (+).
Interpretation of the regression coefficients
Multiple assumptions listed in the dataset description are confirmed by the data at hand. A car crash is less likely for a driver who lives in a rural area, uses an expensive car only for private purposes (not for a job), and does not drive much. If this person has not had their license revoked in the past 7 years, works as a manager, lives in an expensive house, and is a parent, the chances of an accident decrease further.
On the other hand, a more likely accident participant is a person who frequently drives their sports car or SUV, lives in a city, has had their license revoked or has a high number of MVR points, and has already claimed accident insurance several times in the past. The chances increase further if they have teenage children who can drive the car as well.
The diagnostic plots for the model can be generated using the R code provided in the appendix.
For the second model, the following changes are made: the predictors that were not significant in the full model (AGE, YOJ, SEX, RED_CAR, and CAR_AGE) are removed, and the continuous predictors are centered and scaled.
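A possible form of this reduced fit is sketched below; it reuses the placeholder objects defined earlier, and the exact call is an assumption.

```r
m2 <- train(
  TARGET_FLAG ~ . - TARGET_AMT - INDEX - AGE - YOJ - SEX - RED_CAR - CAR_AGE,
  data       = model_train,
  method     = "glm",
  family     = "binomial",
  preProcess = c("center", "scale"),  # applied to the numeric predictors only
  trControl  = ctrl
)
```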
Model summary
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5580 -0.7171 -0.4087 0.6337 3.1319
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.407035 0.038304 -36.733 < 2e-16
## KIDSDRIV 0.213556 0.034074 6.267 3.67e-10
## HOMEKIDS 0.044324 0.041900 1.058 0.290125
## PARENT1Yes 0.130584 0.040985 3.186 0.001442
## MSTATUSz_No 0.236731 0.045017 5.259 1.45e-07
## EDUCATIONBachelors -0.149129 0.053968 -2.763 0.005722
## EDUCATIONMasters -0.060779 0.071246 -0.853 0.393607
## EDUCATIONPhD -0.038183 0.063695 -0.599 0.548867
## EDUCATIONz_High_School 0.003154 0.047735 0.066 0.947321
## JOBClerical 0.204379 0.079016 2.587 0.009694
## JOBDoctor -0.044751 0.049428 -0.905 0.365264
## JOBHome_Maker 0.091013 0.060578 1.502 0.132990
## JOBLawyer 0.041449 0.056926 0.728 0.466536
## JOBManager -0.143178 0.061783 -2.317 0.020481
## JOBProfessional 0.072877 0.068695 1.061 0.288743
## JOBStudent 0.092245 0.066118 1.395 0.162969
## JOBz_Blue_Collar 0.146933 0.086086 1.707 0.087856
## TRAVTIME 0.188691 0.033077 5.705 1.17e-08
## CAR_USEPrivate -0.380361 0.049035 -7.757 8.71e-15
## BLUEBOOK -0.214077 0.044449 -4.816 1.46e-06
## TIF -0.224648 0.033757 -6.655 2.84e-11
## CAR_TYPEPanel_Truck 0.173621 0.046802 3.710 0.000207
## CAR_TYPEPickup 0.245621 0.042209 5.819 5.91e-09
## CAR_TYPESports_Car 0.316079 0.037460 8.438 < 2e-16
## CAR_TYPEVan 0.180455 0.039724 4.543 5.55e-06
## CAR_TYPEz_SUV 0.354496 0.043156 8.214 < 2e-16
## OLDCLAIM -0.129724 0.038041 -3.410 0.000649
## CLM_FREQ 0.235571 0.036695 6.420 1.37e-10
## REVOKEDYes 0.296062 0.032993 8.973 < 2e-16
## MVR_PTS 0.241905 0.032633 7.413 1.24e-13
## `URBANICITYz_Highly_Rural/ Rural` -0.902502 0.049173 -18.354 < 2e-16
## INCOME -0.157709 0.056454 -2.794 0.005213
## HOME_VAL -0.146600 0.048473 -3.024 0.002492
##
## (Intercept) ***
## KIDSDRIV ***
## HOMEKIDS
## PARENT1Yes **
## MSTATUSz_No ***
## EDUCATIONBachelors **
## EDUCATIONMasters
## EDUCATIONPhD
## EDUCATIONz_High_School
## JOBClerical **
## JOBDoctor
## JOBHome_Maker
## JOBLawyer
## JOBManager *
## JOBProfessional
## JOBStudent
## JOBz_Blue_Collar .
## TRAVTIME ***
## CAR_USEPrivate ***
## BLUEBOOK ***
## TIF ***
## CAR_TYPEPanel_Truck ***
## CAR_TYPEPickup ***
## CAR_TYPESports_Car ***
## CAR_TYPEVan ***
## CAR_TYPEz_SUV ***
## OLDCLAIM ***
## CLM_FREQ ***
## REVOKEDYes ***
## MVR_PTS ***
## `URBANICITYz_Highly_Rural/ Rural` ***
## INCOME **
## HOME_VAL **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7536.3 on 6529 degrees of freedom
## Residual deviance: 5891.1 on 6497 degrees of freedom
## AIC: 5957.1
##
## Number of Fisher Scoring iterations: 5
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.7860639 0.3741955 0.009405976 0.02867255
## glm variable importance
##
## only 20 most important variables shown (out of 32)
##
## Overall
## `URBANICITYz_Highly_Rural/ Rural` 18.354
## REVOKEDYes 8.973
## CAR_TYPESports_Car 8.438
## CAR_TYPEz_SUV 8.214
## CAR_USEPrivate 7.757
## MVR_PTS 7.413
## TIF 6.655
## CLM_FREQ 6.420
## KIDSDRIV 6.267
## CAR_TYPEPickup 5.819
## TRAVTIME 5.705
## MSTATUSz_No 5.259
## BLUEBOOK 4.816
## CAR_TYPEVan 4.543
## CAR_TYPEPanel_Truck 3.710
## OLDCLAIM 3.410
## PARENT1Yes 3.186
## HOME_VAL 3.024
## INCOME 2.794
## EDUCATIONBachelors 2.763
We can see that in this reduced model the remaining predictors are largely significant, the accuracy has stayed essentially the same, and the AIC has decreased slightly (5957.1 vs. 5964.9 for the full model).
The interpretation of the coefficients has stayed the same as in the full model.
The third model is built using LASSO, a regularized regression approach from the glmnet package that fits a generalized linear model via penalized maximum likelihood. As in the second model, the continuous predictors are centered and scaled.
The penalty parameter is chosen automatically using cross-validation.
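A sketch of such a fit with caret and glmnet is shown below; alpha = 1 selects the lasso penalty, and the lambda grid is an illustrative assumption rather than the grid used in the original analysis.

```r
lasso_grid <- expand.grid(alpha = 1, lambda = 10^seq(-4, -1, length.out = 25))

m3 <- train(
  TARGET_FLAG ~ . - TARGET_AMT - INDEX,
  data       = model_train,
  method     = "glmnet",
  preProcess = c("center", "scale"),
  trControl  = ctrl,
  tuneGrid   = lasso_grid
)

# coefficients at the cross-validated penalty (reported in the table below)
coef(m3$finalModel, s = m3$bestTune$lambda)
```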
Model summary
alpha | lambda | Accuracy | Kappa | AccuracySD | KappaSD |
---|---|---|---|---|---|
1 | 0.0019113 | 0.7866165 | 0.3670499 | 0.0108713 | 0.0304967 |
## # A tibble: 38 x 2
## beta predictor
## <dbl> <chr>
## 1 -2.12 URBANICITYz_Highly_Rural/ Rural
## 2 0.867 CAR_TYPESports_Car
## 3 0.821 REVOKEDYes
## 4 -0.771 CAR_USEPrivate
## 5 0.664 CAR_TYPEz_SUV
## 6 -0.660 (Intercept)
## 7 -0.577 JOBManager
## 8 0.505 CAR_TYPEPickup
## 9 0.432 MSTATUSz_No
## 10 0.431 CAR_TYPEVan
## # ... with 28 more rows
We can see that the LASSO model selection has set the coefficients of the low-importance predictors to zero or nearly zero; in this way, LASSO performs automated variable selection. The low-importance predictors are similar to those identified in the models above: RED_CAR, YOJ, CAR_AGE, and AGE. However, predictors that are significant in the other models, such as INCOME, HOME_VAL, and BLUEBOOK, have also received a penalty.
The model accuracy (78.7%) is on par with that of the full model (78.5%).
The interpretation of the model coefficients for the most important variables remains the same as above.
In this section, two multiple linear regression models are built for the continuous response variable provided in the dataset, TARGET_AMT, the value of the insurance claim in case there was an accident.
Analogous to the previous section, the first model built is a full model with all the available predictors. The second model uses a reduced set of predictors, scaled and centered variables, and excludes outliers.
The performance of both models is compared on the same out-of-sample dataset and measured using adjusted R-squared and RMSE.
The first model built for continuous data is a full model with non-transformed predictors.
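A possible form of this fit, reusing the placeholder objects from the classification step, is sketched below; the exact call is an assumption.

```r
m1_cont <- train(
  TARGET_AMT ~ . - TARGET_FLAG - INDEX,  # all predictors, dropping index and the binary response
  data      = model_train,
  method    = "lm",
  trControl = ctrl
)
summary(m1_cont)  # coefficient table shown below
```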
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5304 -1692 -789 346 103654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.533e+02 6.372e+02 1.182 0.237168
## KIDSDRIV 3.362e+02 1.318e+02 2.550 0.010789
## HOMEKIDS 7.964e+01 7.541e+01 1.056 0.290997
## PARENT1Yes 4.664e+02 2.326e+02 2.005 0.044995
## MSTATUSz_No 5.749e+02 1.646e+02 3.492 0.000483
## SEXz_F -5.042e+02 2.132e+02 -2.365 0.018056
## EDUCATIONBachelors -1.366e+02 2.336e+02 -0.585 0.558689
## EDUCATIONMasters 1.196e+02 3.416e+02 0.350 0.726216
## EDUCATIONPhD 5.253e+02 4.073e+02 1.290 0.197160
## EDUCATIONz_High_School -3.873e+01 1.977e+02 -0.196 0.844705
## JOBClerical 5.962e+02 3.937e+02 1.514 0.130026
## JOBDoctor -5.573e+02 4.597e+02 -1.212 0.225437
## JOBHome_Maker 4.687e+02 4.200e+02 1.116 0.264468
## JOBLawyer 3.300e+02 3.407e+02 0.969 0.332722
## JOBManager -4.065e+02 3.301e+02 -1.232 0.218166
## JOBProfessional 6.563e+02 3.565e+02 1.841 0.065690
## JOBStudent 4.068e+02 4.307e+02 0.944 0.344973
## JOBz_Blue_Collar 6.113e+02 3.708e+02 1.649 0.099246
## TRAVTIME 1.186e+01 3.710e+00 3.198 0.001390
## CAR_USEPrivate -8.235e+02 1.897e+02 -4.340 1.44e-05
## BLUEBOOK 1.739e-02 9.964e-03 1.745 0.080986
## TIF -4.177e+01 1.398e+01 -2.987 0.002825
## CAR_TYPEPanel_Truck 2.730e+02 3.212e+02 0.850 0.395465
## CAR_TYPEPickup 4.387e+02 1.962e+02 2.236 0.025398
## CAR_TYPESports_Car 1.121e+03 2.521e+02 4.447 8.84e-06
## CAR_TYPEVan 6.023e+02 2.425e+02 2.483 0.013044
## CAR_TYPEz_SUV 8.603e+02 2.078e+02 4.140 3.51e-05
## RED_CARyes -2.651e+02 1.710e+02 -1.550 0.121169
## OLDCLAIM -1.191e-02 8.750e-03 -1.361 0.173635
## CLM_FREQ 1.391e+02 6.389e+01 2.178 0.029465
## REVOKEDYes 4.762e+02 2.024e+02 2.353 0.018656
## MVR_PTS 1.666e+02 2.975e+01 5.600 2.23e-08
## `URBANICITYz_Highly_Rural/ Rural` -1.656e+03 1.612e+02 -10.276 < 2e-16
## AGE 1.143e+01 8.162e+00 1.400 0.161480
## INCOME -3.970e-03 2.088e-03 -1.901 0.057335
## YOJ -2.144e+00 1.693e+01 -0.127 0.899238
## HOME_VAL -9.419e-04 6.694e-04 -1.407 0.159488
## CAR_AGE -3.468e+01 1.401e+01 -2.476 0.013302
##
## (Intercept)
## KIDSDRIV *
## HOMEKIDS
## PARENT1Yes *
## MSTATUSz_No ***
## SEXz_F *
## EDUCATIONBachelors
## EDUCATIONMasters
## EDUCATIONPhD
## EDUCATIONz_High_School
## JOBClerical
## JOBDoctor
## JOBHome_Maker
## JOBLawyer
## JOBManager
## JOBProfessional .
## JOBStudent
## JOBz_Blue_Collar .
## TRAVTIME **
## CAR_USEPrivate ***
## BLUEBOOK .
## TIF **
## CAR_TYPEPanel_Truck
## CAR_TYPEPickup *
## CAR_TYPESports_Car ***
## CAR_TYPEVan *
## CAR_TYPEz_SUV ***
## RED_CARyes
## OLDCLAIM
## CLM_FREQ *
## REVOKEDYes *
## MVR_PTS ***
## `URBANICITYz_Highly_Rural/ Rural` ***
## AGE
## INCOME .
## YOJ
## HOME_VAL
## CAR_AGE *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4667 on 6491 degrees of freedom
## Multiple R-squared: 0.06712, Adjusted R-squared: 0.0618
## F-statistic: 12.62 on 37 and 6491 DF, p-value: < 2.2e-16
intercept | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|
TRUE | 4647.431 | 0.0583734 | 2013.604 | 599.4799 | 0.0132851 | 80.03826 |
From the model summary we can see that while several predictors are highly significant, the errors are far from normal and the adjusted R-squared value is very low at 0.06.
The diagnostic plots show:
1) a severe violation of normality in the residuals for higher values of the response variable (beyond 2 standard deviations)
2) a number of high-influence observations that affect the model fit
Overall, the fit is poor due to:
1) low correlation between the individual predictors and the outcome
2) the severe skew in the response variable
3) the presence of extreme outliers
The second continuous model tries to alleviate the identified problems by:
1) excluding the outlier records
2) excluding the variables that are not correlated with the response
3) centering and scaling the remaining continuous predictors
4) applying a log-transformation to the response (a sketch of this setup follows the list)
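In the sketch below, the retained predictors are read off the summary output that follows; the outlier filter (a placeholder data frame name) and the "+ 1" offset in the log-transformation are assumptions.

```r
m2_cont <- train(
  log(TARGET_AMT + 1) ~ KIDSDRIV + PARENT1 + MSTATUS + SEX + JOB + TRAVTIME +
    CAR_USE + TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY +
    INCOME + CAR_AGE,
  data       = model_train_no_outliers,  # placeholder: training rows left after outlier removal
  method     = "lm",
  preProcess = c("center", "scale"),
  trControl  = ctrl
)
```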
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4387 -2.3645 -0.9325 2.2231 10.3605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.18234 0.04028 54.174 < 2e-16 ***
## KIDSDRIV 0.27176 0.04180 6.502 8.50e-11 ***
## PARENT1Yes 0.21993 0.04771 4.610 4.11e-06 ***
## MSTATUSz_No 0.37264 0.04676 7.969 1.88e-15 ***
## SEXz_F -0.12521 0.05921 -2.115 0.034501 *
## JOBClerical 0.29740 0.08516 3.492 0.000482 ***
## JOBDoctor -0.06065 0.05239 -1.158 0.247075
## JOBHome_Maker 0.16410 0.07257 2.261 0.023764 *
## JOBLawyer 0.05130 0.06981 0.735 0.462472
## JOBManager -0.22458 0.07020 -3.199 0.001386 **
## JOBProfessional 0.08289 0.07342 1.129 0.258906
## JOBStudent 0.23115 0.07354 3.143 0.001678 **
## JOBz_Blue_Collar 0.26908 0.08935 3.012 0.002609 **
## TRAVTIME 0.22905 0.04106 5.578 2.53e-08 ***
## CAR_USEPrivate -0.45599 0.06072 -7.510 6.71e-14 ***
## TIF -0.25567 0.04042 -6.325 2.70e-10 ***
## CAR_TYPEPanel_Truck 0.10666 0.05459 1.954 0.050783 .
## CAR_TYPEPickup 0.28875 0.05063 5.703 1.23e-08 ***
## CAR_TYPESports_Car 0.42895 0.05167 8.302 < 2e-16 ***
## CAR_TYPEVan 0.18453 0.04796 3.848 0.000120 ***
## CAR_TYPEz_SUV 0.49695 0.05992 8.293 < 2e-16 ***
## CLM_FREQ 0.22782 0.04533 5.026 5.14e-07 ***
## REVOKEDYes 0.33497 0.04069 8.232 < 2e-16 ***
## MVR_PTS 0.39157 0.04433 8.832 < 2e-16 ***
## `URBANICITYz_Highly_Rural/ Rural` -0.96849 0.04513 -21.462 < 2e-16 ***
## INCOME -0.30449 0.05752 -5.294 1.24e-07 ***
## CAR_AGE -0.11571 0.04954 -2.336 0.019531 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.254 on 6499 degrees of freedom
## Multiple R-squared: 0.217, Adjusted R-squared: 0.2139
## F-statistic: 69.29 on 26 and 6499 DF, p-value: < 2.2e-16
intercept | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|
TRUE | 3.262434 | 0.2105621 | 2.651201 | 0.0361453 | 0.0164339 | 0.0334493 |
From the model summary we can see that the log-transformation of the response has helped with the distribution of the residuals, and the adjusted R-squared has grown to 0.21. However, a noticeable pattern remains in the standardized residuals.
For the model selection step, the three models built in the previous section are evaluated on the 20% out-of-sample data not used in the model building process. The predicted class is assigned at >50% probability.
Then, the following classification performance metrics are compared: (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. The best performing model is the one with the highest F1 score and AUC values, as these metrics capture model sensitivity, specificity, and overall performance independent of the class cutoff threshold.
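A sketch of how these metrics can be computed from predicted probabilities on the held-out data is shown below; the function and object names are placeholders, not the original code.

```r
library(pROC)

classification_metrics <- function(truth, pred_prob, cutoff = 0.5) {
  pred_class <- factor(ifelse(pred_prob > cutoff, 1, 0), levels = c(0, 1))
  cm <- table(Prediction = pred_class, Reference = truth)  # confusion matrix as shown below
  tn <- cm[1, 1]; fn <- cm[1, 2]; fp <- cm[2, 1]; tp <- cm[2, 2]
  precision   <- tp / (tp + fp)
  sensitivity <- tp / (tp + fn)
  data.frame(
    accuracy         = (tp + tn) / sum(cm),
    class_error_rate = (fp + fn) / sum(cm),
    precision        = precision,
    sensitivity      = sensitivity,
    specificity      = tn / (tn + fp),
    f1_score         = 2 * precision * sensitivity / (precision + sensitivity),
    auc              = as.numeric(auc(roc(truth, pred_prob)))
  )
}
```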
The three confusion matrices are provided below.
Confusion matrix: Full model (model 1)
## Reference
## Prediction 0 1
## 0 1120 242
## 1 81 188
Confusion matrix: manually reduced model (model 2)
## Reference
## Prediction 0 1
## 0 1120 243
## 1 81 187
Confusion matrix: LASSO model (model 3)
## Reference
## Prediction 0 1
## 0 1128 254
## 1 73 176
Looking at the table comparing classification performance metrics across the models, provided below, we can see that the models perform extremely similarly. In practical terms, model m2 could therefore be considered the final model, as it is easier to understand due to its lower number of predictors.
accuracy | class_error_rate | precision | sensitivity | specificity | f1_score | auc | |
---|---|---|---|---|---|---|---|
m1 | 0.8019620 | 0.1980380 | 0.6988848 | 0.4372093 | 0.9325562 | 0.5379113 | 0.6848828 |
m2 | 0.8013489 | 0.1986511 | 0.6977612 | 0.4348837 | 0.9325562 | 0.5358166 | 0.6837200 |
m3 | 0.7995095 | 0.2004905 | 0.7068273 | 0.4093023 | 0.9392173 | 0.5184094 | 0.6742598 |
ROC Curves for the three models
Classification accuracy between the models can also be compared using ROC curves.
A comparison of the ROC curves shows that the models are indeed very similar in performance.
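A sketch of how such a comparison could be produced with the pROC package is shown below; pred_m1, pred_m2, and pred_m3 are placeholder names for the predicted crash probabilities on the held-out test set.

```r
library(pROC)

roc_m1 <- roc(model_test$TARGET_FLAG, pred_m1)
roc_m2 <- roc(model_test$TARGET_FLAG, pred_m2)
roc_m3 <- roc(model_test$TARGET_FLAG, pred_m3)

plot(roc_m1, col = "black")
lines(roc_m2, col = "blue")
lines(roc_m3, col = "red")
legend("bottomright", legend = c("m1 (full)", "m2 (reduced)", "m3 (LASSO)"),
       col = c("black", "blue", "red"), lwd = 2)
```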
Predictions on the evaluation dataset
Predictions on the evaluation dataset are made using the model m3.
The output of the model on the evaluated data is available under the following URL: model_m3_eval_predictions.csv
The performance of the continuous models is compared based on RMSE on the out-of-sample data.
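A sketch of this comparison is shown below; the object names are placeholders, and the back-transformation of model 2 (including the "- 1" offset matching the assumed "+ 1" used when fitting) is an assumption.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

pred_m1_cont <- predict(m1_cont, newdata = model_test)
# model 2 predicts on the log scale, so back-transform before computing RMSE
pred_m2_cont <- exp(predict(m2_cont, newdata = model_test)) - 1

rmse(model_test$TARGET_AMT, pred_m1_cont)
rmse(model_test$TARGET_AMT, pred_m2_cont)
```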
model | RMSE |
---|---|
model1 | 4034.550 |
model2 | 4427.868 |
The RMSE for the first (full) model is lower. From the charts below it is clear that the second model consistently produces very low values compared to the true results.
The initial full model is therefore selected to produce predictions on the evaluation data; however, further tuning could improve the precision of the predictions.
Predictions on the evaluation dataset
Predictions on the evaluation dataset are made using the model m1_cont.
The output of the model on the evaluated data is available under the following URL: model_m1_cont_eval_predictions.csv
The full R code for the analysis in Rmd format is available under the following URL: hw4_insurance_data.Rmd
https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html
https://www.statmethods.net/advgraphs/trellis.html
http://topepo.github.io/caret/visualizations.html
https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#lin
https://stackoverflow.com/questions/35247522/error-in-cross-validation-in-glmnet-package-r-for-binomial-target-variable