Management summary

This project is based on a dataset of an auto insurance company’s customers. It builds two predictive models that estimate a) the probability that a customer will have a car accident and b) the monetary amount of the insurance claim in the event of an accident.
After an initial variable inspection, three logistic regression models and two multiple linear regression models were prepared and compared on test data.
Based on classification performance metrics, the best model is suggested and applied to the evaluation dataset.

1. DATA EXPLORATION

The training dataset contains 8161 observations of 26 variables (one index, two response, and 23 predictor variables).
Each record (row) represents a set of attributes of an individual customer of the insurance company, related to their socio-demographic profile and the insured vehicle. The binary response variable TARGET_FLAG equals 1 if the customer’s car was in a crash, and 0 if not. The continuous response variable TARGET_AMT gives the cost related to the car crash if it happened.

The variables are:

1.1. Univariate analysis

Summaries for the individual variables (after some cleaning) are provided below.

##      INDEX       TARGET_FLAG   TARGET_AMT        KIDSDRIV     
##  Min.   :    1   0:6008      Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1:2153      1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133               Median :     0   Median :0.0000  
##  Mean   : 5152               Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745               3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302               Max.   :107586   Max.   :4.0000  
##                                                               
##       AGE           HOMEKIDS           YOJ           INCOME      
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Min.   :     0  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28097  
##  Median :45.00   Median :0.0000   Median :11.0   Median : 54028  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5   Mean   : 61898  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85986  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0   Max.   :367030  
##  NA's   :6                        NA's   :454    NA's   :445     
##  PARENT1       HOME_VAL      MSTATUS      SEX               EDUCATION   
##  No :7084   Min.   :     0   Yes :4894   M  :3786   <High_School :1203  
##  Yes:1077   1st Qu.:     0   z_No:3267   z_F:4375   Bachelors    :2242  
##             Median :161160                          Masters      :1658  
##             Mean   :154867                          PhD          : 728  
##             3rd Qu.:238724                          z_High_School:2330  
##             Max.   :885282                                              
##             NA's   :464                                                 
##             JOB          TRAVTIME            CAR_USE        BLUEBOOK    
##  z_Blue_Collar:1825   Min.   :  5.00   Commercial:3029   Min.   : 1500  
##  Clerical     :1271   1st Qu.: 22.00   Private   :5132   1st Qu.: 9280  
##  Professional :1117   Median : 33.00                     Median :14440  
##  Manager      : 988   Mean   : 33.49                     Mean   :15710  
##  Lawyer       : 835   3rd Qu.: 44.00                     3rd Qu.:20850  
##  Student      : 712   Max.   :142.00                     Max.   :69740  
##  (Other)      :1413                                                     
##       TIF                CAR_TYPE    RED_CAR       OLDCLAIM    
##  Min.   : 1.000   Minivan    :2145   no :5783   Min.   :    0  
##  1st Qu.: 1.000   Panel_Truck: 676   yes:2378   1st Qu.:    0  
##  Median : 4.000   Pickup     :1389              Median :    0  
##  Mean   : 5.351   Sports_Car : 907              Mean   : 4037  
##  3rd Qu.: 7.000   Van        : 750              3rd Qu.: 4636  
##  Max.   :25.000   z_SUV      :2294              Max.   :57037  
##                                                                
##     CLM_FREQ      REVOKED       MVR_PTS          CAR_AGE      
##  Min.   :0.0000   No :7161   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000              Median : 1.000   Median : 8.000  
##  Mean   :0.7986              Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000              3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000              Max.   :13.000   Max.   :28.000  
##                                               NA's   :510     
##                  URBANICITY  
##  Highly_Urban/ Urban  :6492  
##  z_Highly_Rural/ Rural:1669  
##                              
##                              
##                              
##                              
## 

From the summaries and the chart above we can see that several variables have missing data (AGE, YOJ, INCOME, HOME_VAL, and CAR_AGE), but the share of NAs is low: at most 510 of 8161 observations (about 6%) for any single variable.

Frequency counts of class occurrence for the discrete variables are provided below.

Histograms of the distributions of the remaining continuous variables are provided below.

We can see that, as expected based on the description above, the values of the variables are non-negative (the one exception in the summary is a minimum CAR_AGE of -3, which is presumably a data error). Non-negativity alone does not determine the distribution, but together with the right skew it suggests that a gamma distribution may be a reasonable assumption for the generating process, which impacts the choice of regression models later on. In addition, the variables KIDSDRIV, HOMEKIDS, OLDCLAIM, and HOME_VAL have a large share of observations that are equal to zero and do not match the rest of the distribution of the data.

A check for near-zero variance did not flag any variable.
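The report does not state how the near-zero variance check was done. As a minimal sketch, assuming a rule like the default of caret's `nearZeroVar` (flag a column when the ratio of the most common to the second most common value exceeds 95/5 and fewer than 10% of the values are distinct), the logic can be written as follows. It is shown in Python with the standard library only; the function name and the `mostly_zero` column are illustrative assumptions, while the REVOKED counts come from the summary above.

```python
from collections import Counter

def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if a column looks constant or nearly constant."""
    counts = Counter(values).most_common()
    if len(counts) < 2:
        return True  # a single distinct value means zero variance
    # Ratio of the most frequent value to the second most frequent value
    freq_ratio = counts[0][1] / counts[1][1]
    # Share of distinct values among all observations, in percent
    pct_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and pct_unique < unique_cut

# REVOKED level counts as shown in the summary: No 7161, Yes 1000
revoked = ["No"] * 7161 + ["Yes"] * 1000
# Hypothetical column that is almost always zero (not from the report)
mostly_zero = [0] * 990 + [1] * 10

print(near_zero_variance(revoked))      # frequency ratio is about 7.2
print(near_zero_variance(mostly_zero))  # frequency ratio is 99
```

Under these assumed cut-offs, REVOKED is not flagged even though it is the most unbalanced two-level factor in the summary.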

1.2. Bivariate analysis

The pairwise correlations between the continuous variables are displayed below.

This analysis shows generally weak correlations between the continuous predictors and the continuous response, as well as between individual predictors. However, several predictors do show moderately strong relationships:

  • The variables describing various aspects of income are positively correlated with each other and with age and years in the same job: INCOME, HOME_VAL (home value), BLUEBOOK (car value), AGE, YOJ (years in the same job)
  • HOMEKIDS (the number of children at home) is negatively correlated with age, income, and age of the car
  • CLM_FREQ (claim frequency in the past 5 years), OLDCLAIM (the total claimed amount), and MVR_PTS (the number of Motor Vehicle Record points) are weakly positively correlated with TARGET_AMT (the payout in case the car was in a crash)

These relationships and their connection to the target class are inspected in scatter plots provided in the appendix.

Pairwise relationship with the binary target variable

First we inspect the relationship between the binary outcome TARGET_FLAG and the continuous predictors using boxplots.

Analyzing the boxplots we can see that while the medians differ somewhat between the levels of the target variable, none of the predictors by itself appears to be particularly informative for the target. This is in line with the weak correlations with the continuous target variable discussed above.

Moving to discrete predictors, we can inspect the relative frequencies of occurrence of each factor level in conjunction with the target level (0 or 1) using mosaic plots.

From the inspection of the mosaic plots we can conclude that most of the discrete predictors carry information related to the level of the outcome variable, with the exception of SEX and RED_CAR. Based on the distribution of the data, neither the driver's sex nor having a red car appears to play a role in the probability of being in a car crash.

Summary of the findings

  1. The distributions of the continuous predictors resemble a gamma distribution, with some exceptions regarding high counts of zero values
  2. There is some collinearity between multiple continuous predictors
  3. There are also weak correlations between most continuous predictors and the continuous response
  4. Most continuous predictors carry little information regarding the binary response
  5. With the exception of SEX and RED_CAR, the discrete predictors appear highly relevant for predicting the binary outcome

2. DATA PREPROCESSING

2.1. Data cleaning

The variables encoded as strings in the input data representing dollar values, e.g. “$21,100” were converted to numeric variables.

In addition, spaces in factor variables’ levels were replaced by underscores in order to comply with the requirements of the models generating dummy variables from these factors.
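The two cleaning steps above can be sketched as follows. This is an illustration in Python with the standard library only, not the report's own R code; the function names and the raw level strings such as "Sports Car" are assumptions, while "$21,100" is the example quoted above.

```python
def parse_dollars(text):
    """Convert a string such as "$21,100" to a float; empty input becomes None."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None

def clean_level(level):
    """Replace spaces in a factor level with underscores."""
    return level.strip().replace(" ", "_")

print(parse_dollars("$21,100"))
print(clean_level("Sports Car"))   # hypothetical raw level
```

Returning `None` for empty strings keeps missing dollar values distinguishable from a genuine value of zero, which matters for variables such as HOME_VAL where zero is a frequent and meaningful value.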

2.2. Missing data and near-zero variance

As discussed above, several variables in the dataset have missing observations. However, none of the predictors show near-zero variance.

Instead of removing all rows with incomplete observations, the missing data are imputed using the predictive mean matching approach implemented in the mice package.
In addition, the continuous predictors were centered and scaled in several of the models that were built.

## 
##  iter imp variable
##   1   1  AGE  INCOME  YOJ  HOME_VAL  CAR_AGE
##   2   1  AGE  INCOME  YOJ  HOME_VAL  CAR_AGE
##   ... (iterations 3 to 49 are identical apart from the iteration counter)
##   50   1  AGE  INCOME  YOJ  HOME_VAL  CAR_AGE
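To make the idea of predictive mean matching concrete, here is a simplified sketch in Python (standard library only). It is not the mice implementation: it uses a single predictor, an ordinary least-squares fit, and no random draw of the regression parameters, all of which are simplifying assumptions. The function name and the toy data are hypothetical.

```python
import random

def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries of y by predictive mean matching on one predictor x."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(xi):
        return intercept + slope * xi

    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = predict(xi)
        # Donors: the k observed cases whose predicted value is closest
        donors = sorted(observed, key=lambda d: abs(predict(d[0]) - target))[:k]
        # The imputed value is the *observed* value of a randomly chosen donor
        completed.append(rng.choice(donors)[1])
    return completed

# Hypothetical toy data: age as predictor, years on the job with two gaps
age = [25, 30, 35, 40, 45, 50, 55, 60]
yoj = [2, 5, None, 10, 11, None, 14, 15]
filled = pmm_impute(age, yoj)
print(filled)
```

Because every imputed value is copied from a real observation, the method cannot produce impossible values (for example a negative number of years), which is one reason it suits skewed, non-negative variables like the ones imputed here.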

3. BUILD MODELS

In this step, several predictive models are built separately for the binary response and for the continuous response. In order to measure model performance and select the best one, 20% of the training data are held out and used for out-of-sample testing of the models. In order to limit overfitting, each model is fitted and assessed using 5-fold cross-validation repeated 5 times with the functions of the caret package.
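The resampling scheme can be sketched as follows in Python (standard library only). This is an illustration under stated assumptions, not the caret code used in the report: the split here is a plain random split, whereas caret's partitioning is stratified by outcome, so the exact row counts may differ slightly from those in the model output. The function names and seeds are hypothetical.

```python
import random

def holdout_split(n_rows, test_share=0.2, seed=1):
    """Randomly split row indices into a training part and a held-out part."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_share)
    return indices[n_test:], indices[:n_test]

def repeated_kfold(indices, k=5, repeats=5, seed=1):
    """Yield (fit, held_out) index lists for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Every k-th element forms one fold, so fold sizes differ by at most 1
        folds = [shuffled[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            fit = [i for i in shuffled if i not in held]
            yield fit, held_out

train_idx, test_idx = holdout_split(8161)
resamples = list(repeated_kfold(train_idx))
print(len(train_idx), len(test_idx), len(resamples))
```

With 5 folds and 5 repeats, each model is fitted 25 times on roughly 80% of the training rows; the mean and standard deviation of the 25 accuracies correspond to the Accuracy and AccuracySD columns reported for each model.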

3.1. Build models for the binary response

In this step, several logistic regression models are built to predict the TARGET_FLAG class assignment.

Regarding in-sample performance, the model accuracy will be compared to the baseline of 73.6% which would occur if the model assigned each observation to the most frequent class in the training data (when TARGET_FLAG equals 0).
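The baseline follows directly from the TARGET_FLAG counts in the univariate summary (6008 zeros and 2153 ones). A short Python check, with variable names chosen for illustration:

```python
# TARGET_FLAG class counts as shown in the univariate summary
class_counts = {"0": 6008, "1": 2153}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
# Accuracy of a model that always predicts the majority class
baseline_accuracy = class_counts[majority_class] / total

print(f"class {majority_class}: {baseline_accuracy:.1%}")
```

Any useful classifier has to beat this no-information rate, which is why accuracy alone is a weak criterion on a dataset where roughly three quarters of the customers have no crash.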

3.1.1. Full model

The first model considered is the model with all of the predictors. While this model may overfit the data due to the issues highlighted in the data exploration step, it is a useful reference for the simpler models that follow (the accuracy of a better model should not be significantly worse than that of the full model).

Model summary

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5547  -0.7167  -0.4086   0.6327   3.1330  
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       -9.640e-01  3.569e-01  -2.701 0.006909
## KIDSDRIV                           4.197e-01  6.754e-02   6.214 5.16e-10
## HOMEKIDS                           4.094e-02  4.064e-02   1.007 0.313712
## PARENT1Yes                         3.791e-01  1.220e-01   3.107 0.001889
## MSTATUSz_No                        4.745e-01  9.235e-02   5.138 2.78e-07
## SEXz_F                            -1.067e-01  1.244e-01  -0.858 0.391045
## EDUCATIONBachelors                -3.252e-01  1.279e-01  -2.543 0.010984
## EDUCATIONMasters                  -1.317e-01  1.954e-01  -0.674 0.500373
## EDUCATIONPhD                      -1.127e-01  2.354e-01  -0.479 0.632142
## EDUCATIONz_High_School             1.003e-02  1.059e-01   0.095 0.924569
## JOBClerical                        5.655e-01  2.185e-01   2.588 0.009657
## JOBDoctor                         -2.693e-01  2.937e-01  -0.917 0.359140
## JOBHome_Maker                      3.107e-01  2.341e-01   1.327 0.184510
## JOBLawyer                          1.411e-01  1.886e-01   0.748 0.454219
## JOBManager                        -4.364e-01  1.888e-01  -2.312 0.020767
## JOBProfessional                    2.115e-01  1.984e-01   1.066 0.286463
## JOBStudent                         2.864e-01  2.389e-01   1.199 0.230610
## JOBz_Blue_Collar                   3.502e-01  2.064e-01   1.697 0.089662
## TRAVTIME                           1.187e-02  2.078e-03   5.712 1.12e-08
## CAR_USEPrivate                    -7.872e-01  1.016e-01  -7.746 9.50e-15
## BLUEBOOK                          -2.284e-05  5.888e-06  -3.879 0.000105
## TIF                               -5.394e-02  8.122e-03  -6.640 3.13e-11
## CAR_TYPEPanel_Truck                5.596e-01  1.816e-01   3.082 0.002058
## CAR_TYPEPickup                     6.499e-01  1.119e-01   5.806 6.39e-09
## CAR_TYPESports_Car                 1.085e+00  1.436e-01   7.557 4.14e-14
## CAR_TYPEVan                        5.900e-01  1.435e-01   4.110 3.95e-05
## CAR_TYPEz_SUV                      8.606e-01  1.231e-01   6.991 2.73e-12
## RED_CARyes                        -5.820e-03  9.654e-02  -0.060 0.951933
## OLDCLAIM                          -1.470e-05  4.358e-06  -3.372 0.000745
## CLM_FREQ                           2.034e-01  3.177e-02   6.401 1.54e-10
## REVOKEDYes                         8.978e-01  1.003e-01   8.954  < 2e-16
## MVR_PTS                            1.126e-01  1.530e-02   7.359 1.86e-13
## `URBANICITYz_Highly_Rural/ Rural` -2.249e+00  1.226e-01 -18.340  < 2e-16
## AGE                               -1.924e-03  4.472e-03  -0.430 0.667025
## INCOME                            -3.173e-06  1.202e-06  -2.639 0.008306
## YOJ                               -9.567e-03  9.386e-03  -1.019 0.308112
## HOME_VAL                          -1.117e-06  3.753e-07  -2.975 0.002929
## CAR_AGE                           -2.438e-03  8.050e-03  -0.303 0.761946
##                                      
## (Intercept)                       ** 
## KIDSDRIV                          ***
## HOMEKIDS                             
## PARENT1Yes                        ** 
## MSTATUSz_No                       ***
## SEXz_F                               
## EDUCATIONBachelors                *  
## EDUCATIONMasters                     
## EDUCATIONPhD                         
## EDUCATIONz_High_School               
## JOBClerical                       ** 
## JOBDoctor                            
## JOBHome_Maker                        
## JOBLawyer                            
## JOBManager                        *  
## JOBProfessional                      
## JOBStudent                           
## JOBz_Blue_Collar                  .  
## TRAVTIME                          ***
## CAR_USEPrivate                    ***
## BLUEBOOK                          ***
## TIF                               ***
## CAR_TYPEPanel_Truck               ** 
## CAR_TYPEPickup                    ***
## CAR_TYPESports_Car                ***
## CAR_TYPEVan                       ***
## CAR_TYPEz_SUV                     ***
## RED_CARyes                           
## OLDCLAIM                          ***
## CLM_FREQ                          ***
## REVOKEDYes                        ***
## MVR_PTS                           ***
## `URBANICITYz_Highly_Rural/ Rural` ***
## AGE                                  
## INCOME                            ** 
## YOJ                                  
## HOME_VAL                          ** 
## CAR_AGE                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7536.3  on 6529  degrees of freedom
## Residual deviance: 5888.9  on 6492  degrees of freedom
## AIC: 5964.9
## 
## Number of Fisher Scoring iterations: 5
parameter   Accuracy    Kappa       AccuracySD   KappaSD
none        0.7848392   0.3710277   0.0095793    0.0293947
##                                       Overall
## KIDSDRIV                           6.21419200
## HOMEKIDS                           1.00746441
## PARENT1Yes                         3.10719143
## MSTATUSz_No                        5.13767520
## SEXz_F                             0.85772311
## EDUCATIONBachelors                 2.54320957
## EDUCATIONMasters                   0.67390352
## EDUCATIONPhD                       0.47871352
## EDUCATIONz_High_School             0.09467951
## JOBClerical                        2.58786296
## JOBDoctor                          0.91700582
## JOBHome_Maker                      1.32699673
## JOBLawyer                          0.74840014
## JOBManager                         2.31219660
## JOBProfessional                    1.06591262
## JOBStudent                         1.19878826
## JOBz_Blue_Collar                   1.69718443
## TRAVTIME                           5.71210541
## CAR_USEPrivate                     7.74582185
## BLUEBOOK                           3.87861305
## TIF                                6.64025253
## CAR_TYPEPanel_Truck                3.08170932
## CAR_TYPEPickup                     5.80630969
## CAR_TYPESports_Car                 7.55655253
## CAR_TYPEVan                        4.11024849
## CAR_TYPEz_SUV                      6.99116971
## RED_CARyes                         0.06027962
## OLDCLAIM                           3.37232971
## CLM_FREQ                           6.40122610
## REVOKEDYes                         8.95392742
## MVR_PTS                            7.35853989
## `URBANICITYz_Highly_Rural/ Rural` 18.33972124
## AGE                                0.43023411
## INCOME                             2.63938146
## YOJ                                1.01919148
## HOME_VAL                           2.97510101
## CAR_AGE                            0.30292573

From the model summary we can see the following:

  1. The cross-validated model accuracy is somewhat better than the baseline: approx. 78.5% vs. 73.6%. Given that this is a full model with high flexibility due to the dummy variables generated for each factor level, model bias is unlikely to be the reason. This suggests that the predictors do not carry enough information on the response for high-accuracy predictions.
  2. The deviance residuals are skewed (median -0.41, range -2.55 to 3.13). Deviance residuals of a binary-response model are not expected to be normally distributed, so this is weak evidence by itself, but it may indicate residual structure in the data not captured by the model.
  3. Based on the z-statistic of the parameters, the most important predictors are (ordered by descending importance, sign in brackets means direction of the relationship): URBANICITY: z_Highly_Rural/ Rural(-), REVOKED: Yes(+), CAR_USE: Private(-), CAR_TYPE(+, effect strength highest for the Sports Car type), MVR_PTS(+), TIF(-), CLM_FREQ(+), KIDSDRIV(+), and TRAVTIME(+).
  4. The predictors related to income, education and parent status are also significant, but have a weaker effect on the response.
  5. The continuous predictors related to driver’s and car age are not significant, just as the discrete factors gender and having a red car.
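The z value and p-value columns of the summary can be reproduced from the estimate and its standard error. The sketch below, in Python with the standard library only, uses the KIDSDRIV and RED_CARyes rows of the full model; the function name is an illustrative assumption.

```python
import math

def wald_test(estimate, std_error):
    """Return the Wald z-statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Estimates and standard errors copied from the full-model summary
z_kids, p_kids = wald_test(4.197e-01, 6.754e-02)    # KIDSDRIV
z_red, p_red = wald_test(-5.820e-03, 9.654e-02)     # RED_CARyes

print(f"KIDSDRIV:   z = {z_kids:.3f}, p = {p_kids:.2e}")
print(f"RED_CARyes: z = {z_red:.3f}, p = {p_red:.3f}")
```

The absolute z value is also what the variable-importance listing reports in its Overall column, so ranking predictors by importance here is the same as ranking them by the strength of evidence against a zero coefficient.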

Interpretation of the regression coefficients

Multiple assumptions listed in the dataset description are confirmed by the data at hand. A car crash is less likely for a driver who lives in a rural area, drives an expensive car for private purposes only (not as a job), and has a short commute. If this person has not had their license revoked in the past 7 years, is a manager, lives in an expensive house, is married (MSTATUS), and is not flagged as PARENT1, the chances of an accident decrease further.
On the other hand, an accident is more likely for a person who drives a sports car or SUV, has a long commute, lives in a city, has had their license revoked or has a high number of MVR points, and has already filed several insurance claims in the past. The chances increase further if they have teenage children who also drive the car.
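Because the model is a logistic regression, each coefficient is a change in log-odds, and exponentiating it gives the multiplicative change in the odds of a crash. A short Python illustration using four estimates from the full-model summary (the dictionary is just a convenient container for the example):

```python
import math

# Log-odds coefficients copied from the full-model summary
coefficients = {
    "REVOKEDYes": 0.8978,
    "CAR_TYPESports_Car": 1.085,
    "KIDSDRIV": 0.4197,
    "URBANICITYz_Highly_Rural/ Rural": -2.249,
}

# exp(beta) is the factor by which the odds change, all else being equal
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.2f}")
```

For example, a revoked license multiplies the odds of a crash by about 2.45, while living in a rural area multiplies them by about 0.11, holding the other predictors fixed. Note that these are odds ratios, not ratios of probabilities.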

The diagnostic plots for the model can be generated using the R code provided in the appendix.

3.1.2. Reduced model 1 (manual variable selection and transformed predictors)

For the second model, the following changes are made:

  • The predictors that were not significant in the full model are excluded: AGE, YOJ, SEX, RED_CAR, CAR_AGE
  • The remaining continuous predictors are centered and scaled
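Centering and scaling means subtracting the mean of each predictor and dividing by its standard deviation. A minimal Python sketch (standard library only); the function name is an assumption and the input is a toy vector built from the TRAVTIME quartiles in the summary, not the full column:

```python
import statistics

def center_and_scale(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# Toy input: min, quartiles and max of TRAVTIME from the summary
travtime = [5, 22, 33, 44, 142]
scaled = center_and_scale(travtime)

print([round(v, 3) for v in scaled])
```

After this transformation a coefficient describes the effect of a one-standard-deviation change in the predictor, which makes coefficient magnitudes comparable across predictors measured in very different units (for example dollars versus minutes).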

Model summary

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5580  -0.7171  -0.4087   0.6337   3.1319  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       -1.407035   0.038304 -36.733  < 2e-16
## KIDSDRIV                           0.213556   0.034074   6.267 3.67e-10
## HOMEKIDS                           0.044324   0.041900   1.058 0.290125
## PARENT1Yes                         0.130584   0.040985   3.186 0.001442
## MSTATUSz_No                        0.236731   0.045017   5.259 1.45e-07
## EDUCATIONBachelors                -0.149129   0.053968  -2.763 0.005722
## EDUCATIONMasters                  -0.060779   0.071246  -0.853 0.393607
## EDUCATIONPhD                      -0.038183   0.063695  -0.599 0.548867
## EDUCATIONz_High_School             0.003154   0.047735   0.066 0.947321
## JOBClerical                        0.204379   0.079016   2.587 0.009694
## JOBDoctor                         -0.044751   0.049428  -0.905 0.365264
## JOBHome_Maker                      0.091013   0.060578   1.502 0.132990
## JOBLawyer                          0.041449   0.056926   0.728 0.466536
## JOBManager                        -0.143178   0.061783  -2.317 0.020481
## JOBProfessional                    0.072877   0.068695   1.061 0.288743
## JOBStudent                         0.092245   0.066118   1.395 0.162969
## JOBz_Blue_Collar                   0.146933   0.086086   1.707 0.087856
## TRAVTIME                           0.188691   0.033077   5.705 1.17e-08
## CAR_USEPrivate                    -0.380361   0.049035  -7.757 8.71e-15
## BLUEBOOK                          -0.214077   0.044449  -4.816 1.46e-06
## TIF                               -0.224648   0.033757  -6.655 2.84e-11
## CAR_TYPEPanel_Truck                0.173621   0.046802   3.710 0.000207
## CAR_TYPEPickup                     0.245621   0.042209   5.819 5.91e-09
## CAR_TYPESports_Car                 0.316079   0.037460   8.438  < 2e-16
## CAR_TYPEVan                        0.180455   0.039724   4.543 5.55e-06
## CAR_TYPEz_SUV                      0.354496   0.043156   8.214  < 2e-16
## OLDCLAIM                          -0.129724   0.038041  -3.410 0.000649
## CLM_FREQ                           0.235571   0.036695   6.420 1.37e-10
## REVOKEDYes                         0.296062   0.032993   8.973  < 2e-16
## MVR_PTS                            0.241905   0.032633   7.413 1.24e-13
## `URBANICITYz_Highly_Rural/ Rural` -0.902502   0.049173 -18.354  < 2e-16
## INCOME                            -0.157709   0.056454  -2.794 0.005213
## HOME_VAL                          -0.146600   0.048473  -3.024 0.002492
##                                      
## (Intercept)                       ***
## KIDSDRIV                          ***
## HOMEKIDS                             
## PARENT1Yes                        ** 
## MSTATUSz_No                       ***
## EDUCATIONBachelors                ** 
## EDUCATIONMasters                     
## EDUCATIONPhD                         
## EDUCATIONz_High_School               
## JOBClerical                       ** 
## JOBDoctor                            
## JOBHome_Maker                        
## JOBLawyer                            
## JOBManager                        *  
## JOBProfessional                      
## JOBStudent                           
## JOBz_Blue_Collar                  .  
## TRAVTIME                          ***
## CAR_USEPrivate                    ***
## BLUEBOOK                          ***
## TIF                               ***
## CAR_TYPEPanel_Truck               ***
## CAR_TYPEPickup                    ***
## CAR_TYPESports_Car                ***
## CAR_TYPEVan                       ***
## CAR_TYPEz_SUV                     ***
## OLDCLAIM                          ***
## CLM_FREQ                          ***
## REVOKEDYes                        ***
## MVR_PTS                           ***
## `URBANICITYz_Highly_Rural/ Rural` ***
## INCOME                            ** 
## HOME_VAL                          ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7536.3  on 6529  degrees of freedom
## Residual deviance: 5891.1  on 6497  degrees of freedom
## AIC: 5957.1
## 
## Number of Fisher Scoring iterations: 5
##   parameter  Accuracy     Kappa  AccuracySD    KappaSD
## 1      none 0.7860639 0.3741955 0.009405976 0.02867255
## glm variable importance
## 
##   only 20 most important variables shown (out of 32)
## 
##                                   Overall
## `URBANICITYz_Highly_Rural/ Rural`  18.354
## REVOKEDYes                          8.973
## CAR_TYPESports_Car                  8.438
## CAR_TYPEz_SUV                       8.214
## CAR_USEPrivate                      7.757
## MVR_PTS                             7.413
## TIF                                 6.655
## CLM_FREQ                            6.420
## KIDSDRIV                            6.267
## CAR_TYPEPickup                      5.819
## TRAVTIME                            5.705
## MSTATUSz_No                         5.259
## BLUEBOOK                            4.816
## CAR_TYPEVan                         4.543
## CAR_TYPEPanel_Truck                 3.710
## OLDCLAIM                            3.410
## PARENT1Yes                          3.186
## HOME_VAL                            3.024
## INCOME                              2.794
## EDUCATIONBachelors                  2.763

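We can see that in this reduced model most of the retained predictors are significant (HOMEKIDS and several EDUCATION and JOB levels remain non-significant), the cross-validated accuracy is essentially unchanged (78.6% vs. 78.5%), and the AIC has decreased slightly, from 5964.9 to 5957.1.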
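The AIC values in the two summaries can be reproduced from the residual deviance and the number of estimated parameters. For a 0/1 response the residual deviance equals minus twice the log-likelihood, so AIC = deviance + 2k. The parameter counts below are derived from the degrees of freedom shown in the output; the function name is illustrative.

```python
def aic(residual_deviance, n_parameters):
    """AIC for a binary-response GLM, where deviance = -2 * log-likelihood."""
    return residual_deviance + 2 * n_parameters

# Parameters = null df - residual df + 1 (the intercept)
full_k = 6529 - 6492 + 1      # full model
reduced_k = 6529 - 6497 + 1   # reduced model 1

full_aic = aic(5888.9, full_k)
reduced_aic = aic(5891.1, reduced_k)

print(full_k, round(full_aic, 1))
print(reduced_k, round(reduced_aic, 1))
```

Dropping five parameters raises the deviance by only 2.2 while saving 10 units of penalty, so the reduced model has the lower AIC of the two.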

The interpretation of the coefficients stays the same as in the full model: the signs are unchanged, although the magnitudes now refer to the centered and scaled predictors.

3.1.3. Reduced model 2 (LASSO Model)

The third model is built using LASSO, a regularized regression approach from the glmnet package that fits a generalized linear model via penalized maximum likelihood. As in the second model, the continuous predictors are centered and scaled.
The penalty parameter is chosen automatically using cross-validation.
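Why an L1 penalty sets some coefficients exactly to zero can be seen from the soft-thresholding operator, which is the building block of coordinate-wise LASSO updates. The Python sketch below shows only this operator for a single standardized predictor in a least-squares setting; it is a simplification and not the full penalized logistic fit performed by glmnet. The function name and the input values are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink a value towards zero by `penalty`; small values become exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

# Hypothetical unpenalized estimates and a hypothetical penalty
penalty = 0.1
for estimate in (0.5, -0.5, 0.05, -0.02):
    print(estimate, "->", round(soft_threshold(estimate, penalty), 3))
```

Estimates larger than the penalty in absolute value are shrunk by a constant amount, while those smaller than the penalty are removed from the model altogether. This is the mechanism behind the automated variable selection, and it also explains why strong predictors have somewhat smaller coefficients than in the unpenalized models.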

Model summary

     alpha   lambda      Accuracy    Kappa       AccuracySD   KappaSD
8    1       0.0019113   0.7866165   0.3670499   0.0108713    0.0304967
## # A tibble: 38 x 2
##      beta predictor                      
##     <dbl> <chr>                          
##  1 -2.12  URBANICITYz_Highly_Rural/ Rural
##  2  0.867 CAR_TYPESports_Car             
##  3  0.821 REVOKEDYes                     
##  4 -0.771 CAR_USEPrivate                 
##  5  0.664 CAR_TYPEz_SUV                  
##  6 -0.660 (Intercept)                    
##  7 -0.577 JOBManager                     
##  8  0.505 CAR_TYPEPickup                 
##  9  0.432 MSTATUSz_No                    
## 10  0.431 CAR_TYPEVan                    
## # ... with 28 more rows

We can see that the LASSO model has set the coefficients of the low-importance predictors to zero or nearly zero. In this way, LASSO performs automated variable selection. The low-importance predictors are similar to those identified in the models above: RED_CAR, YOJ, CAR_AGE, AGE. However, predictors that are significant in the other models have also been shrunk: INCOME, HOME_VAL, BLUEBOOK.
The model accuracy is on par with the full model: 78.7% vs. 78.5%.

The interpretation of the model coefficients for the most important variables remains the same as above.

3.2. Build models for the continuous response variable

In this section, two multiple linear regression models are built for the continuous response variable provided in the dataset: TARGET_AMT, the value of the insurance claim in the case of an accident.

Analogous to the previous section, the first model built is a full model with all the available predictors. The second model uses a reduced set of predictors, scaled and centered variables, and excludes outliers.

The performance of both models is compared on the same out-of-sample dataset and measured based on the adjusted R-squared and RMSE.
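Both metrics can be computed from the actual and predicted values as sketched below in Python (standard library only). The function names and the toy claim amounts are hypothetical and are not results from the report.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of the predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))

def adjusted_r_squared(actual, predicted, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1 - ss_res / ss_tot
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)

# Hypothetical claim amounts and predictions for six customers
actual = [0, 0, 1500, 4000, 0, 2500]
predicted = [500, 200, 1000, 3000, 300, 2000]

print(round(rmse(actual, predicted), 2))
print(round(adjusted_r_squared(actual, predicted, n_predictors=2), 4))
```

RMSE is expressed in the units of TARGET_AMT (dollars) and is dominated by large errors, while adjusted R-squared is unit-free and allows a fairer comparison between the full model and a model with fewer predictors.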

3.2.1. Continuous model 1 (full model)

The first model built for the continuous response is a full model with non-transformed predictors.

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5304  -1692   -789    346 103654 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                        7.533e+02  6.372e+02   1.182 0.237168
## KIDSDRIV                           3.362e+02  1.318e+02   2.550 0.010789
## HOMEKIDS                           7.964e+01  7.541e+01   1.056 0.290997
## PARENT1Yes                         4.664e+02  2.326e+02   2.005 0.044995
## MSTATUSz_No                        5.749e+02  1.646e+02   3.492 0.000483
## SEXz_F                            -5.042e+02  2.132e+02  -2.365 0.018056
## EDUCATIONBachelors                -1.366e+02  2.336e+02  -0.585 0.558689
## EDUCATIONMasters                   1.196e+02  3.416e+02   0.350 0.726216
## EDUCATIONPhD                       5.253e+02  4.073e+02   1.290 0.197160
## EDUCATIONz_High_School            -3.873e+01  1.977e+02  -0.196 0.844705
## JOBClerical                        5.962e+02  3.937e+02   1.514 0.130026
## JOBDoctor                         -5.573e+02  4.597e+02  -1.212 0.225437
## JOBHome_Maker                      4.687e+02  4.200e+02   1.116 0.264468
## JOBLawyer                          3.300e+02  3.407e+02   0.969 0.332722
## JOBManager                        -4.065e+02  3.301e+02  -1.232 0.218166
## JOBProfessional                    6.563e+02  3.565e+02   1.841 0.065690
## JOBStudent                         4.068e+02  4.307e+02   0.944 0.344973
## JOBz_Blue_Collar                   6.113e+02  3.708e+02   1.649 0.099246
## TRAVTIME                           1.186e+01  3.710e+00   3.198 0.001390
## CAR_USEPrivate                    -8.235e+02  1.897e+02  -4.340 1.44e-05
## BLUEBOOK                           1.739e-02  9.964e-03   1.745 0.080986
## TIF                               -4.177e+01  1.398e+01  -2.987 0.002825
## CAR_TYPEPanel_Truck                2.730e+02  3.212e+02   0.850 0.395465
## CAR_TYPEPickup                     4.387e+02  1.962e+02   2.236 0.025398
## CAR_TYPESports_Car                 1.121e+03  2.521e+02   4.447 8.84e-06
## CAR_TYPEVan                        6.023e+02  2.425e+02   2.483 0.013044
## CAR_TYPEz_SUV                      8.603e+02  2.078e+02   4.140 3.51e-05
## RED_CARyes                        -2.651e+02  1.710e+02  -1.550 0.121169
## OLDCLAIM                          -1.191e-02  8.750e-03  -1.361 0.173635
## CLM_FREQ                           1.391e+02  6.389e+01   2.178 0.029465
## REVOKEDYes                         4.762e+02  2.024e+02   2.353 0.018656
## MVR_PTS                            1.666e+02  2.975e+01   5.600 2.23e-08
## `URBANICITYz_Highly_Rural/ Rural` -1.656e+03  1.612e+02 -10.276  < 2e-16
## AGE                                1.143e+01  8.162e+00   1.400 0.161480
## INCOME                            -3.970e-03  2.088e-03  -1.901 0.057335
## YOJ                               -2.144e+00  1.693e+01  -0.127 0.899238
## HOME_VAL                          -9.419e-04  6.694e-04  -1.407 0.159488
## CAR_AGE                           -3.468e+01  1.401e+01  -2.476 0.013302
##                                      
## (Intercept)                          
## KIDSDRIV                          *  
## HOMEKIDS                             
## PARENT1Yes                        *  
## MSTATUSz_No                       ***
## SEXz_F                            *  
## EDUCATIONBachelors                   
## EDUCATIONMasters                     
## EDUCATIONPhD                         
## EDUCATIONz_High_School               
## JOBClerical                          
## JOBDoctor                            
## JOBHome_Maker                        
## JOBLawyer                            
## JOBManager                           
## JOBProfessional                   .  
## JOBStudent                           
## JOBz_Blue_Collar                  .  
## TRAVTIME                          ** 
## CAR_USEPrivate                    ***
## BLUEBOOK                          .  
## TIF                               ** 
## CAR_TYPEPanel_Truck                  
## CAR_TYPEPickup                    *  
## CAR_TYPESports_Car                ***
## CAR_TYPEVan                       *  
## CAR_TYPEz_SUV                     ***
## RED_CARyes                           
## OLDCLAIM                             
## CLM_FREQ                          *  
## REVOKEDYes                        *  
## MVR_PTS                           ***
## `URBANICITYz_Highly_Rural/ Rural` ***
## AGE                                  
## INCOME                            .  
## YOJ                                  
## HOME_VAL                             
## CAR_AGE                           *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4667 on 6491 degrees of freedom
## Multiple R-squared:  0.06712,    Adjusted R-squared:  0.0618 
## F-statistic: 12.62 on 37 and 6491 DF,  p-value: < 2.2e-16

intercept  RMSE      Rsquared   MAE       RMSESD    RsquaredSD  MAESD
TRUE       4647.431  0.0583734  2013.604  599.4799  0.0132851   80.03826

From the model summary we can see that, while several predictors are highly significant, the residuals are far from normal (they range from -5304 to 103654 around a median of -789), and the Adjusted R-squared value is very low at 0.06.
The diagnostic plots show:
1) a very severe violation of normality in the residuals for the higher values of the response variable (beyond 2 standard deviations)
2) a number of high-impact residuals that are affecting the model

Overall, the fit is poor due to
1) low correlation between the individual predictors and the outcome
2) the severe skew in the response variable
3) presence of extreme outliers

3.2.2 Continuous Model 2 (reduced model)

The second continuous model tries to alleviate the identified problems by:
1) excluding the outlier records
2) excluding the variables that are not correlated with the response
3) centering and scaling the remaining continuous predictors
4) applying a log-transformation on the response
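
Steps 1, 3, and 4 of the list above could look roughly like the following base R sketch. It rests on several assumptions that are not stated in this report: the toy data are made up (only the column names mirror the dataset), an outlier is taken to be a TARGET_AMT above the 99th percentile, and the log-transformation is written as log(1 + x) because TARGET_AMT contains zeros. Variable selection (step 2) is omitted.

```r
# Hypothetical toy data; column names mirror the report, values are made up.
set.seed(42)
n <- 500
toy <- data.frame(
  TARGET_AMT = ifelse(runif(n) < 0.7, 0, rlnorm(n, meanlog = 8, sdlog = 1)),
  TRAVTIME   = rnorm(n, mean = 33, sd = 15),
  INCOME     = rlnorm(n, meanlog = 11, sdlog = 0.6)
)

# 1) Drop outlier records.
#    Assumption: "outlier" means TARGET_AMT above the 99th percentile.
cutoff  <- quantile(toy$TARGET_AMT, 0.99)
reduced <- toy[toy$TARGET_AMT <= cutoff, ]

# 3) Center and scale the continuous predictors (mean 0, sd 1).
num_cols <- c("TRAVTIME", "INCOME")
reduced[num_cols] <- lapply(reduced[num_cols],
                            function(x) (x - mean(x)) / sd(x))

# 4) Log-transform the response.
#    Assumption: log1p() is used so that zero claims stay defined.
reduced$LOG_AMT <- log1p(reduced$TARGET_AMT)

fit_reduced <- lm(LOG_AMT ~ TRAVTIME + INCOME, data = reduced)
```

Because the predictors are standardized, each coefficient of such a model describes the change in the log response per one standard deviation of the predictor, which makes the coefficients comparable with each other.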

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4387 -2.3645 -0.9325  2.2231 10.3605 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        2.18234    0.04028  54.174  < 2e-16 ***
## KIDSDRIV                           0.27176    0.04180   6.502 8.50e-11 ***
## PARENT1Yes                         0.21993    0.04771   4.610 4.11e-06 ***
## MSTATUSz_No                        0.37264    0.04676   7.969 1.88e-15 ***
## SEXz_F                            -0.12521    0.05921  -2.115 0.034501 *  
## JOBClerical                        0.29740    0.08516   3.492 0.000482 ***
## JOBDoctor                         -0.06065    0.05239  -1.158 0.247075    
## JOBHome_Maker                      0.16410    0.07257   2.261 0.023764 *  
## JOBLawyer                          0.05130    0.06981   0.735 0.462472    
## JOBManager                        -0.22458    0.07020  -3.199 0.001386 ** 
## JOBProfessional                    0.08289    0.07342   1.129 0.258906    
## JOBStudent                         0.23115    0.07354   3.143 0.001678 ** 
## JOBz_Blue_Collar                   0.26908    0.08935   3.012 0.002609 ** 
## TRAVTIME                           0.22905    0.04106   5.578 2.53e-08 ***
## CAR_USEPrivate                    -0.45599    0.06072  -7.510 6.71e-14 ***
## TIF                               -0.25567    0.04042  -6.325 2.70e-10 ***
## CAR_TYPEPanel_Truck                0.10666    0.05459   1.954 0.050783 .  
## CAR_TYPEPickup                     0.28875    0.05063   5.703 1.23e-08 ***
## CAR_TYPESports_Car                 0.42895    0.05167   8.302  < 2e-16 ***
## CAR_TYPEVan                        0.18453    0.04796   3.848 0.000120 ***
## CAR_TYPEz_SUV                      0.49695    0.05992   8.293  < 2e-16 ***
## CLM_FREQ                           0.22782    0.04533   5.026 5.14e-07 ***
## REVOKEDYes                         0.33497    0.04069   8.232  < 2e-16 ***
## MVR_PTS                            0.39157    0.04433   8.832  < 2e-16 ***
## `URBANICITYz_Highly_Rural/ Rural` -0.96849    0.04513 -21.462  < 2e-16 ***
## INCOME                            -0.30449    0.05752  -5.294 1.24e-07 ***
## CAR_AGE                           -0.11571    0.04954  -2.336 0.019531 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.254 on 6499 degrees of freedom
## Multiple R-squared:  0.217,  Adjusted R-squared:  0.2139 
## F-statistic: 69.29 on 26 and 6499 DF,  p-value: < 2.2e-16

intercept  RMSE      Rsquared   MAE       RMSESD     RsquaredSD  MAESD
TRUE       3.262434  0.2105621  2.651201  0.0361453  0.0164339   0.0334493

From the model summary we can see that the log-transformation of the response has improved the distribution of the residuals, and the Adjusted R-squared metric has grown to 0.21. Note that this value is measured on the log scale, so it is not directly comparable with the 0.06 of the first model. However, a clear non-random pattern still remains in the standardized residuals.

4. MODEL SELECTION

4.1. Model selection for the binary response model

For the model selection step, the three models built in the previous section are evaluated on the 20% out-of-sample data that was not used in the model building process. A record is assigned to the positive class (crash) when its predicted probability exceeds 50%.
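
The evaluation procedure described above can be sketched in base R as follows. The data, the data-generating process, and the object names are hypothetical; only the 20% hold-out share and the 0.5 cutoff come from the text.

```r
# Hypothetical toy data; column names mirror the report, values are made up.
set.seed(7)
n <- 1000
toy <- data.frame(MVR_PTS = rpois(n, 2), TRAVTIME = rnorm(n, 33, 15))
# Assumed data-generating process for the toy crash flag.
p_true <- plogis(-2 + 0.3 * toy$MVR_PTS + 0.02 * toy$TRAVTIME)
toy$TARGET_FLAG <- rbinom(n, 1, p_true)

# Hold out 20% of the rows; fit on the remaining 80%.
test_idx <- sample(seq_len(n), size = 0.2 * n)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit <- glm(TARGET_FLAG ~ ., data = train, family = binomial)

# Predicted crash probabilities on the held-out rows.
prob <- predict(fit, newdata = test, type = "response")

# Assign class 1 when the predicted probability exceeds 0.5.
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
ref  <- factor(test$TARGET_FLAG, levels = c(0, 1))

# Confusion matrix in the same layout as the matrices in this section.
conf_mat <- table(Prediction = pred, Reference = ref)
conf_mat
```

Fixing the factor levels to `c(0, 1)` keeps the table 2 x 2 even if a model never predicts the positive class.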

Then, the following classification performance metrics are compared: (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. The best-performing model is the one with the highest F1 score and AUC values. The F1 score balances precision and sensitivity at the chosen cutoff, while the AUC summarizes the trade-off between sensitivity and specificity across cutoff thresholds.

The three confusion matrices are provided below.

Confusion matrix: Full model (model 1)

##           Reference
## Prediction    0    1
##          0 1120  242
##          1   81  188

Confusion matrix: manually reduced model (model 2)

##           Reference
## Prediction    0    1
##          0 1120  243
##          1   81  187

Confusion matrix: stepwise selection model (model 3)

##           Reference
## Prediction    0    1
##          0 1128  254
##          1   73  176

The table below, which compares the classification performance metrics, shows that the three models perform almost identically, with m1 marginally ahead on F1 score and AUC. In practical terms, model m2 could therefore still be considered as the final model, as it is easier to interpret thanks to its smaller number of predictors.

model  accuracy   class_error_rate  precision  sensitivity  specificity  f1_score   auc
m1     0.8019620  0.1980380         0.6988848  0.4372093    0.9325562    0.5379113  0.6848828
m2     0.8013489  0.1986511         0.6977612  0.4348837    0.9325562    0.5358166  0.6837200
m3     0.7995095  0.2004905         0.7068273  0.4093023    0.9392173    0.5184094  0.6742598
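
As a cross-check, the m1 row of the table above can be recomputed from the model 1 confusion matrix. The sketch below uses only the four counts printed in that matrix; the function name `class_metrics` is hypothetical and is not taken from the original analysis code.

```r
# Counts taken from the model 1 confusion matrix in this section
# (rows = prediction, columns = reference).
tn <- 1120; fn <- 242
fp <- 81;   tp <- 188

class_metrics <- function(tp, fp, fn, tn) {
  accuracy    <- (tp + tn) / (tp + fp + fn + tn)
  precision   <- tp / (tp + fp)   # share of predicted crashes that are real
  sensitivity <- tp / (tp + fn)   # share of real crashes that are detected
  specificity <- tn / (tn + fp)   # share of non-crashes correctly rejected
  c(accuracy          = accuracy,
    class_error_rate  = 1 - accuracy,
    precision         = precision,
    sensitivity       = sensitivity,
    specificity       = specificity,
    f1_score          = 2 * precision * sensitivity / (precision + sensitivity),
    balanced_accuracy = (sensitivity + specificity) / 2)
}

m1 <- class_metrics(tp, fp, fn, tn)
round(m1, 7)
```

The mean of sensitivity and specificity computed here (0.6848828) coincides with the auc value reported for m1, and the same holds for m2 and m3. This suggests, though the report does not confirm it, that the auc column was computed from the predicted classes rather than from the predicted probabilities.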

ROC Curves for the three models

The classification performance of the models can also be compared using ROC curves.

A comparison of the ROC curves shows that the models are indeed very similar in performance.

Predictions on the evaluation dataset

Predictions on the evaluation dataset are made using the model m3.

The output of the model on the evaluated data is available under the following URL: model_m3_eval_predictions.csv

4.2. Model selection for the continuous response model

The performance of the continuous models is compared based on the RMSE on the out-of-sample data.

model RMSE
model1 4034.550
model2 4427.868

The RMSE of the first (full) model is lower. The charts below show that model 2 consistently predicts values far below the true amounts.

So the initial full model is selected for now to produce predictions on the evaluation data. However, further tuning could improve the accuracy of the predictions.
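
To compare a log-scale model with an untransformed one on RMSE, the log-scale predictions have to be converted back to the original scale first. The sketch below illustrates this on hypothetical synthetic data, assuming a log(1 + x) transformation with `expm1()` as its inverse. It also shows one possible contributor to low predictions from a log-scale model: a plain back-transformation tends to underestimate the mean of a skewed response. Whether this explains the behaviour of model 2 in this report is not established here.

```r
# Hypothetical synthetic data; this is NOT the insurance dataset.
set.seed(3)
n <- 400
x <- rnorm(n)
# Assumed skewed, strictly positive response.
y <- exp(7 + 0.5 * x + rnorm(n, sd = 1))

fit_raw <- lm(y ~ x)          # model on the original scale
fit_log <- lm(log1p(y) ~ x)   # model on the log(1 + y) scale

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# In-sample predictions are used for brevity.
pred_raw <- fitted(fit_raw)
# Back-transform the log-scale predictions to the original scale.
pred_log <- expm1(fitted(fit_log))

# Both RMSE values are now expressed in the units of y.
rmse_raw <- rmse(y, pred_raw)
rmse_log <- rmse(y, pred_log)

# The back-transformed predictions average below the observed mean.
c(mean_observed = mean(y), mean_backtransformed = mean(pred_log))
```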

Predictions on the evaluation dataset

Predictions on the evaluation dataset are made using the model m1_cont.

The output of the model on the evaluated data is available under the following URL: model_m1_cont_eval_predictions.csv

Appendix

The full R code for the analysis in Rmd format is available under the following URL: hw4_insurance_data.Rmd
