DATA 621 Homework #4

Introduction

We have been given a dataset with 8161 records representing customers of an auto insurance company. Each record has two response variables. The first objective is to train a logistic regression classifier to predict whether a person was in a car crash. The response variable here is TARGET_FLAG, representing whether a person had an accident (1) or did not have one (0).

The second objective is to train a regression model to predict the cost of a crash, if one occurred. The second response variable, TARGET_AMT, is the amount it will cost if the person crashes their car. The value is zero if the person did not crash their car.

Looking at the distribution of the TARGET_AMT variable, we can see that it is considerably right-skewed. Thus, a log transform might be best here.
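As a minimal sketch of what such a transform could look like, the snippet below applies log1p() to a handful of illustrative claim amounts. The vector and the choice of log1p() over log() are assumptions made here so that zero-cost records remain defined; the report does not say which variant would be used.

```r
# Illustrative claim amounts only; this vector is a stand-in for TARGET_AMT
target_amt <- c(0, 0, 1036, 4102, 5689, 22951, 107586)

# log1p(x) computes log(1 + x), so records with no claim stay at zero
log_amt <- log1p(target_amt)

# Predictions made on the log scale can be mapped back with expm1()
round_trip <- expm1(log_amt)
```

The transformed values span a far narrower range than the raw amounts, which is the point of applying it to a long-tailed variable.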

Data Preparation & Exploration

We will first look at the summary statistics for the data:

INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE URBANICITY
Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000 Min. :16.00 Min. :0.0000 Min. : 0.0 $0 : 615 No :7084 $0 :2294 Yes :4894 M :3786 <High School :1203 z_Blue Collar:1825 Min. : 5.00 Commercial:3029 $1,500 : 157 Min. : 1.000 Minivan :2145 no :5783 $0 :5009 Min. :0.0000 No :7161 Min. : 0.000 Min. :-3.000 Highly Urban/ Urban :6492
1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 : 445 Yes:1077 : 464 z_No:3267 z_F:4375 Bachelors :2242 Clerical :1271 1st Qu.: 22.00 Private :5132 $6,000 : 34 1st Qu.: 1.000 Panel Truck: 676 yes:2378 $1,310 : 4 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.000 z_Highly Rural/ Rural:1669
Median : 5133 Median :0.0000 Median : 0 Median :0.0000 Median :45.00 Median :0.0000 Median :11.0 $26,840 : 4 NA $111,129: 3 NA NA Masters :1658 Professional :1117 Median : 33.00 NA $5,800 : 33 Median : 4.000 Pickup :1389 NA $1,391 : 4 Median :0.0000 NA Median : 1.000 Median : 8.000 NA
Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711 Mean :44.79 Mean :0.7212 Mean :10.5 $48,509 : 4 NA $115,249: 3 NA NA PhD : 728 Manager : 988 Mean : 33.49 NA $6,200 : 33 Mean : 5.351 Sports Car : 907 NA $4,263 : 4 Mean :0.7986 NA Mean : 1.696 Mean : 8.328 NA
3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 $61,790 : 4 NA $123,109: 3 NA NA z_High School:2330 Lawyer : 835 3rd Qu.: 44.00 NA $6,400 : 31 3rd Qu.: 7.000 Van : 750 NA $1,105 : 3 3rd Qu.:2.0000 NA 3rd Qu.: 3.000 3rd Qu.:12.000 NA
Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000 Max. :81.00 Max. :5.0000 Max. :23.0 $107,375: 3 NA $153,061: 3 NA NA NA Student : 712 Max. :142.00 NA $5,900 : 30 Max. :25.000 z_SUV :2294 NA $1,332 : 3 Max. :5.0000 NA Max. :13.000 Max. :28.000 NA
NA NA NA NA NA’s :6 NA NA’s :454 (Other) :7086 NA (Other) :5391 NA NA NA (Other) :1413 NA NA (Other):7843 NA NA NA (Other):3134 NA NA NA NA’s :510 NA

There are some missing values that we will need to deal with. There are also some values that seem invalid (e.g., a CAR_AGE of -3).

Fix Data Types

Some variables that hold dollar values are currently stored as factors and need to be converted to numeric variables. We will do this for both the df and evaluation data frames. Invalid values will also be changed to NAs.

Now that we have corrected the invalid variables, we can look at a summary of the data:
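One way this conversion might be done is sketched below. The helper name dollars_to_numeric and the sample values are hypothetical; the only assumptions are that the dollar columns look like "$26,840", that blank strings mean missing, and that a negative car age is invalid.

```r
# Hypothetical helper: strip "$" and "," and convert the result to numeric
dollars_to_numeric <- function(x) {
  x <- as.character(x)     # a factor must become character before parsing
  x[x == ""] <- NA         # blank entries are treated as missing
  as.numeric(gsub("[$,]", "", x))
}

income <- factor(c("$26,840", "$0", "", "$107,375"))
income_num <- dollars_to_numeric(income)

# Invalid values become NA; here a negative car age is considered invalid
car_age <- c(-3, 1, 8, 28)
car_age[car_age < 0] <- NA
```

Applying the same function to both data frames keeps the training and evaluation columns in the same type.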

INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE URBANICITY
Min. : 1 0:6008 Min. : 0 Min. :0.0000 Min. :16.00 Min. :0.0000 Min. : 0.0 Min. : 0 No :7084 Min. : 0 Yes :4894 M :3786 <High School :1203 z_Blue Collar:1825 Min. : 5.00 Commercial:3029 Min. : 1500 Min. : 1.000 Minivan :2145 no :5783 Min. : 0 Min. :0.0000 No :7161 Min. : 0.000 Min. : 0.00 Highly Urban/ Urban :6492
1st Qu.: 2559 1:2153 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28097 Yes:1077 1st Qu.: 0 z_No:3267 z_F:4375 Bachelors :2242 Clerical :1271 1st Qu.: 22.00 Private :5132 1st Qu.: 9280 1st Qu.: 1.000 Panel Truck: 676 yes:2378 1st Qu.: 0 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.00 z_Highly Rural/ Rural:1669
Median : 5133 NA Median : 0 Median :0.0000 Median :45.00 Median :0.0000 Median :11.0 Median : 54028 NA Median :161160 NA NA Masters :1658 Professional :1117 Median : 33.00 NA Median :14440 Median : 4.000 Pickup :1389 NA Median : 0 Median :0.0000 NA Median : 1.000 Median : 8.00 NA
Mean : 5152 NA Mean : 1504 Mean :0.1711 Mean :44.79 Mean :0.7212 Mean :10.5 Mean : 61898 NA Mean :154867 NA NA PhD : 728 Manager : 988 Mean : 33.49 NA Mean :15710 Mean : 5.351 Sports Car : 907 NA Mean : 4037 Mean :0.7986 NA Mean : 1.696 Mean : 8.33 NA
3rd Qu.: 7745 NA 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85986 NA 3rd Qu.:238724 NA NA z_High School:2330 Lawyer : 835 3rd Qu.: 44.00 NA 3rd Qu.:20850 3rd Qu.: 7.000 Van : 750 NA 3rd Qu.: 4636 3rd Qu.:2.0000 NA 3rd Qu.: 3.000 3rd Qu.:12.00 NA
Max. :10302 NA Max. :107586 Max. :4.0000 Max. :81.00 Max. :5.0000 Max. :23.0 Max. :367030 NA Max. :885282 NA NA NA Student : 712 Max. :142.00 NA Max. :69740 Max. :25.000 z_SUV :2294 NA Max. :57037 Max. :5.0000 NA Max. :13.000 Max. :28.00 NA
NA NA NA NA NA’s :6 NA NA’s :454 NA’s :445 NA NA’s :464 NA NA NA (Other) :1413 NA NA NA NA NA NA NA NA NA NA NA’s :511 NA

Fix Missing Values

There are 511 observations where the CAR_AGE variable is missing, 454 observations where the variable YOJ is missing, 6 observations where the variable AGE is missing, 445 observations where the variable INCOME is missing, and 464 observations where the variable HOME_VAL is missing. In total, 1714 observations, or 21%, have at least one missing variable. We will fill in the missing data with the median value.
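Median imputation of this kind can be sketched as follows. The helper impute_median and the toy data frame are illustrative, not the report's own code.

```r
# Hypothetical helper: replace NAs in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data frame standing in for the real one
df <- data.frame(AGE = c(39, NA, 45, 51), YOJ = c(9, 11, NA, 13))

# Apply the helper to every column that has missing values
cols <- c("AGE", "YOJ")
df[cols] <- lapply(df[cols], impute_median)
```

The median is less sensitive than the mean to the long right tails seen in variables such as INCOME and HOME_VAL, which is a common reason for preferring it here.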

Creating Training/Test Data Sets

For Classifier Model

Now that we have a complete data set we will split the data into a training (train) and test set (test) for the classifier. We’ll use a 70-30 split between train and test, respectively.
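A simple random 70-30 split can be sketched with base R as below. The seed is arbitrary, and this is an assumption about the mechanics: a stratified splitter would give slightly different counts, which may be why the report's training set has 5714 rows rather than the 5713 produced here.

```r
set.seed(621)                      # arbitrary seed for reproducibility

n <- 8161                          # number of records in the data set
train_idx <- sample(seq_len(n), size = round(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# With a data frame df: train <- df[train_idx, ]; test <- df[test_idx, ]
```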

There are 1508 out of 5714 records in the training data set that have been in an accident. We want to correct the imbalance in the dataset by oversampling this group so our classifier will do a better job identifying this minority group.

The oversampled data frame has 8412 records and is now balanced, with 4206 records in each class.
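Random oversampling with replacement is one common way to balance classes. The sketch below assumes that approach and uses a toy data frame, since the report does not show how over_sample_train was built.

```r
set.seed(621)

# Toy training set: 6 records without an accident, 2 with
train <- data.frame(
  TARGET_FLAG = factor(c(rep(0, 6), rep(1, 2))),
  TRAVTIME    = c(22, 33, 44, 18, 51, 27, 60, 35)
)

minority <- train[train$TARGET_FLAG == "1", ]
majority <- train[train$TARGET_FLAG == "0", ]

# Draw extra minority rows (with replacement) until the classes match
n_extra <- nrow(majority) - nrow(minority)
extra   <- minority[sample(nrow(minority), n_extra, replace = TRUE), ]

over_sample_train <- rbind(majority, minority, extra)
table(over_sample_train$TARGET_FLAG)
```

Only the training set is oversampled; the test set keeps its natural class balance so that evaluation reflects real prevalence.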

For Linear Regression Model

Since the linear regression model’s goal is to predict the cost of the claim, we will subset the 8161 records to those involved in an accident. We will split these 2153 records into training and test sets using the same 70-30 split.

There are 1509 out of 2153 records in the training data set.

Exploratory Data Analysis

In exploring the data we are looking for two things. We are looking for variables that will help us divide the data into those who have been in an accident and those who have not. We are also looking for variables that are linearly correlated with the claim amount so they may be used as predictors for the linear regression model. We will be looking at the training sets for both model types.

First we will examine the numeric variables found in the oversampled classification data set:

There are a few variables that seem to differ between the two groups. For example, CLM_FREQ might be useful in differentiating them. Many variables show little difference between the groups, which is somewhat understandable.

Next we will look at the categorical variables in the oversampled classification data set. These charts look to see if a variable can be used to distinguish those who have had an accident (blue) from those who haven’t (red).

The URBAN_DRIVER is much more likely to be in an accident than the rural driver. The YOUNG and those with a PRIOR_ACCIDENT are also more likely to be in an accident. There doesn’t seem to be much evidence for a difference between the sexes.

Now let’s focus on the relationship between these variables and the size of the claim. As a reminder this data only includes those who have been in an accident. For starters we will look at the distribution of the claims for those in an accident.

As was previously noted, this distribution has a long tail. The mean payout is $5689 and the median is $4102. The median and mean are higher, of course, for those observations we classified as outliers. The outlier cutoff point is $10594.
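The report does not state how the cutoff was derived. A common convention, assumed here purely for illustration, is the boxplot rule of the third quartile plus 1.5 times the interquartile range. The claim values below are made up.

```r
# Made-up claim amounts with one very large value
claims <- c(1200, 2500, 3900, 4100, 4600, 5800, 6300, 45000)

# Assumed rule: cutoff = Q3 + 1.5 * IQR (not confirmed by the report)
q      <- quantile(claims, c(0.25, 0.75), names = FALSE)
cutoff <- q[2] + 1.5 * (q[2] - q[1])

# 1 marks an outlier, 0 a typical claim
target_amt_outlier <- as.integer(claims > cutoff)
```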

TARGET_AMT_OUTLIER Mean Median
No 4032.048 3893.00
Yes 27402.149 22951.25

Now we look at the data for the linear regression model. We are looking for good predictors of the claim amount. We will look at scatter and correlation plots of the numeric variables:

KIDSDRIV 0.0061500
AGE 0.0150957
HOMEKIDS 0.0171601
YOJ 0.0300531
INCOME 0.0404098
HOME_VAL 0.0442609
TRAVTIME 0.0120292
BLUEBOOK 0.1257628
TIF -0.0180912
OLDCLAIM 0.0124373
CLM_FREQ -0.0026814
MVR_PTS 0.0425935
CAR_AGE -0.0040321
LOG_INCOME 0.0384403
LOG_HOME_VAL 0.0289741
AVG_CLAIM 0.0310000
TARGET_AMT_OUTLIER 0.7901975

These two views highlight that there isn’t a strong correlation between most of the predictors and the claim amount. The only strong one is the outlier flag that we created. Let’s finish our exploration by looking at the categorical variables. We are removing the TARGET_AMT_OUTLIER observations from the plots to increase their readability.
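A correlation listing like the one above could be produced along these lines. The data are synthetic, and the column names are reused only for readability.

```r
set.seed(621)
n <- 200

# Synthetic stand-in for the claims training set
amt <- data.frame(
  BLUEBOOK = runif(n, 1500, 70000),
  TRAVTIME = runif(n, 5, 140)
)
amt$TARGET_AMT <- 4000 + 0.1 * amt$BLUEBOOK + rnorm(n, sd = 7500)

# Pearson correlation of each numeric predictor with the claim amount
predictors <- setdiff(names(amt), "TARGET_AMT")
cors <- sapply(amt[predictors], cor, y = amt$TARGET_AMT)
cors
```

Pearson correlation only captures linear association, so a low value does not rule out a nonlinear relationship.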

These boxplots show that there isn’t really much difference in the claim amounts between these groups.

Summary of Findings

Through this analysis we have found that there is a lot of noise associated with this signal. When it comes to the classification task, most variables offer a similar picture. This is probably due to the random nature of vehicle accidents. There are a few variables that may be helpful in building our models.

The claim prediction task will definitely be more of a challenge. Little is correlated with the claim amounts. The model fit will not be as strong as that of the classification models.

Model Creation & Evaluation

Now that we have created datasets and explored them, it is time to build some predictive models and evaluate them.

For the classification models we will look at the prediction accuracy on the test set, especially the accuracy at identifying those who are in an accident. Accurate prediction of this minority class indicates a good model.

Classification Model

Baseline Model

We will create a simple model to serve as the baseline. This model posits that a driver’s history is a good representation of their future. So drivers that have previously been in an accident (OLDCLAIM > 0 or AVG_CLAIM > 0) are more likely to be in an accident. Conversely, those who haven’t aren’t likely to be in one in the future.


Call:
glm(formula = TARGET_FLAG ~ PRIOR_ACCIDENT, family = binomial(link = "logit"), 
    data = over_sample_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.45262  -0.97194  -0.02338   0.92517   1.39782  

Coefficients:
                Estimate Std. Error z value            Pr(>|z|)    
(Intercept)     -0.50462    0.03031  -16.65 <0.0000000000000002 ***
PRIOR_ACCIDENT1  1.13171    0.04567   24.78 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11662  on 8411  degrees of freedom
Residual deviance: 11022  on 8410  degrees of freedom
AIC: 11026

Number of Fisher Scoring iterations: 4
F1 = 0.4659588 
R2 = 0.05485262 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1219  272
         1  583  373
                                             
               Accuracy : 0.6506             
                 95% CI : (0.6313, 0.6695)   
    No Information Rate : 0.7364             
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.2206             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.5783             
            Specificity : 0.6765             
         Pos Pred Value : 0.3902             
         Neg Pred Value : 0.8176             
             Prevalence : 0.2636             
         Detection Rate : 0.1524             
   Detection Prevalence : 0.3907             
      Balanced Accuracy : 0.6274             
                                             
       'Positive' Class : 1                  
                                             

This simple model has statistically significant predictors. Applying it to the test data set gives a 65.1% accuracy rate. It correctly identified 57.8% of the people with accidents and 67.6% of those without. It is better than flipping a coin but not by much. All future models must outperform this baseline.
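The reported metrics can be reproduced by hand from the confusion matrix and deviances above. In the sketch below, the F1 formula is the standard one; treating R2 as one minus the ratio of residual to null deviance is an assumption, although it agrees with the reported value to about four decimal places.

```r
# Counts from the baseline confusion matrix (positive class = 1)
tp <- 373; fp <- 583; fn <- 272; tn <- 1219

precision <- tp / (tp + fp)        # Pos Pred Value
recall    <- tp / (tp + fn)        # Sensitivity
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / (tp + fp + fn + tn)

# Assumed definition: deviance-based pseudo R-squared
r2 <- 1 - 11022 / 11662
```

Because the deviances printed in the summary are rounded, the recomputed R2 differs slightly from the reported 0.05485262.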

Risk Taker Model

For this model we are going to assume that risk takers are more likely to be in an accident. To identify those who tend to be risk takers, we are going with the urban legend that those who drive red sports cars are risk takers. Young men (16-24) also tend to be risk takers.


Call:
glm(formula = TARGET_FLAG ~ RED_SPORTS_CAR + YOUNG_MALE, family = binomial(link = "logit"), 
    data = over_sample_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3791  -1.1759  -0.0938   1.1789   1.1789  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)     -0.003599   0.021907  -0.164    0.869
RED_SPORTS_CAR1  0.308981   0.352893   0.876    0.381
YOUNG_MALE1      0.466223   0.310351   1.502    0.133

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11662  on 8411  degrees of freedom
Residual deviance: 11658  on 8409  degrees of freedom
AIC: 11664

Number of Fisher Scoring iterations: 3
F1 = 0.03927492 
R2 = 0.0002641343 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1798  632
         1    4   13
                                             
               Accuracy : 0.7401             
                 95% CI : (0.7222, 0.7574)   
    No Information Rate : 0.7364             
    P-Value [Acc > NIR] : 0.3494             
                                             
                  Kappa : 0.0261             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.020155           
            Specificity : 0.997780           
         Pos Pred Value : 0.764706           
         Neg Pred Value : 0.739918           
             Prevalence : 0.263588           
         Detection Rate : 0.005313           
   Detection Prevalence : 0.006947           
      Balanced Accuracy : 0.508968           
                                             
       'Positive' Class : 1                  
                                             

This model has no statistically significant predictors (neither RED_SPORTS_CAR nor YOUNG_MALE has a p-value below 0.05); however, it has a 74% accuracy rate. This is because it identified 99.8% of those who didn’t have an accident (and they are more numerous). When looking at the model’s performance on identifying the minority group of interest, we see it correctly identified only 2% of the people with accidents. So while the accuracy rate outperformed the baseline model, this will not be the model we use.

Traditional Model

Traditional wisdom holds that there are a few tried and true predictors of someone’s risk of an automobile accident. Namely: age (under 25 vs. 25+), marital status, driving record, distance to work, and the driver’s sex (which may or may not be proper to use, but seems to have held throughout the years).


Call:
glm(formula = TARGET_FLAG ~ YOUNG + MSTATUS + PRIOR_ACCIDENT + 
    SEX + REVOKED + MVR_PTS + TRAVTIME + CAR_USE, family = binomial(link = "logit"), 
    data = over_sample_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2918  -0.9992  -0.1634   1.0394   1.9197  

Coefficients:
                 Estimate Std. Error z value             Pr(>|z|)    
(Intercept)     -1.127120   0.075485 -14.932 < 0.0000000000000002 ***
YOUNG1           0.803997   0.244210   3.292             0.000994 ***
MSTATUSz_No      0.585304   0.047681  12.275 < 0.0000000000000002 ***
PRIOR_ACCIDENT1  0.813243   0.051742  15.717 < 0.0000000000000002 ***
SEXz_F           0.298638   0.050099   5.961        0.00000000251 ***
REVOKEDYes       0.805760   0.068452  11.771 < 0.0000000000000002 ***
MVR_PTS          0.142383   0.011822  12.044 < 0.0000000000000002 ***
TRAVTIME         0.008707   0.001515   5.746        0.00000000912 ***
CAR_USEPrivate  -0.586681   0.051014 -11.500 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11662  on 8411  degrees of freedom
Residual deviance: 10350  on 8403  degrees of freedom
AIC: 10368

Number of Fisher Scoring iterations: 4
F1 = 0.510559 
R2 = 0.1124805 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1248  234
         1  554  411
                                             
               Accuracy : 0.678              
                 95% CI : (0.6591, 0.6965)   
    No Information Rate : 0.7364             
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.2845             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.6372             
            Specificity : 0.6926             
         Pos Pred Value : 0.4259             
         Neg Pred Value : 0.8421             
             Prevalence : 0.2636             
         Detection Rate : 0.1680             
   Detection Prevalence : 0.3944             
      Balanced Accuracy : 0.6649             
                                             
       'Positive' Class : 1                  
                                             

This model has statistically significant predictors. Applying it to the test data set gives a 67.8% accuracy rate. It correctly identified 63.7% of the people with accidents and 69.3% of those without. This model outperforms the baseline model.

Traditional Model with Cross-Validation

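Here we try to use cross-validation techniques to improve the traditional model. We go back to the raw training data because it is typically not advised to oversample before cross-validation: copies of the same record could land in both the training and validation folds. Instead, up-sampling is applied within each resampling fold.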
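To illustrate the idea of resampling inside each fold, here is a base R sketch on synthetic data. It is not the code that produced the output below; the variable names, the helper upsample, and the 0.5 threshold are all assumptions.

```r
set.seed(621)
n <- 500
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1.2 + 0.8 * d$x))   # imbalanced synthetic outcome

# Hypothetical helper: duplicate minority rows until both classes match
upsample <- function(df) {
  pos <- df[df$y == 1, ]
  neg <- df[df$y == 0, ]
  small <- if (nrow(pos) < nrow(neg)) pos else neg
  large <- if (nrow(pos) < nrow(neg)) neg else pos
  extra <- small[sample(nrow(small), nrow(large) - nrow(small), replace = TRUE), ]
  rbind(large, small, extra)
}

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment

# Up-sample only the training part of each fold, then score the held-out part
acc <- sapply(seq_len(k), function(i) {
  fit      <- glm(y ~ x, family = binomial, data = upsample(d[fold != i, ]))
  held_out <- d[fold == i, ]
  pred     <- as.integer(predict(fit, held_out, type = "response") > 0.5)
  mean(pred == held_out$y)
})
mean(acc)
```

Each held-out fold keeps its original class balance, so the averaged accuracy is not inflated by duplicated records.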

Generalized Linear Model 

5714 samples
   8 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 4572, 4571, 4571, 4571, 4571 
Addtional sampling using up-sampling

Resampling results:

  Accuracy   Kappa    
  0.6757072  0.2826379
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1249  236
         1  553  409
                                             
               Accuracy : 0.6776             
                 95% CI : (0.6586, 0.6961)   
    No Information Rate : 0.7364             
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.2826             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.6341             
            Specificity : 0.6931             
         Pos Pred Value : 0.4252             
         Neg Pred Value : 0.8411             
             Prevalence : 0.2636             
         Detection Rate : 0.1671             
   Detection Prevalence : 0.3931             
      Balanced Accuracy : 0.6636             
                                             
       'Positive' Class : 1                  
                                             

Alternate Traditional Model

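This model is an alternative to the traditional model. It drops YOUNG, MVR_PTS, and TRAVTIME and adds KID_DRIVERS, INCOME, COLLEGE_EDUCATED, and URBAN_DRIVER.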


Call:
glm(formula = TARGET_FLAG ~ PRIOR_ACCIDENT + KID_DRIVERS + MSTATUS + 
    INCOME + SEX + CAR_USE + COLLEGE_EDUCATED + REVOKED + URBAN_DRIVER, 
    family = binomial(link = "logit"), data = over_sample_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.47843  -0.95017   0.04407   0.92959   2.57995  

Coefficients:
                       Estimate    Std. Error z value             Pr(>|z|)    
(Intercept)       -1.5346466203  0.0927535704 -16.545 < 0.0000000000000002 ***
PRIOR_ACCIDENT1    0.7921271578  0.0506657588  15.634 < 0.0000000000000002 ***
KID_DRIVERS1       0.8289530622  0.0742694029  11.161 < 0.0000000000000002 ***
MSTATUSz_No        0.7589906840  0.0508503649  14.926 < 0.0000000000000002 ***
INCOME            -0.0000083140  0.0000006582 -12.632 < 0.0000000000000002 ***
SEXz_F             0.2759209075  0.0528561497   5.220          0.000000179 ***
CAR_USEPrivate    -0.7343462515  0.0546026351 -13.449 < 0.0000000000000002 ***
COLLEGE_EDUCATED1 -0.5201355594  0.0584399941  -8.900 < 0.0000000000000002 ***
REVOKEDYes         0.6851814650  0.0720376503   9.511 < 0.0000000000000002 ***
URBAN_DRIVER1      1.9516574434  0.0842859474  23.155 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11661.5  on 8411  degrees of freedom
Residual deviance:  9501.3  on 8402  degrees of freedom
AIC: 9521.3

Number of Fisher Scoring iterations: 4
F1 = 0.5638611 
R2 = 0.1852393 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1227  166
         1  575  479
                                             
               Accuracy : 0.6972             
                 95% CI : (0.6785, 0.7153)   
    No Information Rate : 0.7364             
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.3519             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.7426             
            Specificity : 0.6809             
         Pos Pred Value : 0.4545             
         Neg Pred Value : 0.8808             
             Prevalence : 0.2636             
         Detection Rate : 0.1957             
   Detection Prevalence : 0.4307             
      Balanced Accuracy : 0.7118             
                                             
       'Positive' Class : 1                  
                                             

This model’s predictors are all statistically significant. Applying it to the test data set gives a 69.7% accuracy rate. It correctly identified 74.3% of the people with accidents and 68.1% of those without. This model also outperforms the baseline model. Now that we have a few viable classification models, we can turn our attention to the linear claim model.

Linear Claim Model

Baseline Model

Given the unusual distribution, one possible solution would be to predict that every claim equals the median value. This is a gross simplification but will serve as the baseline for the linear models to compete against. To measure how well these models perform, we calculate the prediction error as a percentage of the actual value and then average it across the data set.

Average Error Average Percent
-1630.658 0.504024

By simply using the median to predict the values of the claims, we would underestimate the actual value by about $1,631 on average. On a percentage basis, the error would be 50% of the actual.
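The two error measures might be computed as in the sketch below. Two things are assumed rather than taken from the report: that the average error is signed (predicted minus actual), and that the percentage uses the absolute error. The numbers are invented.

```r
# Invented claim amounts for illustration
train_claims <- c(3000, 4000, 5000)
actual       <- c(2000, 4000, 6000, 30000)

# Baseline: predict the training median for every test record
pred <- rep(median(train_claims), length(actual))

avg_error   <- mean(pred - actual)                 # negative = underestimate
avg_percent <- mean(abs(pred - actual) / actual)   # error as a share of actual
```

A single large claim pulls the average error strongly negative, which mirrors how the long tail affects the median baseline.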

Simple Linear Model

This simple linear model assumes that the amount of the claim depends on the value of the vehicle. More expensive vehicles should be more costly to repair than less expensive vehicles. It is overly simplistic.


Call:
lm(formula = TARGET_AMT ~ BLUEBOOK, data = amt_train)

Residuals:
   Min     1Q Median     3Q    Max 
 -7818  -3153  -1548    326  78883 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 4038.05174  387.55794  10.419 < 0.0000000000000002 ***
BLUEBOOK       0.11399    0.02316   4.921          0.000000955 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7536 on 1507 degrees of freedom
Multiple R-squared:  0.01582,   Adjusted R-squared:  0.01516 
F-statistic: 24.22 on 1 and 1507 DF,  p-value: 0.0000009546

While this predictor is statistically significant and positive, the model’s overall explanatory power is weak. Let’s see how it performed on the test set:

Average Error Average Percent
-130.8995 1.001244

The average error has decreased, which is good, but the average error as a share of the actual has doubled. This is a garbage model.

Outlier Model

We previously flagged the outliers in the data set. If we could build another classification model to predict this flag, how well would the resulting claim model perform?


Call:
lm(formula = TARGET_AMT ~ TARGET_AMT_OUTLIER, data = amt_train)

Residuals:
   Min     1Q Median     3Q    Max 
-16808  -1661   -186   1345  58122 

Coefficients:
                   Estimate Std. Error t value            Pr(>|t|)    
(Intercept)          4032.0      124.3   32.43 <0.0000000000000002 ***
TARGET_AMT_OUTLIER  23370.1      466.9   50.05 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4655 on 1507 degrees of freedom
Multiple R-squared:  0.6244,    Adjusted R-squared:  0.6242 
F-statistic:  2505 on 1 and 1507 DF,  p-value: < 0.00000000000000022

This model is unfair, as it predicts the outcome using a predictor derived from the outcome. Unsurprisingly, this model does a really good job. It has an adjusted \(R^2\) of 0.6241628.

The reason we explored this model is to see if we have a reason to build another classifier. If we built a second classifier that could predict this variable, would it be worth it? Let’s see how the model performs on the test set:

Average Error Average Percent
-140.1835 0.5530068

The predictions on average understate the actual value by $140. The error as a percent of the actual is close to what we get with just using the median. The improvement really comes from using a higher value to estimate the claim for the outliers.

The challenge will be to create a classifier. There are 107 out of 1509 that are outliers. Let’s create another balanced data set and see if we can build another classifier.

Let’s explore which variables are correlated with the outlier flag:

It looks like the BLUEBOOK and PARENT1 are the only variables that are correlated with this outcome. So let’s build the classifier.


Call:
glm(formula = TARGET_AMT_OUTLIER ~ BLUEBOOK + PARENT1, family = binomial(link = "logit"), 
    data = over_sample_train_2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0215  -1.1039  -0.1248   1.1664   1.4622  

Coefficients:
                Estimate   Std. Error z value             Pr(>|z|)    
(Intercept) -0.889430410  0.084530469 -10.522 < 0.0000000000000002 ***
BLUEBOOK     0.000048194  0.000004618  10.435 < 0.0000000000000002 ***
PARENT1Yes   0.441269006  0.087932394   5.018          0.000000521 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3887.2  on 2803  degrees of freedom
Residual deviance: 3732.8  on 2801  degrees of freedom
AIC: 3738.8

Number of Fisher Scoring iterations: 4
F1 = 0.04214963 
R2 = 0.03970578 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1518   24
         1  885   20
                                             
               Accuracy : 0.6285             
                 95% CI : (0.609, 0.6477)    
    No Information Rate : 0.982              
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.0081             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.454545           
            Specificity : 0.631710           
         Pos Pred Value : 0.022099           
         Neg Pred Value : 0.984436           
             Prevalence : 0.017981           
         Detection Rate : 0.008173           
   Detection Prevalence : 0.369841           
      Balanced Accuracy : 0.543128           
                                             
       'Positive' Class : 1                  
                                             

Now to test this out. Based on our linear regression test set we will predict if the observation is likely to be an outlier or not. We will then estimate the TARGET_AMT using the linear model. We will then evaluate the results of the “model”.

Average Error Average Percent
-140.1835 0.5530068

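So through this somewhat convoluted approach we have done better than the median on average error (an underestimate of about $140 versus about $1,631), although the average percent error (55%) is slightly worse than the median baseline’s 50%.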
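The two-step procedure described above (classify the outlier flag, then feed the predicted flag into the linear model) can be sketched as follows. The data are synthetic, the classifier is fit without oversampling for brevity, and the 0.5 threshold is an assumption.

```r
set.seed(621)
n <- 400

# Synthetic training data; coefficients are arbitrary
train <- data.frame(BLUEBOOK = runif(n, 1500, 70000))
train$TARGET_AMT_OUTLIER <- rbinom(n, 1, plogis(-3 + 0.00005 * train$BLUEBOOK))
train$TARGET_AMT <- 4000 + 23000 * train$TARGET_AMT_OUTLIER + rnorm(n, sd = 1000)

# Step 1: classifier for the outlier flag
clf <- glm(TARGET_AMT_OUTLIER ~ BLUEBOOK, family = binomial, data = train)

# Step 2: linear model for the claim amount given the flag
amt_fit <- lm(TARGET_AMT ~ TARGET_AMT_OUTLIER, data = train)

# Prediction: the predicted flag stands in for the true, unknown flag
test <- data.frame(BLUEBOOK = c(5000, 20000, 69000))
test$TARGET_AMT_OUTLIER <- as.integer(predict(clf, test, type = "response") > 0.5)
test$PRED_AMT <- predict(amt_fit, test)
```

Since the linear model has a single binary predictor, every prediction is one of only two values, so the quality of the result depends almost entirely on the classifier.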

Mike Silva

2019-11-15