(1) DATA PREPARATION

From the summary() function, the training set has 8161 observations on 26 variables, which include an index and two response variables (TARGET_FLAG and TARGET_AMT).

From the summary, the mean of TARGET_FLAG is 0.2638, which is the proportion of cases that filed a claim (26.4%).
We also note the number of NA values for AGE (6), YOJ (454), INCOME (445), HOME_VAL (464), and CAR_AGE (510).

sum(train$TARGET_FLAG == 0)
## [1] 6008
sum(train$TARGET_AMT == 0)
## [1] 6008

Of the 8161 total cases, 6008 resulted in no claim, and the same number have a claim amount of 0, which is consistent with every no-claim case having a claim amount of 0.
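
Matching counts alone do not prove that the two conditions pick out the same rows. A minimal sketch of a row-level check follows; it uses a small invented data frame named `toy` (an assumption, standing in for `train`) so that it runs on its own.

```r
# Toy stand-in for `train`; the values are invented for illustration only.
toy <- data.frame(
  TARGET_FLAG = c(0, 1, 0, 1, 0),
  TARGET_AMT  = c(0, 2946, 0, 2501, 0)
)

# Every no-claim row should carry a zero amount ...
no_claim_all_zero <- all(toy$TARGET_AMT[toy$TARGET_FLAG == 0] == 0)
# ... and every claim row should carry a positive amount.
claim_all_positive <- all(toy$TARGET_AMT[toy$TARGET_FLAG == 1] > 0)

print(c(no_claim_all_zero = no_claim_all_zero,
        claim_all_positive = claim_all_positive))
```

The same two expressions could be applied to `train` directly.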

There are also a few quantitative variables that are best expressed as categorical ordinal variables.

train$KIDSDRIV <- as.factor(train$KIDSDRIV)
train$HOMEKIDS <- as.factor(train$HOMEKIDS)
train$CLM_FREQ <- as.factor(train$CLM_FREQ)

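Several variables holding dollar amounts are stored as categorical (text) values and are best expressed as quantitative ones, so we strip the dollar signs and commas and convert them to numbers.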

train$INCOME <- as.numeric(gsub('\\$|,', '', train$INCOME))
train$HOME_VAL <- as.numeric(gsub('\\$|,', '', train$HOME_VAL))
train$BLUEBOOK <- as.numeric(gsub('\\$|,', '', train$BLUEBOOK))
train$OLDCLAIM <- as.numeric(gsub('\\$|,', '', train$OLDCLAIM))
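
The four conversions above repeat the same pattern. As a sketch only, the pattern could be wrapped in a small helper; the name `parse_dollars` and the sample strings are hypothetical and do not come from the data set.

```r
# Hypothetical helper: strip "$" and "," from currency strings, then convert.
# as.character() guards against factor input being converted via its codes.
parse_dollars <- function(x) {
  as.numeric(gsub("\\$|,", "", as.character(x)))
}

# Invented sample values, including a missing entry.
sample_values <- c("$67,349", "$0", NA)
parsed <- parse_dollars(sample_values)
print(parsed)
```

Missing entries stay NA, so the later complete.cases() filter still sees them.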

Finally, we remove all cases with missing (NA) values, leaving 6448 cases, of which 4745 (73.6%) are without claims.

train1 <- train[complete.cases(train),]
nrow(train1)
## [1] 6448
sum(train1$TARGET_FLAG == 0) / nrow(train1)
## [1] 0.7358871
#6448 entries
summary(train1)
##      INDEX        TARGET_FLAG       TARGET_AMT    KIDSDRIV      AGE       
##  Min.   :    1   Min.   :0.0000   Min.   :    0   0:5683   Min.   :16.00  
##  1st Qu.: 2472   1st Qu.:0.0000   1st Qu.:    0   1: 493   1st Qu.:39.00  
##  Median : 5034   Median :0.0000   Median :    0   2: 220   Median :45.00  
##  Mean   : 5083   Mean   :0.2641   Mean   : 1497   3:  50   Mean   :44.74  
##  3rd Qu.: 7671   3rd Qu.:1.0000   3rd Qu.: 1036   4:   2   3rd Qu.:51.00  
##  Max.   :10302   Max.   :1.0000   Max.   :85524            Max.   :81.00  
##                                                                           
##  HOMEKIDS      YOJ            INCOME       PARENT1       HOME_VAL     
##  0:4178   Min.   : 0.00   Min.   :     0   No :5590   Min.   :     0  
##  1: 706   1st Qu.: 9.00   1st Qu.: 28180   Yes: 858   1st Qu.:     0  
##  2: 876   Median :11.00   Median : 54287              Median :162331  
##  3: 544   Mean   :10.54   Mean   : 62020              Mean   :155438  
##  4: 133   3rd Qu.:13.00   3rd Qu.: 86438              3rd Qu.:239934  
##  5:  11   Max.   :23.00   Max.   :367030              Max.   :885282  
##                                                                       
##  MSTATUS      SEX               EDUCATION               JOB      
##  Yes :3832   M  :3000   <High School : 955   z_Blue Collar:1476  
##  z_No:2616   z_F:3448   Bachelors    :1740   Clerical     :1031  
##                         Masters      :1313   Professional : 868  
##                         PhD          : 568   Manager      : 779  
##                         z_High School:1872   Lawyer       : 670  
##                                              Student      : 537  
##                                              (Other)      :1087  
##     TRAVTIME            CAR_USE        BLUEBOOK          TIF        
##  Min.   :  5.00   Commercial:2404   Min.   : 1500   Min.   : 1.000  
##  1st Qu.: 23.00   Private   :4044   1st Qu.: 9388   1st Qu.: 1.000  
##  Median : 33.00                     Median :14460   Median : 4.000  
##  Mean   : 33.66                     Mean   :15761   Mean   : 5.377  
##  3rd Qu.: 44.00                     3rd Qu.:20820   3rd Qu.: 7.000  
##  Max.   :142.00                     Max.   :69740   Max.   :25.000  
##                                                                     
##         CAR_TYPE    RED_CAR       OLDCLAIM     CLM_FREQ REVOKED   
##  Minivan    :1716   no :4566   Min.   :    0   0:3984   No :5658  
##  Panel Truck: 528   yes:1882   1st Qu.:    0   1: 765   Yes: 790  
##  Pickup     :1111              Median :    0   2: 915             
##  Sports Car : 719              Mean   : 4046   3: 615             
##  Van        : 577              3rd Qu.: 4600   4: 156             
##  z_SUV      :1797              Max.   :57037   5:  13             
##                                                                   
##     MVR_PTS          CAR_AGE                       URBANICITY  
##  Min.   : 0.000   Min.   :-3.000   Highly Urban/ Urban  :5134  
##  1st Qu.: 0.000   1st Qu.: 1.000   z_Highly Rural/ Rural:1314  
##  Median : 1.000   Median : 8.000                               
##  Mean   : 1.705   Mean   : 8.296                               
##  3rd Qu.: 3.000   3rd Qu.:12.000                               
##  Max.   :13.000   Max.   :28.000                               
## 

(2) DATA EXPLORATION

Further, for variables with multiple categories, we construct probability tables to check for features associated with a higher incidence of claims. The variables we investigate are KIDSDRIV, HOMEKIDS, EDUCATION, JOB, and CAR_TYPE.
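
The code for the probability tables is not shown in this report. One plausible way to build such a table, sketched here on an invented data frame `toy` (an assumption, not the real data), is to cross-tabulate the factor against TARGET_FLAG and convert each row to proportions.

```r
# Invented data for illustration; the real analysis would use train1.
toy <- data.frame(
  KIDSDRIV    = factor(c(0, 0, 0, 0, 1, 1, 2, 2)),
  TARGET_FLAG = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Counts of each KIDSDRIV level by claim outcome.
counts <- table(KIDSDRIV = toy$KIDSDRIV, TARGET_FLAG = toy$TARGET_FLAG)

# margin = 1 makes each row sum to 1, so the "1" column is the
# claim rate within each KIDSDRIV level.
claim_rate <- prop.table(counts, margin = 1)
print(round(claim_rate, 2))
```

Using margin = 2 instead would give each level's share of the claim and non-claim groups, which is the comparison used for EDUCATION.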

KIDSDRIV

Having even one kid who drives raises the incidence of making a claim. As the number of driving children increases, the ratio of claims to non-claims increases.

ggplot(train, aes(KIDSDRIV, TARGET_FLAG)) +
geom_jitter(aes(color = TARGET_FLAG), size = 0.5)

HOMEKIDS

Indeed, simply having kids at home makes it more likely that one has made a damage claim.

EDUCATION

We can see from this probability table that high-school-educated people have a far greater share of the claims cases, while those with Masters and PhD degrees have a lower share of the claims cases than of the non-claims cases.

JOB

Doctors, lawyers, and managers indeed have a lower incidence of claims than students and blue-collar workers.

CAR_TYPE

From the probability table of CAR_TYPE we note a higher incidence of accidents among SUVs, sports cars, and pickups, and a significantly lower incidence among minivans.

Here we plot age against claim occurrence, looking for signs that younger and older drivers are more likely to get into accidents.

ggplot(train, aes(AGE, TARGET_FLAG)) +
geom_jitter(aes(color = TARGET_FLAG), size = 0.5)
## Warning: Removed 6 rows containing missing values (geom_point).

At lower ages, the crash rate does appear to be higher.

In this visualization, we are not convinced that increased travel time to work has an obvious positive impact on claim incidence as might be expected.

ggplot(train1, aes(TRAVTIME, TARGET_FLAG)) +
geom_jitter(aes(color = TARGET_FLAG), size = 0.5)

Home value looks to have predictive value, with a higher median among customers without claims.

boxplot(HOME_VAL ~ TARGET_FLAG, data = train1, ylab="HOME_VAL", names=c("no claim","with claim"))

The gap in home value between customers with and without claims is further supported by this density plot.

a <- ggplot(train1, aes(x = HOME_VAL))
a + geom_density(aes(fill = factor(TARGET_FLAG)), alpha = 0.4)

Also the median past claim amount is higher for customers with claims than for those without.

boxplot(OLDCLAIM ~ TARGET_FLAG, data = train1, ylab="OLDCLAIM", names=c("no claim","with claim"))

(3) BUILD MODELS

Since the response variable for the first problem is binary, we use binomial logistic regression.

train2 <- train1[-c(1,3)]
mod1 <- glm(TARGET_FLAG~., train2, family=binomial)
summary(mod1)
## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial, data = train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4080  -0.7086  -0.3939   0.6328   3.1667  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -1.063e+00  3.719e-01  -2.859 0.004243 ** 
## KIDSDRIV1                        3.522e-01  1.302e-01   2.704 0.006847 ** 
## KIDSDRIV2                        7.090e-01  1.842e-01   3.850 0.000118 ***
## KIDSDRIV3                        8.120e-01  3.497e-01   2.322 0.020239 *  
## KIDSDRIV4                       -1.152e+01  2.035e+02  -0.057 0.954833    
## AGE                              2.202e-04  4.702e-03   0.047 0.962649    
## HOMEKIDS1                        3.924e-01  1.341e-01   2.926 0.003438 ** 
## HOMEKIDS2                        2.702e-01  1.324e-01   2.041 0.041238 *  
## HOMEKIDS3                        1.695e-01  1.542e-01   1.100 0.271534    
## HOMEKIDS4                        1.012e-01  2.420e-01   0.418 0.675819    
## HOMEKIDS5                        4.411e-01  7.503e-01   0.588 0.556654    
## YOJ                             -1.173e-02  9.618e-03  -1.219 0.222781    
## INCOME                          -2.968e-06  1.267e-06  -2.343 0.019144 *  
## PARENT1Yes                       2.486e-01  1.363e-01   1.824 0.068207 .  
## HOME_VAL                        -1.236e-06  3.906e-07  -3.165 0.001549 ** 
## MSTATUSz_No                      5.227e-01  1.007e-01   5.191 2.09e-07 ***
## SEXz_F                          -2.033e-01  1.245e-01  -1.632 0.102621    
## EDUCATIONBachelors              -3.823e-01  1.310e-01  -2.918 0.003525 ** 
## EDUCATIONMasters                -3.392e-01  2.061e-01  -1.646 0.099770 .  
## EDUCATIONPhD                    -1.058e-01  2.454e-01  -0.431 0.666429    
## EDUCATIONz_High School          -9.990e-03  1.068e-01  -0.094 0.925451    
## JOBClerical                      5.076e-01  2.246e-01   2.260 0.023831 *  
## JOBDoctor                       -1.827e-01  2.887e-01  -0.633 0.526760    
## JOBHome Maker                    2.218e-01  2.420e-01   0.917 0.359394    
## JOBLawyer                        2.658e-01  1.912e-01   1.390 0.164505    
## JOBManager                      -5.721e-01  1.982e-01  -2.887 0.003893 ** 
## JOBProfessional                  2.106e-01  2.037e-01   1.034 0.301042    
## JOBStudent                       1.782e-01  2.482e-01   0.718 0.472736    
## JOBz_Blue Collar                 3.126e-01  2.126e-01   1.470 0.141500    
## TRAVTIME                         1.577e-02  2.125e-03   7.424 1.14e-13 ***
## CAR_USEPrivate                  -8.296e-01  1.044e-01  -7.946 1.92e-15 ***
## BLUEBOOK                        -2.051e-05  5.901e-06  -3.476 0.000509 ***
## TIF                             -5.375e-02  8.289e-03  -6.485 8.89e-11 ***
## CAR_TYPEPanel Truck              5.796e-01  1.814e-01   3.196 0.001395 ** 
## CAR_TYPEPickup                   5.218e-01  1.139e-01   4.583 4.58e-06 ***
## CAR_TYPESports Car               1.132e+00  1.452e-01   7.791 6.64e-15 ***
## CAR_TYPEVan                      6.134e-01  1.427e-01   4.299 1.71e-05 ***
## CAR_TYPEz_SUV                    8.522e-01  1.245e-01   6.847 7.57e-12 ***
## RED_CARyes                      -1.124e-01  9.722e-02  -1.156 0.247619    
## OLDCLAIM                        -1.857e-05  4.712e-06  -3.941 8.12e-05 ***
## CLM_FREQ1                        5.611e-01  1.135e-01   4.943 7.68e-07 ***
## CLM_FREQ2                        6.408e-01  1.063e-01   6.030 1.64e-09 ***
## CLM_FREQ3                        6.259e-01  1.196e-01   5.231 1.69e-07 ***
## CLM_FREQ4                        7.651e-01  1.950e-01   3.923 8.74e-05 ***
## CLM_FREQ5                        9.112e-01  6.705e-01   1.359 0.174131    
## REVOKEDYes                       9.373e-01  1.052e-01   8.912  < 2e-16 ***
## MVR_PTS                          9.934e-02  1.587e-02   6.260 3.84e-10 ***
## CAR_AGE                         -7.039e-03  8.474e-03  -0.831 0.406152    
## URBANICITYz_Highly Rural/ Rural -2.268e+00  1.246e-01 -18.200  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7445.1  on 6447  degrees of freedom
## Residual deviance: 5736.7  on 6399  degrees of freedom
## AIC: 5834.7
## 
## Number of Fisher Scoring iterations: 11
# AIC: 5834.7

Checking for single-coefficient variables whose 95% confidence interval straddles 0, we find the following six:

AGE, YOJ, PARENT1, SEX, RED_CAR, CAR_AGE

(ci <- round(confint.default(mod1),3))
##                                    2.5 %  97.5 %
## (Intercept)                       -1.792  -0.334
## KIDSDRIV1                          0.097   0.607
## KIDSDRIV2                          0.348   1.070
## KIDSDRIV3                          0.127   1.497
## KIDSDRIV4                       -410.327 387.278
## AGE                               -0.009   0.009
## HOMEKIDS1                          0.130   0.655
## HOMEKIDS2                          0.011   0.530
## HOMEKIDS3                         -0.133   0.472
## HOMEKIDS4                         -0.373   0.575
## HOMEKIDS5                         -1.030   1.912
## YOJ                               -0.031   0.007
## INCOME                             0.000   0.000
## PARENT1Yes                        -0.019   0.516
## HOME_VAL                           0.000   0.000
## MSTATUSz_No                        0.325   0.720
## SEXz_F                            -0.447   0.041
## EDUCATIONBachelors                -0.639  -0.125
## EDUCATIONMasters                  -0.743   0.065
## EDUCATIONPhD                      -0.587   0.375
## EDUCATIONz_High School            -0.219   0.199
## JOBClerical                        0.067   0.948
## JOBDoctor                         -0.749   0.383
## JOBHome Maker                     -0.253   0.696
## JOBLawyer                         -0.109   0.640
## JOBManager                        -0.961  -0.184
## JOBProfessional                   -0.189   0.610
## JOBStudent                        -0.308   0.665
## JOBz_Blue Collar                  -0.104   0.729
## TRAVTIME                           0.012   0.020
## CAR_USEPrivate                    -1.034  -0.625
## BLUEBOOK                           0.000   0.000
## TIF                               -0.070  -0.038
## CAR_TYPEPanel Truck                0.224   0.935
## CAR_TYPEPickup                     0.299   0.745
## CAR_TYPESports Car                 0.847   1.416
## CAR_TYPEVan                        0.334   0.893
## CAR_TYPEz_SUV                      0.608   1.096
## RED_CARyes                        -0.303   0.078
## OLDCLAIM                           0.000   0.000
## CLM_FREQ1                          0.339   0.784
## CLM_FREQ2                          0.432   0.849
## CLM_FREQ3                          0.391   0.860
## CLM_FREQ4                          0.383   1.147
## CLM_FREQ5                         -0.403   2.225
## REVOKEDYes                         0.731   1.143
## MVR_PTS                            0.068   0.130
## CAR_AGE                           -0.024   0.010
## URBANICITYz_Highly Rural/ Rural   -2.513  -2.024
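
Reading the interval table by eye is error-prone, especially where rounding shows both bounds as 0.000. A minimal sketch of flagging the intervals programmatically is given below; it fits a small simulated model (the data, `x1`, `x2`, and `fit` are assumptions for illustration) rather than mod1 so that it runs on its own.

```r
# Simulated data: x1 truly affects the outcome, x2 does not.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 1.5 * x1))

fit <- glm(y ~ x1 + x2, family = binomial)

# Wald intervals, the same kind confint.default() gives for mod1.
ci <- confint.default(fit)

# TRUE where the lower bound is below 0 and the upper bound is above 0.
spans_zero <- ci[, 1] < 0 & ci[, 2] > 0
print(names(which(spans_zero)))
```

Applied to the unrounded confint.default(mod1), the same comparison would separate coefficients such as INCOME from those whose interval really contains 0.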

AIC goes down slightly (from 5834.7 to 5831.4) with the removal of those six variables, improving the model’s parsimony without sacrificing much performance. The residual deviance does go up, from 5736.7 on 6399 degrees of freedom to 5745.4 on 6405 degrees of freedom.
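
The trade-off can be checked with a little arithmetic on the figures reported in this section. The sketch below assumes the usual relationship for a 0/1 logistic regression, AIC = residual deviance + 2 × (number of coefficients), and adds a likelihood-ratio test that the report itself does not perform.

```r
# Figures copied from the glm summaries reported in this section.
n_obs       <- 6448
dev_full    <- 5736.7; df_full    <- 6399   # all predictors
dev_reduced <- 5745.4; df_reduced <- 6405   # six variables removed

# Number of estimated coefficients = observations - residual df.
k_full    <- n_obs - df_full
k_reduced <- n_obs - df_reduced

# For 0/1 responses the saturated log-likelihood is 0, so
# AIC = residual deviance + 2 * k.
aic_full    <- dev_full + 2 * k_full
aic_reduced <- dev_reduced + 2 * k_reduced

# Likelihood-ratio test of the six dropped coefficients.
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(aic_full = aic_full, aic_reduced = aic_reduced, p_value = p_value))
```

The deviance rises by 8.7 on 6 degrees of freedom, which gives a p-value of about 0.19, so the six variables are not jointly significant at the usual 5% level.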

Finally, we note that the 8 JOB dummy variables contribute little to performance and remove them as well, tolerating the rise in AIC from 5831.4 to 5872.6 and the accompanying rise in residual deviance.

mod3 <- glm(TARGET_FLAG~.-AGE-YOJ-PARENT1-SEX-RED_CAR-CAR_AGE-JOB, train2, family=binomial)
summary(mod3)
## 
## Call:
## glm(formula = TARGET_FLAG ~ . - AGE - YOJ - PARENT1 - SEX - RED_CAR - 
##     CAR_AGE - JOB, family = binomial, data = train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3599  -0.7197  -0.4041   0.6478   3.1750  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -9.300e-01  1.860e-01  -5.000 5.74e-07 ***
## KIDSDRIV1                        3.538e-01  1.272e-01   2.782 0.005396 ** 
## KIDSDRIV2                        6.703e-01  1.803e-01   3.718 0.000201 ***
## KIDSDRIV3                        7.474e-01  3.443e-01   2.171 0.029934 *  
## KIDSDRIV4                       -1.159e+01  2.058e+02  -0.056 0.955081    
## HOMEKIDS1                        5.168e-01  1.051e-01   4.920 8.67e-07 ***
## HOMEKIDS2                        3.744e-01  1.048e-01   3.573 0.000352 ***
## HOMEKIDS3                        2.766e-01  1.290e-01   2.144 0.032024 *  
## HOMEKIDS4                        1.923e-01  2.211e-01   0.870 0.384480    
## HOMEKIDS5                        5.312e-01  7.366e-01   0.721 0.470848    
## INCOME                          -4.288e-06  1.131e-06  -3.792 0.000150 ***
## HOME_VAL                        -1.051e-06  3.758e-07  -2.798 0.005145 ** 
## MSTATUSz_No                      6.424e-01  8.133e-02   7.898 2.83e-15 ***
## EDUCATIONBachelors              -5.950e-01  1.111e-01  -5.354 8.60e-08 ***
## EDUCATIONMasters                -6.281e-01  1.246e-01  -5.041 4.62e-07 ***
## EDUCATIONPhD                    -5.122e-01  1.683e-01  -3.044 0.002333 ** 
## EDUCATIONz_High School          -9.102e-02  1.031e-01  -0.883 0.377502    
## TRAVTIME                         1.615e-02  2.107e-03   7.666 1.78e-14 ***
## CAR_USEPrivate                  -8.733e-01  8.286e-02 -10.540  < 2e-16 ***
## BLUEBOOK                        -2.352e-05  5.294e-06  -4.443 8.88e-06 ***
## TIF                             -5.328e-02  8.233e-03  -6.471 9.73e-11 ***
## CAR_TYPEPanel Truck              5.646e-01  1.605e-01   3.518 0.000435 ***
## CAR_TYPEPickup                   4.641e-01  1.106e-01   4.197 2.70e-05 ***
## CAR_TYPESports Car               1.007e+00  1.185e-01   8.495  < 2e-16 ***
## CAR_TYPEVan                      6.198e-01  1.356e-01   4.569 4.90e-06 ***
## CAR_TYPEz_SUV                    7.556e-01  9.581e-02   7.886 3.11e-15 ***
## OLDCLAIM                        -1.929e-05  4.660e-06  -4.140 3.48e-05 ***
## CLM_FREQ1                        5.683e-01  1.126e-01   5.048 4.47e-07 ***
## CLM_FREQ2                        6.524e-01  1.056e-01   6.178 6.48e-10 ***
## CLM_FREQ3                        6.309e-01  1.188e-01   5.311 1.09e-07 ***
## CLM_FREQ4                        7.580e-01  1.930e-01   3.928 8.57e-05 ***
## CLM_FREQ5                        6.775e-01  6.618e-01   1.024 0.305930    
## REVOKEDYes                       9.470e-01  1.041e-01   9.094  < 2e-16 ***
## MVR_PTS                          1.029e-01  1.573e-02   6.541 6.12e-11 ***
## URBANICITYz_Highly Rural/ Rural -2.205e+00  1.240e-01 -17.777  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7445.1  on 6447  degrees of freedom
## Residual deviance: 5802.6  on 6413  degrees of freedom
## AIC: 5872.6
## 
## Number of Fisher Scoring iterations: 11
#AIC: 5872.6
train2$predict <- predict(mod3, type = "response")
mean(train2$predict)
## [1] 0.2641129

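The mean fitted probability over all cases is 26.4%, matching the observed claim rate in train1 (0.2641), as expected for a logistic regression fitted with an intercept.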
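
This agreement is a general property of maximum-likelihood logistic regression with an intercept, not evidence of a good fit. A small sketch on simulated data (the variables `x`, `y`, and `fit` are assumptions for illustration) shows the same behaviour.

```r
# Simulated binary outcome with one predictor.
set.seed(11)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))

fit <- glm(y ~ x, family = binomial)

# With an intercept in the model, the fitted probabilities average
# to the observed event rate (up to numerical tolerance).
gap <- mean(fitted(fit)) - mean(y)
print(gap)
```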

train2$t50 <- ifelse(fitted(mod3) > 0.50, 1, 0)
mean(train2$t50)
## [1] 0.1670285

However, when we apply a 50% threshold to obtain binary predictions, the predicted claim rate drops to 16.7%, suggesting that the model is poor at detecting eventual claims (predicting positive claims); in other words, there is an excess of false negatives. The plot below shows the same thing, with a heavy density of points in Quadrant II (actual claims that were not predicted).

ggplot(train2, aes(t50, TARGET_FLAG)) +
geom_jitter(aes(color = TARGET_FLAG), size = 0.5)
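
The 50% cut-off is a choice, and lowering it trades specificity for sensitivity. The sketch below illustrates that trade-off on simulated data; the model, the thresholds, and the helper `sens_spec` are assumptions for illustration and are not part of the original analysis.

```r
# Simulated data with an event rate well below 50%.
set.seed(42)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1.2 + 1.1 * x))

fit <- glm(y ~ x, family = binomial)
p   <- fitted(fit)

# Hypothetical helper: sensitivity and specificity at one threshold.
sens_spec <- function(threshold) {
  pred <- as.integer(p > threshold)
  c(threshold   = threshold,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

# One row per threshold.
tradeoff <- t(sapply(c(0.2, 0.3, 0.4, 0.5), sens_spec))
print(round(tradeoff, 3))
```

As the threshold rises, fewer cases are flagged, so sensitivity can only fall and specificity can only rise.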

If the null hypothesis is that a case will not result in a claim, a type II error is the failure to reject a false null hypothesis (a “false negative”). A high rate of such errors means the model’s sensitivity is too low. We can confirm this with the caret package’s confusionMatrix() function.

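library(caret)
actual <- as.factor(train2$TARGET_FLAG)
predicted <- as.factor(train2$t50)
# rows = predicted, columns = actual, the layout confusionMatrix() expects for a table
cmatrix.t <- table(predicted, actual)
Caret_cmat <- confusionMatrix(cmatrix.t, positive = "1")
Caret_cmat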
## Confusion Matrix and Statistics
## 
##          actual
## predicted    0    1
##         0 4368 1003
##         1  377  700
##                                           
##                Accuracy : 0.786           
##                  95% CI : (0.7758, 0.7959)
##     No Information Rate : 0.7359          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3759          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4110          
##             Specificity : 0.9205          
##          Pos Pred Value : 0.6500          
##          Neg Pred Value : 0.8133          
##              Prevalence : 0.2641          
##          Detection Rate : 0.1086          
##    Detection Prevalence : 0.1670          
##       Balanced Accuracy : 0.6658          
##                                           
##        'Positive' Class : 1               
## 
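
The headline statistics can be reproduced by hand from the four cell counts in the confusion matrix above, which makes clear what each one measures. This is a worked check in base R, not a replacement for confusionMatrix().

```r
# Cell counts copied from the confusion matrix above (positive class = 1).
tp <- 700    # predicted claim, actual claim
fn <- 1003   # predicted no claim, actual claim
fp <- 377    # predicted claim, actual no claim
tn <- 4368   # predicted no claim, actual no claim

sensitivity <- tp / (tp + fn)            # share of actual claims detected
specificity <- tn / (tn + fp)            # share of non-claims correctly cleared
accuracy    <- (tp + tn) / (tp + fn + fp + tn)
ppv         <- tp / (tp + fp)            # Pos Pred Value
npv         <- tn / (tn + fn)            # Neg Pred Value

print(round(c(sensitivity = sensitivity, specificity = specificity,
              accuracy = accuracy, ppv = ppv, npv = npv), 4))
```

Of the 1703 actual claims, only 700 are flagged, which is where the 41.1% sensitivity comes from.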

We rate our model using the ROC (receiver operating characteristic) curve, which plots the performance of a binary classifier as its discrimination threshold is varied. The result is modest, with 81.2% area under the curve.

rocCurve <- roc(train2$TARGET_FLAG, train2$predict, levels=c(0,1))
plot(rocCurve, legacy.axes = TRUE)

The area under the curve is 81.2%.

auc(rocCurve)
## Area under the curve: 0.8116
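
The area under the curve has a direct interpretation: it is the probability that a randomly chosen claim case receives a higher score than a randomly chosen non-claim case. The sketch below checks this on simulated scores in base R; the data and variable names are assumptions for illustration.

```r
# Simulated scores: positives tend to score higher than negatives.
set.seed(7)
score <- c(rnorm(50, mean = 1), rnorm(150, mean = 0))
label <- c(rep(1, 50), rep(0, 150))

n1 <- sum(label == 1)
n0 <- sum(label == 0)

# Rank-based (Mann-Whitney) form of the AUC.
r <- rank(score)
auc_rank <- (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Brute force: share of positive/negative pairs ranked correctly.
auc_pairs <- mean(outer(score[label == 1], score[label == 0], ">"))

print(c(auc_rank = auc_rank, auc_pairs = auc_pairs))
```

Read this way, an AUC of 0.8116 means the model ranks a claim above a non-claim in roughly four out of five such pairs.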

In summary, we selected our model based on AIC, residual deviance, confusion matrix metrics, and ROC area under the curve.

(4) SELECT MODELS

Next, we build a model to predict the amount of damages, given that a claim is filed. There are 1703 such claims in our cleaned training set.

train3 <- train1
train3 <- train3[train3$TARGET_FLAG == 1,]
train3 <- train3[-c(1,2)]
nrow(train3)
## [1] 1703
head(train3)
##    TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1 HOME_VAL MSTATUS
## 6    2946.000        0  34        1  12 125301     Yes        0    z_No
## 9    2501.000        0  34        0  10  62978      No        0    z_No
## 11   6077.000        0  53        0  14  77100      No        0    z_No
## 14   1267.000        0  53        0  11 130795      No        0    z_No
## 15   2920.167        0  45        0   0      0      No   106859     Yes
## 20   6857.000        0  28        1  13  44077      No   170598     Yes
##    SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK TIF
## 6  z_F     Bachelors z_Blue Collar       46 Commercial    17430   1
## 9  z_F     Bachelors      Clerical       34    Private    11200   1
## 11 z_F       Masters        Lawyer       15    Private    18300   1
## 14   M           PhD                     64 Commercial    28340   6
## 15 z_F  <High School    Home Maker       48    Private     6000   1
## 20 z_F z_High School z_Blue Collar       29 Commercial     8710   6
##       CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 6   Sports Car      no        0        0      No       0       7
## 9        z_SUV      no        0        0      No       0       1
## 11  Sports Car      no        0        0      No       0      11
## 14 Panel Truck     yes        0        0      No       3      10
## 15       z_SUV      no        0        0      No       3       5
## 20       z_SUV      no     8935        2      No       0       1
##             URBANICITY
## 6  Highly Urban/ Urban
## 9  Highly Urban/ Urban
## 11 Highly Urban/ Urban
## 14 Highly Urban/ Urban
## 15 Highly Urban/ Urban
## 20 Highly Urban/ Urban

The full model with all explanatory variables fits poorly, with an adjusted R-squared of only 0.02278. The covariate with the lowest p-value is BLUEBOOK, suggesting it is the most useful predictor.

mod4 <- lm(TARGET_AMT~., train3)
summary(mod4)
## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = train3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11214  -3202  -1449    675  75880 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      3.734e+03  2.133e+03   1.750  0.08022 .  
## KIDSDRIV1                        6.633e+02  7.052e+02   0.941  0.34709    
## KIDSDRIV2                       -1.162e+03  9.433e+02  -1.232  0.21821    
## KIDSDRIV3                       -8.082e+02  1.671e+03  -0.484  0.62871    
## AGE                              8.248e-01  2.466e+01   0.033  0.97332    
## HOMEKIDS1                        4.742e+02  7.650e+02   0.620  0.53545    
## HOMEKIDS2                        1.758e+03  7.533e+02   2.333  0.01975 *  
## HOMEKIDS3                        6.053e+02  8.472e+02   0.714  0.47505    
## HOMEKIDS4                        6.801e+02  1.303e+03   0.522  0.60170    
## HOMEKIDS5                        1.268e+03  3.905e+03   0.325  0.74551    
## YOJ                              1.178e+01  5.484e+01   0.215  0.82992    
## INCOME                          -1.580e-02  7.860e-03  -2.011  0.04450 *  
## PARENT1Yes                      -6.457e+02  7.500e+02  -0.861  0.38942    
## HOME_VAL                         2.182e-03  2.271e-03   0.960  0.33698    
## MSTATUSz_No                      1.620e+03  5.978e+02   2.710  0.00680 ** 
## SEXz_F                          -2.007e+03  7.195e+02  -2.789  0.00535 ** 
## EDUCATIONBachelors               1.039e+02  7.181e+02   0.145  0.88498    
## EDUCATIONMasters                 8.512e+02  1.253e+03   0.679  0.49698    
## EDUCATIONPhD                     3.152e+03  1.485e+03   2.123  0.03390 *  
## EDUCATIONz_High School          -7.102e+02  5.656e+02  -1.256  0.20946    
## JOBClerical                     -6.853e+02  1.367e+03  -0.501  0.61622    
## JOBDoctor                       -3.556e+03  1.875e+03  -1.897  0.05807 .  
## JOBHome Maker                   -5.814e+02  1.448e+03  -0.401  0.68820    
## JOBLawyer                       -2.301e+02  1.149e+03  -0.200  0.84132    
## JOBManager                      -1.336e+03  1.226e+03  -1.090  0.27587    
## JOBProfessional                  1.312e+03  1.283e+03   1.022  0.30676    
## JOBStudent                      -6.343e+02  1.477e+03  -0.429  0.66775    
## JOBz_Blue Collar                 4.918e+02  1.306e+03   0.377  0.70657    
## TRAVTIME                         4.357e+00  1.238e+01   0.352  0.72487    
## CAR_USEPrivate                  -2.944e+02  5.869e+02  -0.502  0.61597    
## BLUEBOOK                         1.516e-01  3.384e-02   4.480 7.96e-06 ***
## TIF                             -4.760e+00  4.713e+01  -0.101  0.91957    
## CAR_TYPEPanel Truck             -3.201e+02  1.057e+03  -0.303  0.76207    
## CAR_TYPEPickup                   3.444e+02  6.643e+02   0.518  0.60426    
## CAR_TYPESports Car               2.118e+03  8.280e+02   2.558  0.01061 *  
## CAR_TYPEVan                     -2.606e+02  8.608e+02  -0.303  0.76208    
## CAR_TYPEz_SUV                    1.851e+03  7.390e+02   2.505  0.01234 *  
## RED_CARyes                      -2.594e+02  5.528e+02  -0.469  0.63892    
## OLDCLAIM                         4.486e-02  2.755e-02   1.628  0.10367    
## CLM_FREQ1                        2.982e+02  6.288e+02   0.474  0.63538    
## CLM_FREQ2                       -4.174e+02  5.867e+02  -0.711  0.47688    
## CLM_FREQ3                       -8.287e+02  6.467e+02  -1.281  0.20026    
## CLM_FREQ4                        4.424e+01  1.047e+03   0.042  0.96631    
## CLM_FREQ5                       -1.444e+03  3.457e+03  -0.418  0.67632    
## REVOKEDYes                      -1.186e+03  6.015e+02  -1.972  0.04878 *  
## MVR_PTS                          7.061e+01  7.799e+01   0.905  0.36536    
## CAR_AGE                         -9.724e+01  4.886e+01  -1.990  0.04672 *  
## URBANICITYz_Highly Rural/ Rural -7.134e+01  8.203e+02  -0.087  0.93070    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7581 on 1655 degrees of freedom
## Multiple R-squared:  0.04977,    Adjusted R-squared:  0.02278 
## F-statistic: 1.844 on 47 and 1655 DF,  p-value: 0.0004915
#Adjusted R-squared:  0.02278

We settled on a model using forward selection, taking the bluebook value, marital status and car age as our three strongest predictors.

mod5 <- lm(TARGET_AMT~BLUEBOOK+MSTATUS+CAR_AGE, train3)
summary(mod5)
## 
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MSTATUS + CAR_AGE, data = train3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8474  -3080  -1497    393  78051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3927.91538  446.64722   8.794  < 2e-16 ***
## BLUEBOOK       0.12024    0.02238   5.372 8.85e-08 ***
## MSTATUSz_No  857.41660  368.98923   2.324   0.0203 *  
## CAR_AGE      -57.37223   34.03320  -1.686   0.0920 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7599 on 1699 degrees of freedom
## Multiple R-squared:  0.01994,    Adjusted R-squared:  0.01821 
## F-statistic: 11.53 on 3 and 1699 DF,  p-value: 1.765e-07
#Adjusted R-squared:  0.01821
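
The report does not show how the forward selection was carried out. One common way to do it in base R is step() with direction = "forward", which adds terms by AIC; the sketch below uses simulated data (the data frame `toy` and its columns are assumptions) and may differ from the procedure actually used here.

```r
# Simulated data: only x1 truly drives y.
set.seed(3)
n   <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 5 + 2 * toy$x1 + rnorm(n)

# Start from the intercept-only model and let step() add terms by AIC.
null_fit <- lm(y ~ 1, data = toy)
fwd <- step(null_fit,
            scope     = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
            direction = "forward",
            trace     = 0)

# Terms retained by the procedure.
selected <- attr(terms(fwd), "term.labels")
print(selected)
```

For the claims model, the upper scope would list the candidate predictors in train3 and the starting model would be lm(TARGET_AMT ~ 1, train3).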

We see from the scatterplot here that only three cases have a predicted damage over $10,000, although actual claims of that size occur often. Our model also does not predict claims of lower value (under $4,000).

plot(fitted(mod5),train3$TARGET_AMT)
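
The narrow range of predictions follows from the mod5 coefficients reported above. The worked example below plugs in one hypothetical policy (the BLUEBOOK, marital status, and CAR_AGE values are invented) and then asks how large BLUEBOOK must be for a married customer with a new car to reach a $10,000 prediction.

```r
# Coefficients copied from summary(mod5) above.
b0       <- 3927.91538
b_blue   <- 0.12024
b_single <- 857.41660    # MSTATUSz_No
b_carage <- -57.37223

# Hypothetical policy: $15,000 car, not married, car is 8 years old.
pred <- b0 + b_blue * 15000 + b_single * 1 + b_carage * 8

# BLUEBOOK needed for a $10,000 prediction when married with CAR_AGE = 0.
bluebook_needed <- (10000 - b0) / b_blue

print(c(pred = pred, bluebook_needed = bluebook_needed))
```

The required BLUEBOOK of about $50,500 is close to the maximum in train1 ($69,740), which is why so few fitted values exceed $10,000.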

Zooming in and ignoring actual reports over $10,000, we notice that our fitted values seem to have no correlation to the target amount.

plot(fitted(mod5),train3$TARGET_AMT, ylim=c(0,10000))

From this zoomed-in plot of the residuals, we can see that the residuals drift negative as the predicted claim increases. While some residuals are strongly positive, because actual claims can be much higher than predicted, there are few large negative residuals. The histogram below confirms this.

plot(fitted(mod5),mod5$residuals, ylim=c(-20000,20000))

hist(mod5$residuals, breaks = 100)

Ultimately our model is too conservative. Simply put, there are too many unidentified situational variables that would affect the amount of damages filed.