From the summary() function, the training set has 8161 observations on 26 variables, which include an index and two response variables (TARGET_FLAG and TARGET_AMT).
From the summary, we note that the mean of TARGET_FLAG is 0.2638, which equals the proportion of cases that filed a claim.
We also note the number of NA values: AGE (6), YOJ (454), INCOME (445), HOME_VAL (464), and CAR_AGE (510).
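As a cross-check (a quick sketch, not part of the original output), these counts can be tallied directly:
# Tally NA counts per column instead of reading them off summary()
colSums(is.na(train[c("AGE", "YOJ", "INCOME", "HOME_VAL", "CAR_AGE")]))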
sum(train$TARGET_FLAG == 0)
## [1] 6008
sum(train$TARGET_AMT == 0)
## [1] 6008
Of the 8161 total cases, 6008 resulted in no claim. We confirm that for each no-claim case the claim amount is 0.
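A one-line check (not in the original) makes the confirmation explicit:
# TRUE if every no-claim case has a zero claim amount
all(train$TARGET_AMT[train$TARGET_FLAG == 0] == 0)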
There are also a few quantitative variables that are best expressed as categorical ordinal variables.
train$KIDSDRIV <- as.factor(train$KIDSDRIV)
train$HOMEKIDS <- as.factor(train$HOMEKIDS)
train$CLM_FREQ <- as.factor(train$CLM_FREQ)
There are several variables stored as currency-formatted text that are best expressed as quantitative ones.
train$INCOME <- as.numeric(gsub('\\$|,', '', train$INCOME))
train$HOME_VAL <- as.numeric(gsub('\\$|,', '', train$HOME_VAL))
train$BLUEBOOK <- as.numeric(gsub('\\$|,', '', train$BLUEBOOK))
train$OLDCLAIM <- as.numeric(gsub('\\$|,', '', train$OLDCLAIM))
Finally, we remove all cases with missing values, leaving 6448 cases, of which 4745 (73.6%) have no claim.
train1 <- train[complete.cases(train),]
nrow(train1)
## [1] 6448
sum(train1$TARGET_FLAG == 0) / nrow(train1)
## [1] 0.7358871
#6448 entries
summary(train1)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE
## Min. : 1 Min. :0.0000 Min. : 0 0:5683 Min. :16.00
## 1st Qu.: 2472 1st Qu.:0.0000 1st Qu.: 0 1: 493 1st Qu.:39.00
## Median : 5034 Median :0.0000 Median : 0 2: 220 Median :45.00
## Mean : 5083 Mean :0.2641 Mean : 1497 3: 50 Mean :44.74
## 3rd Qu.: 7671 3rd Qu.:1.0000 3rd Qu.: 1036 4: 2 3rd Qu.:51.00
## Max. :10302 Max. :1.0000 Max. :85524 Max. :81.00
##
## HOMEKIDS YOJ INCOME PARENT1 HOME_VAL
## 0:4178 Min. : 0.00 Min. : 0 No :5590 Min. : 0
## 1: 706 1st Qu.: 9.00 1st Qu.: 28180 Yes: 858 1st Qu.: 0
## 2: 876 Median :11.00 Median : 54287 Median :162331
## 3: 544 Mean :10.54 Mean : 62020 Mean :155438
## 4: 133 3rd Qu.:13.00 3rd Qu.: 86438 3rd Qu.:239934
## 5: 11 Max. :23.00 Max. :367030 Max. :885282
##
## MSTATUS SEX EDUCATION JOB
## Yes :3832 M :3000 <High School : 955 z_Blue Collar:1476
## z_No:2616 z_F:3448 Bachelors :1740 Clerical :1031
## Masters :1313 Professional : 868
## PhD : 568 Manager : 779
## z_High School:1872 Lawyer : 670
## Student : 537
## (Other) :1087
## TRAVTIME CAR_USE BLUEBOOK TIF
## Min. : 5.00 Commercial:2404 Min. : 1500 Min. : 1.000
## 1st Qu.: 23.00 Private :4044 1st Qu.: 9388 1st Qu.: 1.000
## Median : 33.00 Median :14460 Median : 4.000
## Mean : 33.66 Mean :15761 Mean : 5.377
## 3rd Qu.: 44.00 3rd Qu.:20820 3rd Qu.: 7.000
## Max. :142.00 Max. :69740 Max. :25.000
##
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED
## Minivan :1716 no :4566 Min. : 0 0:3984 No :5658
## Panel Truck: 528 yes:1882 1st Qu.: 0 1: 765 Yes: 790
## Pickup :1111 Median : 0 2: 915
## Sports Car : 719 Mean : 4046 3: 615
## Van : 577 3rd Qu.: 4600 4: 156
## z_SUV :1797 Max. :57037 5: 13
##
## MVR_PTS CAR_AGE URBANICITY
## Min. : 0.000 Min. :-3.000 Highly Urban/ Urban :5134
## 1st Qu.: 0.000 1st Qu.: 1.000 z_Highly Rural/ Rural:1314
## Median : 1.000 Median : 8.000
## Mean : 1.705 Mean : 8.296
## 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :13.000 Max. :28.000
##
Further, for variables with multiple categories, we construct probability tables to check for features associated with a higher incidence of claims. The ones we investigate are KIDSDRIV, HOMEKIDS, EDUCATION, JOB, and CAR_TYPE.
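The tables themselves are not reproduced here; the following sketch (an assumption, using a hypothetical helper prob_table) shows how they could be built, giving each level's share within the no-claim (0) and claim (1) groups:
# Hypothetical helper: column-wise shares of each level within the 0/1 groups
prob_table <- function(var) round(prop.table(table(train[[var]], train$TARGET_FLAG), margin = 2), 3)
lapply(c("KIDSDRIV", "HOMEKIDS", "EDUCATION", "JOB", "CAR_TYPE"), prob_table)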
KIDSDRIV
Having even one child who drives raises the incidence of making a claim, and as the number of driving children increases, the ratio of claims to non-claims rises.
ggplot(train, aes(KIDSDRIV, TARGET_FLAG)) +
  geom_jitter(aes(color = factor(TARGET_FLAG)), size = 0.5)
HOMEKIDS
Indeed, simply having kids at home makes it more likely that one has made a damage claim.
EDUCATION
We can see from this probability table that high-school-educated people have a far greater share of the claims cases, while holders of master's degrees and PhDs have a smaller share of the claims cases than of the non-claims cases.
JOB
Doctors, lawyers, and managers indeed have a lower incidence of claims than students and blue-collar workers.
CAR_TYPE
From the probability table of CAR_TYPE we note a higher incidence of claims among SUVs, sports cars, and pickups, and a significantly lower incidence among minivans.
Here we visualize age against claim occurrence, looking for indications that younger or older drivers are more likely to get into accidents.
ggplot(train, aes(AGE, TARGET_FLAG)) +
  geom_jitter(aes(color = factor(TARGET_FLAG)), size = 0.5)
## Warning: Removed 6 rows containing missing values (geom_point).
For younger drivers, a higher crash rate does seem to be the case.
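A quick tabulation (a sketch, not in the original) quantifies the visual impression by bracketing AGE:
# Claim rate by age bracket; rows with missing AGE are dropped automatically
tapply(train$TARGET_FLAG, cut(train$AGE, breaks = c(15, 25, 35, 45, 55, 65, 85)), mean)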
In this visualization, we are not convinced that increased travel time to work has the obvious positive association with claim incidence that one might expect.
ggplot(train1, aes(TRAVTIME, TARGET_FLAG)) +
  geom_jitter(aes(color = factor(TARGET_FLAG)), size = 0.5)
Home value looks to be predictive, with a higher median among customers without claims.
boxplot(HOME_VAL ~ TARGET_FLAG, data = train1, ylab="HOME_VAL", names=c("no claim","with claim"))
This gap in home value between customers with and without claims is further supported by the density plot below.
a <- ggplot(train1, aes(x = HOME_VAL))
a + geom_density(aes(fill = factor(TARGET_FLAG)), alpha = 0.4)
Also, the median past claim amount (OLDCLAIM) is higher for customers with claims than for those without.
boxplot(OLDCLAIM ~ TARGET_FLAG, data = train1, ylab="OLDCLAIM", names=c("no claim","with claim"))
Since the response variable for the first problem is binary, we use binomial logistic regression.
train2 <- train1[-c(1,3)] # drop INDEX and TARGET_AMT, leaving TARGET_FLAG as the response
mod1 <- glm(TARGET_FLAG~., train2, family=binomial)
summary(mod1)
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial, data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4080 -0.7086 -0.3939 0.6328 3.1667
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.063e+00 3.719e-01 -2.859 0.004243 **
## KIDSDRIV1 3.522e-01 1.302e-01 2.704 0.006847 **
## KIDSDRIV2 7.090e-01 1.842e-01 3.850 0.000118 ***
## KIDSDRIV3 8.120e-01 3.497e-01 2.322 0.020239 *
## KIDSDRIV4 -1.152e+01 2.035e+02 -0.057 0.954833
## AGE 2.202e-04 4.702e-03 0.047 0.962649
## HOMEKIDS1 3.924e-01 1.341e-01 2.926 0.003438 **
## HOMEKIDS2 2.702e-01 1.324e-01 2.041 0.041238 *
## HOMEKIDS3 1.695e-01 1.542e-01 1.100 0.271534
## HOMEKIDS4 1.012e-01 2.420e-01 0.418 0.675819
## HOMEKIDS5 4.411e-01 7.503e-01 0.588 0.556654
## YOJ -1.173e-02 9.618e-03 -1.219 0.222781
## INCOME -2.968e-06 1.267e-06 -2.343 0.019144 *
## PARENT1Yes 2.486e-01 1.363e-01 1.824 0.068207 .
## HOME_VAL -1.236e-06 3.906e-07 -3.165 0.001549 **
## MSTATUSz_No 5.227e-01 1.007e-01 5.191 2.09e-07 ***
## SEXz_F -2.033e-01 1.245e-01 -1.632 0.102621
## EDUCATIONBachelors -3.823e-01 1.310e-01 -2.918 0.003525 **
## EDUCATIONMasters -3.392e-01 2.061e-01 -1.646 0.099770 .
## EDUCATIONPhD -1.058e-01 2.454e-01 -0.431 0.666429
## EDUCATIONz_High School -9.990e-03 1.068e-01 -0.094 0.925451
## JOBClerical 5.076e-01 2.246e-01 2.260 0.023831 *
## JOBDoctor -1.827e-01 2.887e-01 -0.633 0.526760
## JOBHome Maker 2.218e-01 2.420e-01 0.917 0.359394
## JOBLawyer 2.658e-01 1.912e-01 1.390 0.164505
## JOBManager -5.721e-01 1.982e-01 -2.887 0.003893 **
## JOBProfessional 2.106e-01 2.037e-01 1.034 0.301042
## JOBStudent 1.782e-01 2.482e-01 0.718 0.472736
## JOBz_Blue Collar 3.126e-01 2.126e-01 1.470 0.141500
## TRAVTIME 1.577e-02 2.125e-03 7.424 1.14e-13 ***
## CAR_USEPrivate -8.296e-01 1.044e-01 -7.946 1.92e-15 ***
## BLUEBOOK -2.051e-05 5.901e-06 -3.476 0.000509 ***
## TIF -5.375e-02 8.289e-03 -6.485 8.89e-11 ***
## CAR_TYPEPanel Truck 5.796e-01 1.814e-01 3.196 0.001395 **
## CAR_TYPEPickup 5.218e-01 1.139e-01 4.583 4.58e-06 ***
## CAR_TYPESports Car 1.132e+00 1.452e-01 7.791 6.64e-15 ***
## CAR_TYPEVan 6.134e-01 1.427e-01 4.299 1.71e-05 ***
## CAR_TYPEz_SUV 8.522e-01 1.245e-01 6.847 7.57e-12 ***
## RED_CARyes -1.124e-01 9.722e-02 -1.156 0.247619
## OLDCLAIM -1.857e-05 4.712e-06 -3.941 8.12e-05 ***
## CLM_FREQ1 5.611e-01 1.135e-01 4.943 7.68e-07 ***
## CLM_FREQ2 6.408e-01 1.063e-01 6.030 1.64e-09 ***
## CLM_FREQ3 6.259e-01 1.196e-01 5.231 1.69e-07 ***
## CLM_FREQ4 7.651e-01 1.950e-01 3.923 8.74e-05 ***
## CLM_FREQ5 9.112e-01 6.705e-01 1.359 0.174131
## REVOKEDYes 9.373e-01 1.052e-01 8.912 < 2e-16 ***
## MVR_PTS 9.934e-02 1.587e-02 6.260 3.84e-10 ***
## CAR_AGE -7.039e-03 8.474e-03 -0.831 0.406152
## URBANICITYz_Highly Rural/ Rural -2.268e+00 1.246e-01 -18.200 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7445.1 on 6447 degrees of freedom
## Residual deviance: 5736.7 on 6399 degrees of freedom
## AIC: 5834.7
##
## Number of Fisher Scoring iterations: 11
# AIC: 5834.7
Checking for variables whose coefficient confidence intervals all contain 0, we find the following: AGE, YOJ, PARENT1, SEX, RED_CAR, and CAR_AGE.
(ci <- round(confint.default(mod1),3))
## 2.5 % 97.5 %
## (Intercept) -1.792 -0.334
## KIDSDRIV1 0.097 0.607
## KIDSDRIV2 0.348 1.070
## KIDSDRIV3 0.127 1.497
## KIDSDRIV4 -410.327 387.278
## AGE -0.009 0.009
## HOMEKIDS1 0.130 0.655
## HOMEKIDS2 0.011 0.530
## HOMEKIDS3 -0.133 0.472
## HOMEKIDS4 -0.373 0.575
## HOMEKIDS5 -1.030 1.912
## YOJ -0.031 0.007
## INCOME 0.000 0.000
## PARENT1Yes -0.019 0.516
## HOME_VAL 0.000 0.000
## MSTATUSz_No 0.325 0.720
## SEXz_F -0.447 0.041
## EDUCATIONBachelors -0.639 -0.125
## EDUCATIONMasters -0.743 0.065
## EDUCATIONPhD -0.587 0.375
## EDUCATIONz_High School -0.219 0.199
## JOBClerical 0.067 0.948
## JOBDoctor -0.749 0.383
## JOBHome Maker -0.253 0.696
## JOBLawyer -0.109 0.640
## JOBManager -0.961 -0.184
## JOBProfessional -0.189 0.610
## JOBStudent -0.308 0.665
## JOBz_Blue Collar -0.104 0.729
## TRAVTIME 0.012 0.020
## CAR_USEPrivate -1.034 -0.625
## BLUEBOOK 0.000 0.000
## TIF -0.070 -0.038
## CAR_TYPEPanel Truck 0.224 0.935
## CAR_TYPEPickup 0.299 0.745
## CAR_TYPESports Car 0.847 1.416
## CAR_TYPEVan 0.334 0.893
## CAR_TYPEz_SUV 0.608 1.096
## RED_CARyes -0.303 0.078
## OLDCLAIM 0.000 0.000
## CLM_FREQ1 0.339 0.784
## CLM_FREQ2 0.432 0.849
## CLM_FREQ3 0.391 0.860
## CLM_FREQ4 0.383 1.147
## CLM_FREQ5 -0.403 2.225
## REVOKEDYes 0.731 1.143
## MVR_PTS 0.068 0.130
## CAR_AGE -0.024 0.010
## URBANICITYz_Highly Rural/ Rural -2.513 -2.024
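A programmatic version of this check (a sketch, not in the original) lists every coefficient whose interval straddles 0; the six variables above are those for which this holds for every associated coefficient:
ci_raw <- confint.default(mod1)  # unrounded intervals
rownames(ci_raw)[ci_raw[, 1] < 0 & ci_raw[, 2] > 0]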
AIC goes down slightly with the removal of those six variables, improving the model's parsimony without sacrificing much performance. The residual deviance does go up, from 5736.7 on 6399 degrees of freedom to 5745.4 on 6405 degrees of freedom.
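The intermediate fit is not shown in the original; presumably it was produced along these lines (an assumption, with the hypothetical name mod2, dropping exactly the six variables named above):
# Assumed intermediate model: full model minus the six variables whose CIs contain 0
mod2 <- glm(TARGET_FLAG ~ . - AGE - YOJ - PARENT1 - SEX - RED_CAR - CAR_AGE, train2, family = binomial)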
Finally, noting that the eight JOB dummy variables contribute little to performance, we remove them as well, tolerating the rise in AIC from 5831.4 to 5872.6 and the accompanying rise in residual deviance.
mod3 <- glm(TARGET_FLAG~.-AGE-YOJ-PARENT1-SEX-RED_CAR-CAR_AGE-JOB, train2, family=binomial)
summary(mod3)
##
## Call:
## glm(formula = TARGET_FLAG ~ . - AGE - YOJ - PARENT1 - SEX - RED_CAR -
## CAR_AGE - JOB, family = binomial, data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3599 -0.7197 -0.4041 0.6478 3.1750
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.300e-01 1.860e-01 -5.000 5.74e-07 ***
## KIDSDRIV1 3.538e-01 1.272e-01 2.782 0.005396 **
## KIDSDRIV2 6.703e-01 1.803e-01 3.718 0.000201 ***
## KIDSDRIV3 7.474e-01 3.443e-01 2.171 0.029934 *
## KIDSDRIV4 -1.159e+01 2.058e+02 -0.056 0.955081
## HOMEKIDS1 5.168e-01 1.051e-01 4.920 8.67e-07 ***
## HOMEKIDS2 3.744e-01 1.048e-01 3.573 0.000352 ***
## HOMEKIDS3 2.766e-01 1.290e-01 2.144 0.032024 *
## HOMEKIDS4 1.923e-01 2.211e-01 0.870 0.384480
## HOMEKIDS5 5.312e-01 7.366e-01 0.721 0.470848
## INCOME -4.288e-06 1.131e-06 -3.792 0.000150 ***
## HOME_VAL -1.051e-06 3.758e-07 -2.798 0.005145 **
## MSTATUSz_No 6.424e-01 8.133e-02 7.898 2.83e-15 ***
## EDUCATIONBachelors -5.950e-01 1.111e-01 -5.354 8.60e-08 ***
## EDUCATIONMasters -6.281e-01 1.246e-01 -5.041 4.62e-07 ***
## EDUCATIONPhD -5.122e-01 1.683e-01 -3.044 0.002333 **
## EDUCATIONz_High School -9.102e-02 1.031e-01 -0.883 0.377502
## TRAVTIME 1.615e-02 2.107e-03 7.666 1.78e-14 ***
## CAR_USEPrivate -8.733e-01 8.286e-02 -10.540 < 2e-16 ***
## BLUEBOOK -2.352e-05 5.294e-06 -4.443 8.88e-06 ***
## TIF -5.328e-02 8.233e-03 -6.471 9.73e-11 ***
## CAR_TYPEPanel Truck 5.646e-01 1.605e-01 3.518 0.000435 ***
## CAR_TYPEPickup 4.641e-01 1.106e-01 4.197 2.70e-05 ***
## CAR_TYPESports Car 1.007e+00 1.185e-01 8.495 < 2e-16 ***
## CAR_TYPEVan 6.198e-01 1.356e-01 4.569 4.90e-06 ***
## CAR_TYPEz_SUV 7.556e-01 9.581e-02 7.886 3.11e-15 ***
## OLDCLAIM -1.929e-05 4.660e-06 -4.140 3.48e-05 ***
## CLM_FREQ1 5.683e-01 1.126e-01 5.048 4.47e-07 ***
## CLM_FREQ2 6.524e-01 1.056e-01 6.178 6.48e-10 ***
## CLM_FREQ3 6.309e-01 1.188e-01 5.311 1.09e-07 ***
## CLM_FREQ4 7.580e-01 1.930e-01 3.928 8.57e-05 ***
## CLM_FREQ5 6.775e-01 6.618e-01 1.024 0.305930
## REVOKEDYes 9.470e-01 1.041e-01 9.094 < 2e-16 ***
## MVR_PTS 1.029e-01 1.573e-02 6.541 6.12e-11 ***
## URBANICITYz_Highly Rural/ Rural -2.205e+00 1.240e-01 -17.777 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7445.1 on 6447 degrees of freedom
## Residual deviance: 5802.6 on 6413 degrees of freedom
## AIC: 5872.6
##
## Number of Fisher Scoring iterations: 11
#AIC: 5872.6
train2$predict <- fitted(mod3) # fitted values from a binomial glm are already on the probability scale
mean(train2$predict)
## [1] 0.2641129
The average scored probability across all cases is exactly 26.4%, the observed claim rate; this equality is a property of logistic regression fitted with an intercept.
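A one-line check (not in the original) confirms the match with the observed rate:
mean(train2$TARGET_FLAG)  # observed claim rate; equals the mean fitted probability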
train2$t50 <- ifelse(fitted(mod3) > 0.50, 1, 0)
mean(train2$t50)
## [1] 0.1670285
However, when we take a 50% threshold to predict binary responses, we see the mean drops to 16.7%, suggesting that our model does not perform well at detecting eventual claims (predicting positives); in other words, there is an overprevalence of false negatives. The plot below also indicates this, with a heavy density of points in quadrant II (actual claims that were not predicted).
ggplot(train2, aes(t50, TARGET_FLAG)) +
  geom_jitter(aes(color = factor(TARGET_FLAG)), size = 0.5)
If the null hypothesis is that the case will not result in a claim, a type II error is the failure to reject a false null hypothesis (a "false negative"). This means the model's sensitivity is too low. We can confirm this with the caret package's confusionMatrix() function.
actual <- as.factor(train2[,1])     # column 1: TARGET_FLAG
predicted <- as.factor(train2[,26]) # column 26: t50, the thresholded predictions
cmatrix.t <- t(table(actual,predicted))
Caret_cmat <- confusionMatrix(cmatrix.t, positive = "1")
Caret_cmat
## Confusion Matrix and Statistics
##
## actual
## predicted 0 1
## 0 4368 1003
## 1 377 700
##
## Accuracy : 0.786
## 95% CI : (0.7758, 0.7959)
## No Information Rate : 0.7359
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3759
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4110
## Specificity : 0.9205
## Pos Pred Value : 0.6500
## Neg Pred Value : 0.8133
## Prevalence : 0.2641
## Detection Rate : 0.1086
## Detection Prevalence : 0.1670
## Balanced Accuracy : 0.6658
##
## 'Positive' Class : 1
##
We rate our model using the ROC (receiver operating characteristic) curve, a plot that illustrates the performance of a binary classifier as its discrimination threshold is varied.
rocCurve <- roc(train2$TARGET_FLAG, train2$predict, levels=c(0,1))
plot(rocCurve, legacy.axes = TRUE)
The result is a modest area under the curve of 81.2%.
auc(rocCurve)
## Area under the curve: 0.8116
In summary, we selected our model based on AIC, residual deviance, confusion matrix metrics, and ROC area under the curve.
Next, we move on to building a model that predicts the damage amount of filed claims. There are 1703 such claims in our training set.
train3 <- train1
train3 <- train3[train3$TARGET_FLAG == 1,] # keep only cases that filed a claim
train3 <- train3[-c(1,2)]                  # drop INDEX and TARGET_FLAG
nrow(train3)
## [1] 1703
head(train3)
## TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1 HOME_VAL MSTATUS
## 6 2946.000 0 34 1 12 125301 Yes 0 z_No
## 9 2501.000 0 34 0 10 62978 No 0 z_No
## 11 6077.000 0 53 0 14 77100 No 0 z_No
## 14 1267.000 0 53 0 11 130795 No 0 z_No
## 15 2920.167 0 45 0 0 0 No 106859 Yes
## 20 6857.000 0 28 1 13 44077 No 170598 Yes
## SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF
## 6 z_F Bachelors z_Blue Collar 46 Commercial 17430 1
## 9 z_F Bachelors Clerical 34 Private 11200 1
## 11 z_F Masters Lawyer 15 Private 18300 1
## 14 M PhD 64 Commercial 28340 6
## 15 z_F <High School Home Maker 48 Private 6000 1
## 20 z_F z_High School z_Blue Collar 29 Commercial 8710 6
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 6 Sports Car no 0 0 No 0 7
## 9 z_SUV no 0 0 No 0 1
## 11 Sports Car no 0 0 No 0 11
## 14 Panel Truck yes 0 0 No 3 10
## 15 z_SUV no 0 0 No 3 5
## 20 z_SUV no 8935 2 No 0 1
## URBANICITY
## 6 Highly Urban/ Urban
## 9 Highly Urban/ Urban
## 11 Highly Urban/ Urban
## 14 Highly Urban/ Urban
## 15 Highly Urban/ Urban
## 20 Highly Urban/ Urban
The full model with all explanatory variables has a low goodness of fit, with an adjusted R-squared of only 0.02278. The covariate with the lowest p-value is the BLUEBOOK value, suggesting it is the most useful predictor.
mod4 <- lm(TARGET_AMT~., train3)
summary(mod4)
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11214 -3202 -1449 675 75880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.734e+03 2.133e+03 1.750 0.08022 .
## KIDSDRIV1 6.633e+02 7.052e+02 0.941 0.34709
## KIDSDRIV2 -1.162e+03 9.433e+02 -1.232 0.21821
## KIDSDRIV3 -8.082e+02 1.671e+03 -0.484 0.62871
## AGE 8.248e-01 2.466e+01 0.033 0.97332
## HOMEKIDS1 4.742e+02 7.650e+02 0.620 0.53545
## HOMEKIDS2 1.758e+03 7.533e+02 2.333 0.01975 *
## HOMEKIDS3 6.053e+02 8.472e+02 0.714 0.47505
## HOMEKIDS4 6.801e+02 1.303e+03 0.522 0.60170
## HOMEKIDS5 1.268e+03 3.905e+03 0.325 0.74551
## YOJ 1.178e+01 5.484e+01 0.215 0.82992
## INCOME -1.580e-02 7.860e-03 -2.011 0.04450 *
## PARENT1Yes -6.457e+02 7.500e+02 -0.861 0.38942
## HOME_VAL 2.182e-03 2.271e-03 0.960 0.33698
## MSTATUSz_No 1.620e+03 5.978e+02 2.710 0.00680 **
## SEXz_F -2.007e+03 7.195e+02 -2.789 0.00535 **
## EDUCATIONBachelors 1.039e+02 7.181e+02 0.145 0.88498
## EDUCATIONMasters 8.512e+02 1.253e+03 0.679 0.49698
## EDUCATIONPhD 3.152e+03 1.485e+03 2.123 0.03390 *
## EDUCATIONz_High School -7.102e+02 5.656e+02 -1.256 0.20946
## JOBClerical -6.853e+02 1.367e+03 -0.501 0.61622
## JOBDoctor -3.556e+03 1.875e+03 -1.897 0.05807 .
## JOBHome Maker -5.814e+02 1.448e+03 -0.401 0.68820
## JOBLawyer -2.301e+02 1.149e+03 -0.200 0.84132
## JOBManager -1.336e+03 1.226e+03 -1.090 0.27587
## JOBProfessional 1.312e+03 1.283e+03 1.022 0.30676
## JOBStudent -6.343e+02 1.477e+03 -0.429 0.66775
## JOBz_Blue Collar 4.918e+02 1.306e+03 0.377 0.70657
## TRAVTIME 4.357e+00 1.238e+01 0.352 0.72487
## CAR_USEPrivate -2.944e+02 5.869e+02 -0.502 0.61597
## BLUEBOOK 1.516e-01 3.384e-02 4.480 7.96e-06 ***
## TIF -4.760e+00 4.713e+01 -0.101 0.91957
## CAR_TYPEPanel Truck -3.201e+02 1.057e+03 -0.303 0.76207
## CAR_TYPEPickup 3.444e+02 6.643e+02 0.518 0.60426
## CAR_TYPESports Car 2.118e+03 8.280e+02 2.558 0.01061 *
## CAR_TYPEVan -2.606e+02 8.608e+02 -0.303 0.76208
## CAR_TYPEz_SUV 1.851e+03 7.390e+02 2.505 0.01234 *
## RED_CARyes -2.594e+02 5.528e+02 -0.469 0.63892
## OLDCLAIM 4.486e-02 2.755e-02 1.628 0.10367
## CLM_FREQ1 2.982e+02 6.288e+02 0.474 0.63538
## CLM_FREQ2 -4.174e+02 5.867e+02 -0.711 0.47688
## CLM_FREQ3 -8.287e+02 6.467e+02 -1.281 0.20026
## CLM_FREQ4 4.424e+01 1.047e+03 0.042 0.96631
## CLM_FREQ5 -1.444e+03 3.457e+03 -0.418 0.67632
## REVOKEDYes -1.186e+03 6.015e+02 -1.972 0.04878 *
## MVR_PTS 7.061e+01 7.799e+01 0.905 0.36536
## CAR_AGE -9.724e+01 4.886e+01 -1.990 0.04672 *
## URBANICITYz_Highly Rural/ Rural -7.134e+01 8.203e+02 -0.087 0.93070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7581 on 1655 degrees of freedom
## Multiple R-squared: 0.04977, Adjusted R-squared: 0.02278
## F-statistic: 1.844 on 47 and 1655 DF, p-value: 0.0004915
#Adjusted R-squared: 0.02278
We settled on a model using forward selection, taking the BLUEBOOK value, marital status, and car age as our three strongest predictors.
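The selection run itself is not shown; a sketch of how it might be done with step() (an assumption, starting from an intercept-only model and allowing all predictors in the upper scope):
null_mod <- lm(TARGET_AMT ~ 1, data = train3)               # intercept-only starting point
full_scope <- formula(terms(TARGET_AMT ~ ., data = train3)) # expand "." over all predictors
fwd <- step(null_mod, scope = full_scope, direction = "forward", trace = FALSE)
The three-predictor model we settled on is fit directly below.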
mod5 <- lm(TARGET_AMT~BLUEBOOK+MSTATUS+CAR_AGE, train3)
summary(mod5)
##
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MSTATUS + CAR_AGE, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8474 -3080 -1497 393 78051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3927.91538 446.64722 8.794 < 2e-16 ***
## BLUEBOOK 0.12024 0.02238 5.372 8.85e-08 ***
## MSTATUSz_No 857.41660 368.98923 2.324 0.0203 *
## CAR_AGE -57.37223 34.03320 -1.686 0.0920 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7599 on 1699 degrees of freedom
## Multiple R-squared: 0.01994, Adjusted R-squared: 0.01821
## F-statistic: 11.53 on 3 and 1699 DF, p-value: 1.765e-07
#Adjusted R-squared: 0.01821
We see from the scatterplot below that there are only three cases in which the predicted damage exceeds $10,000, even though such amounts occur often in actuality. Our model also never predicts claims of lower value (under $4,000).
plot(fitted(mod5),train3$TARGET_AMT)
Zooming in and ignoring actual reports over $10,000, we notice that our fitted values seem to have little correlation with the target amount.
plot(fitted(mod5),train3$TARGET_AMT, ylim=c(0,10000))
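A numeric check (not in the original) backs this up; for least squares with an intercept, the squared correlation between fitted and actual values equals the multiple R-squared reported above:
cor(fitted(mod5), train3$TARGET_AMT)^2  # equals mod5's Multiple R-squared, 0.01994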
From this zoomed-in plot of the residuals, we can see that the residuals drift negative as the predicted claim increases. While we have some strongly positive residuals, where actual claims are far higher than predicted, there are few large negative residuals. The histogram below confirms this.
plot(fitted(mod5),mod5$residuals, ylim=c(-20000,20000))
hist(mod5$residuals, breaks = 100)
Ultimately, our model is too conservative. Simply put, there are too many unidentified situational variables affecting the amount of damages filed.