Using a function created by groupmate David Blumenstiel I will fetch, clean and prep the data for this assignment.
As the first step in data exploration I use the skim function form the skimr package. This shows missing data, mean, percentiles and a histogram of the distribution of all of the data fields all in one output.
We can see there are no fields with missing data so we will not have to address this issue
Next we will use the funModeling package to produce a quick correlation table with the target variable to determine if there are any noteworthy features in the model. We can see that there really aren’t any stron correlations. The only correlation over 0.20 absolute value is MVR_PTS.
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS
## Min. :0.0000 Min. : 0 0:7180 Min. :16.00 0:5289
## 1st Qu.:0.0000 1st Qu.: 0 1: 636 1st Qu.:39.00 1: 902
## Median :0.0000 Median : 0 2: 279 Median :45.00 2:1118
## Mean :0.2638 Mean : 1504 3: 62 Mean :44.79 3: 674
## 3rd Qu.:1.0000 3rd Qu.: 1036 4: 4 3rd Qu.:51.00 4: 164
## Max. :1.0000 Max. :107586 Max. :81.00 5: 14
## YOJ INCOME PARENT1 HOME_VAL
## Min. : 0.00 Min. : 0 Length:8161 Min. : 0
## 1st Qu.: 9.00 1st Qu.: 27964 Class :character 1st Qu.: 0
## Median :12.00 Median : 54005 Mode :character Median :151957
## Mean :10.53 Mean : 60952 Mean :146062
## 3rd Qu.:13.00 3rd Qu.: 83464 3rd Qu.:233352
## Max. :23.00 Max. :367030 Max. :885282
## MSTATUS SEX EDUCATION JOB
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## TRAVTIME CAR_USE BLUEBOOK TIF
## Min. : 5.00 Length:8161 Min. : 1500 Min. : 1.000
## 1st Qu.: 22.00 Class :character 1st Qu.: 9280 1st Qu.: 1.000
## Median : 33.00 Mode :character Median :14440 Median : 4.000
## Mean : 33.49 Mean :15710 Mean : 5.351
## 3rd Qu.: 44.00 3rd Qu.:20850 3rd Qu.: 7.000
## Max. :142.00 Max. :69740 Max. :25.000
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ
## Length:8161 Length:8161 Min. : 0 0:5009
## Class :character Class :character 1st Qu.: 0 1: 997
## Mode :character Mode :character Median : 0 2:1171
## Mean : 4037 3: 776
## 3rd Qu.: 4636 4: 190
## Max. :57037 5: 18
## REVOKED MVR_PTS CAR_AGE URBANICITY
## Length:8161 Min. : 0.000 Min. : 0.000 Length:8161
## Class :character 1st Qu.: 0.000 1st Qu.: 4.000 Class :character
## Mode :character Median : 1.000 Median : 8.000 Mode :character
## Mean : 1.696 Mean : 8.337
## 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :13.000 Max. :28.000
| Name | train |
| Number of rows | 8161 |
| Number of columns | 25 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| factor | 3 |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| PARENT1 | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| MSTATUS | 0 | 1 | 3 | 4 | 0 | 2 | 0 |
| SEX | 0 | 1 | 1 | 3 | 0 | 2 | 0 |
| EDUCATION | 0 | 1 | 3 | 13 | 0 | 5 | 0 |
| JOB | 0 | 1 | 6 | 13 | 0 | 10 | 0 |
| CAR_USE | 0 | 1 | 7 | 10 | 0 | 2 | 0 |
| CAR_TYPE | 0 | 1 | 3 | 11 | 0 | 6 | 0 |
| RED_CAR | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| REVOKED | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| URBANICITY | 0 | 1 | 19 | 21 | 0 | 2 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| KIDSDRIV | 0 | 1 | FALSE | 5 | 0: 7180, 1: 636, 2: 279, 3: 62 |
| HOMEKIDS | 0 | 1 | FALSE | 6 | 0: 5289, 2: 1118, 1: 902, 3: 674 |
| CLM_FREQ | 0 | 1 | FALSE | 6 | 0: 5009, 2: 1171, 1: 997, 3: 776 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_FLAG | 0 | 1 | 0.26 | 0.44 | 0 | 0 | 0 | 1 | 1.0 | ▇▁▁▁▃ |
| TARGET_AMT | 0 | 1 | 1504.32 | 4704.03 | 0 | 0 | 0 | 1036 | 107586.1 | ▇▁▁▁▁ |
| AGE | 0 | 1 | 44.79 | 8.62 | 16 | 39 | 45 | 51 | 81.0 | ▁▆▇▂▁ |
| YOJ | 0 | 1 | 10.53 | 4.01 | 0 | 9 | 12 | 13 | 23.0 | ▂▃▇▃▁ |
| INCOME | 0 | 1 | 60952.00 | 47030.19 | 0 | 27964 | 54005 | 83464 | 367030.0 | ▇▃▁▁▁ |
| HOME_VAL | 0 | 1 | 146062.19 | 130426.72 | 0 | 0 | 151957 | 233352 | 885282.0 | ▇▅▁▁▁ |
| TRAVTIME | 0 | 1 | 33.49 | 15.91 | 5 | 22 | 33 | 44 | 142.0 | ▇▇▁▁▁ |
| BLUEBOOK | 0 | 1 | 15709.90 | 8419.73 | 1500 | 9280 | 14440 | 20850 | 69740.0 | ▇▆▁▁▁ |
| TIF | 0 | 1 | 5.35 | 4.15 | 1 | 1 | 4 | 7 | 25.0 | ▇▆▁▁▁ |
| OLDCLAIM | 0 | 1 | 4037.08 | 8777.14 | 0 | 0 | 0 | 4636 | 57037.0 | ▇▁▁▁▁ |
| MVR_PTS | 0 | 1 | 1.70 | 2.15 | 0 | 0 | 1 | 3 | 13.0 | ▇▂▁▁▁ |
| CAR_AGE | 0 | 1 | 8.34 | 5.53 | 0 | 4 | 8 | 12 | 28.0 | ▆▇▃▂▁ |
## Variable TARGET_AMT
## 1 TARGET_AMT 1.00
## 2 MVR_PTS 0.14
## 3 OLDCLAIM 0.07
## 4 TRAVTIME 0.03
## 5 BLUEBOOK 0.00
## 6 YOJ -0.02
## 7 AGE -0.04
## 8 TIF -0.05
## 9 INCOME -0.06
## 10 CAR_AGE -0.06
## 11 HOME_VAL -0.08
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(pROC)
library(caret)
train$MVR_CLM<-train$MVR_PTS*as.numeric(train$CLM_FREQ)
train$INC_CAR<-train$INCOME/as.numeric(train$BLUEBOOK)
if (train$AGE <30){train$YOUNG<-1} else {train$YOUNG<-0}
## Warning in if (train$AGE < 30) {: the condition has length > 1 and only the
## first element will be used
if (train$AGE >55){train$RETIRE<-1} else {train$RETIRE<-0}
## Warning in if (train$AGE > 55) {: the condition has length > 1 and only the
## first element will be used
if (train$REVOKED =="Yes"){train$REVOKED<-1} else {train$REVOKED<-0}
## Warning in if (train$REVOKED == "Yes") {: the condition has length > 1 and only
## the first element will be used
train$MVR_REV<-train$REVOKED*train$MVR_PTS
if (train$RED_CAR == "yes" & train$CAR_TYPE == "Sports Car"){train$RED_SPORT<-1} else {train$RED_SPORT<-0}
## Warning in if (train$RED_CAR == "yes" & train$CAR_TYPE == "Sports Car") {: the
## condition has length > 1 and only the first element will be used
#train$RED_CAR <- replace(train$RED_CAR, "no", 0)
train$TARGET_FLAG<-as.factor(train$TARGET_FLAG)
train$KIDSDRIV<-as.numeric(train$KIDSDRIV)
train$HOMEKIDS<-as.numeric(train$HOMEKIDS)
train$CLM_FREQ<-as.numeric(train$CLM_FREQ)
set.seed(1005)
training_partition <- createDataPartition(train$TARGET_FLAG, p=0.75, list = FALSE, times=1)
train <- train[training_partition, ]
test <- train[-training_partition, ]
step <- glm(TARGET_FLAG~KIDSDRIV+AGE+HOMEKIDS+YOJ+INCOME+PARENT1+HOME_VAL+MSTATUS+SEX+EDUCATION+TRAVTIME+CAR_USE+BLUEBOOK+TIF+CAR_TYPE+RED_CAR+OLDCLAIM+CLM_FREQ+REVOKED+MVR_PTS+CAR_AGE+URBANICITY+MVR_CLM+INC_CAR#+YOUNG+RETIRE+MVR_REV+RED_SPORT
, data = train, family="binomial") %>%
stepAIC(trace = FALSE)
summary(step)
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL +
## MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + TIF +
## CAR_TYPE + CLM_FREQ + MVR_PTS + URBANICITY + MVR_CLM, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3245 -0.7329 -0.4136 0.7014 3.1564
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.488e+00 2.080e-01 -7.154 8.43e-13 ***
## KIDSDRIV 4.322e-01 6.371e-02 6.783 1.18e-11 ***
## INCOME -4.906e-06 1.105e-06 -4.442 8.92e-06 ***
## PARENT1Yes 4.842e-01 1.076e-01 4.498 6.85e-06 ***
## HOME_VAL -1.113e-06 3.563e-07 -3.122 0.00180 **
## MSTATUSz_No 5.138e-01 8.792e-02 5.844 5.11e-09 ***
## EDUCATIONBachelors -6.173e-01 1.113e-01 -5.548 2.89e-08 ***
## EDUCATIONMasters -6.114e-01 1.255e-01 -4.872 1.11e-06 ***
## EDUCATIONPhD -4.687e-01 1.665e-01 -2.815 0.00488 **
## EDUCATIONz_High School -6.413e-02 1.042e-01 -0.615 0.53846
## TRAVTIME 1.528e-02 2.131e-03 7.168 7.61e-13 ***
## CAR_USEPrivate -8.059e-01 8.343e-02 -9.659 < 2e-16 ***
## BLUEBOOK -1.657e-05 5.311e-06 -3.120 0.00181 **
## TIF -5.982e-02 8.399e-03 -7.123 1.06e-12 ***
## CAR_TYPEPanel Truck 4.066e-01 1.664e-01 2.444 0.01453 *
## CAR_TYPEPickup 5.749e-01 1.126e-01 5.108 3.26e-07 ***
## CAR_TYPESports Car 1.058e+00 1.211e-01 8.737 < 2e-16 ***
## CAR_TYPEVan 6.039e-01 1.369e-01 4.412 1.02e-05 ***
## CAR_TYPEz_SUV 7.965e-01 9.754e-02 8.166 3.19e-16 ***
## CLM_FREQ 2.532e-01 3.849e-02 6.579 4.74e-11 ***
## MVR_PTS 2.391e-01 3.480e-02 6.870 6.42e-12 ***
## URBANICITYz_Highly Rural/ Rural -2.358e+00 1.297e-01 -18.183 < 2e-16 ***
## MVR_CLM -5.358e-02 1.274e-02 -4.205 2.61e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7064.1 on 6120 degrees of freedom
## Residual deviance: 5593.8 on 6098 degrees of freedom
## AIC: 5639.8
##
## Number of Fisher Scoring iterations: 5
test$predictions<-predict(step, test, type="response")
test$predicted = as.factor(ifelse(test$predictions >= 0.5, 1, 0))
confusionMatrix(test$predicted, test$TARGET_FLAG, positive = '1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1054 244
## 1 86 155
##
## Accuracy : 0.7856
## 95% CI : (0.7642, 0.8058)
## No Information Rate : 0.7407
## P-Value [Acc > NIR] : 2.442e-05
##
## Kappa : 0.3593
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3885
## Specificity : 0.9246
## Pos Pred Value : 0.6432
## Neg Pred Value : 0.8120
## Prevalence : 0.2593
## Detection Rate : 0.1007
## Detection Prevalence : 0.1566
## Balanced Accuracy : 0.6565
##
## 'Positive' Class : 1
##
proc = roc(test$TARGET_FLAG, test$predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(proc)
print(proc$auc)
## Area under the curve: 0.8067
#strain_amt
stepl <- lm(TARGET_AMT~KIDSDRIV+AGE+HOMEKIDS+YOJ+INCOME+PARENT1+HOME_VAL+MSTATUS+SEX+EDUCATION+TRAVTIME+CAR_USE+BLUEBOOK+TIF+CAR_TYPE+RED_CAR+OLDCLAIM+CLM_FREQ+REVOKED+MVR_PTS+CAR_AGE+URBANICITY+MVR_CLM+INC_CAR#+YOUNG+RETIRE+MVR_REV+RED_SPORT
, data = train) %>%
stepAIC(trace = FALSE)
summary(stepl)
##
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL +
## MSTATUS + SEX + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK +
## TIF + CAR_TYPE + CLM_FREQ + MVR_PTS + URBANICITY + MVR_CLM,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5210 -1674 -763 330 83355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.049e+03 3.578e+02 2.932 0.003378 **
## KIDSDRIV 3.563e+02 1.153e+02 3.092 0.001998 **
## INCOME -5.454e-03 1.864e-03 -2.925 0.003455 **
## PARENT1Yes 6.732e+02 1.996e+02 3.372 0.000750 ***
## HOME_VAL -1.315e-03 6.085e-04 -2.162 0.030669 *
## MSTATUSz_No 4.665e+02 1.512e+02 3.085 0.002047 **
## SEXz_F -3.972e+02 1.795e+02 -2.212 0.026971 *
## EDUCATIONBachelors -5.935e+02 1.925e+02 -3.083 0.002058 **
## EDUCATIONMasters -5.722e+02 2.155e+02 -2.655 0.007954 **
## EDUCATIONPhD -3.272e+02 2.840e+02 -1.152 0.249412
## EDUCATIONz_High School -2.059e+02 1.853e+02 -1.111 0.266633
## TRAVTIME 1.041e+01 3.588e+00 2.901 0.003730 **
## CAR_USEPrivate -7.319e+02 1.457e+02 -5.022 5.25e-07 ***
## BLUEBOOK 2.835e-02 9.563e-03 2.965 0.003043 **
## TIF -4.916e+01 1.360e+01 -3.613 0.000305 ***
## CAR_TYPEPanel Truck 1.039e+02 2.966e+02 0.350 0.726030
## CAR_TYPEPickup 3.107e+02 1.865e+02 1.666 0.095771 .
## CAR_TYPESports Car 1.224e+03 2.428e+02 5.039 4.81e-07 ***
## CAR_TYPEVan 3.292e+02 2.350e+02 1.401 0.161381
## CAR_TYPEz_SUV 8.868e+02 2.003e+02 4.428 9.66e-06 ***
## CLM_FREQ 2.112e+02 7.164e+01 2.949 0.003202 **
## MVR_PTS 2.731e+02 6.198e+01 4.405 1.07e-05 ***
## URBANICITYz_Highly Rural/ Rural -1.596e+03 1.526e+02 -10.453 < 2e-16 ***
## MVR_CLM -5.205e+01 2.355e+01 -2.210 0.027160 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4414 on 6097 degrees of freedom
## Multiple R-squared: 0.06665, Adjusted R-squared: 0.06313
## F-statistic: 18.93 on 23 and 6097 DF, p-value: < 2.2e-16