Data Prep

Using a function created by groupmate David Blumenstiel I will fetch, clean and prep the data for this assignment.

Data Exploration

As the first step in data exploration I use the skim function form the skimr package. This shows missing data, mean, percentiles and a histogram of the distribution of all of the data fields all in one output.

We can see there are no fields with missing data so we will not have to address this issue

Next we will use the funModeling package to produce a quick correlation table with the target variable to determine if there are any noteworthy features in the model. We can see that there really aren’t any stron correlations. The only correlation over 0.20 absolute value is MVR_PTS.

##   TARGET_FLAG       TARGET_AMT     KIDSDRIV      AGE        HOMEKIDS
##  Min.   :0.0000   Min.   :     0   0:7180   Min.   :16.00   0:5289  
##  1st Qu.:0.0000   1st Qu.:     0   1: 636   1st Qu.:39.00   1: 902  
##  Median :0.0000   Median :     0   2: 279   Median :45.00   2:1118  
##  Mean   :0.2638   Mean   :  1504   3:  62   Mean   :44.79   3: 674  
##  3rd Qu.:1.0000   3rd Qu.:  1036   4:   4   3rd Qu.:51.00   4: 164  
##  Max.   :1.0000   Max.   :107586            Max.   :81.00   5:  14  
##       YOJ            INCOME         PARENT1             HOME_VAL     
##  Min.   : 0.00   Min.   :     0   Length:8161        Min.   :     0  
##  1st Qu.: 9.00   1st Qu.: 27964   Class :character   1st Qu.:     0  
##  Median :12.00   Median : 54005   Mode  :character   Median :151957  
##  Mean   :10.53   Mean   : 60952                      Mean   :146062  
##  3rd Qu.:13.00   3rd Qu.: 83464                      3rd Qu.:233352  
##  Max.   :23.00   Max.   :367030                      Max.   :885282  
##    MSTATUS              SEX             EDUCATION             JOB           
##  Length:8161        Length:8161        Length:8161        Length:8161       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     TRAVTIME        CAR_USE             BLUEBOOK          TIF        
##  Min.   :  5.00   Length:8161        Min.   : 1500   Min.   : 1.000  
##  1st Qu.: 22.00   Class :character   1st Qu.: 9280   1st Qu.: 1.000  
##  Median : 33.00   Mode  :character   Median :14440   Median : 4.000  
##  Mean   : 33.49                      Mean   :15710   Mean   : 5.351  
##  3rd Qu.: 44.00                      3rd Qu.:20850   3rd Qu.: 7.000  
##  Max.   :142.00                      Max.   :69740   Max.   :25.000  
##    CAR_TYPE           RED_CAR             OLDCLAIM     CLM_FREQ
##  Length:8161        Length:8161        Min.   :    0   0:5009  
##  Class :character   Class :character   1st Qu.:    0   1: 997  
##  Mode  :character   Mode  :character   Median :    0   2:1171  
##                                        Mean   : 4037   3: 776  
##                                        3rd Qu.: 4636   4: 190  
##                                        Max.   :57037   5:  18  
##    REVOKED             MVR_PTS          CAR_AGE        URBANICITY       
##  Length:8161        Min.   : 0.000   Min.   : 0.000   Length:8161       
##  Class :character   1st Qu.: 0.000   1st Qu.: 4.000   Class :character  
##  Mode  :character   Median : 1.000   Median : 8.000   Mode  :character  
##                     Mean   : 1.696   Mean   : 8.337                     
##                     3rd Qu.: 3.000   3rd Qu.:12.000                     
##                     Max.   :13.000   Max.   :28.000
Data summary
Name train
Number of rows 8161
Number of columns 25
_______________________
Column type frequency:
character 10
factor 3
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
PARENT1 0 1 2 3 0 2 0
MSTATUS 0 1 3 4 0 2 0
SEX 0 1 1 3 0 2 0
EDUCATION 0 1 3 13 0 5 0
JOB 0 1 6 13 0 10 0
CAR_USE 0 1 7 10 0 2 0
CAR_TYPE 0 1 3 11 0 6 0
RED_CAR 0 1 2 3 0 2 0
REVOKED 0 1 2 3 0 2 0
URBANICITY 0 1 19 21 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
KIDSDRIV 0 1 FALSE 5 0: 7180, 1: 636, 2: 279, 3: 62
HOMEKIDS 0 1 FALSE 6 0: 5289, 2: 1118, 1: 902, 3: 674
CLM_FREQ 0 1 FALSE 6 0: 5009, 2: 1171, 1: 997, 3: 776

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
TARGET_FLAG 0 1 0.26 0.44 0 0 0 1 1.0 ▇▁▁▁▃
TARGET_AMT 0 1 1504.32 4704.03 0 0 0 1036 107586.1 ▇▁▁▁▁
AGE 0 1 44.79 8.62 16 39 45 51 81.0 ▁▆▇▂▁
YOJ 0 1 10.53 4.01 0 9 12 13 23.0 ▂▃▇▃▁
INCOME 0 1 60952.00 47030.19 0 27964 54005 83464 367030.0 ▇▃▁▁▁
HOME_VAL 0 1 146062.19 130426.72 0 0 151957 233352 885282.0 ▇▅▁▁▁
TRAVTIME 0 1 33.49 15.91 5 22 33 44 142.0 ▇▇▁▁▁
BLUEBOOK 0 1 15709.90 8419.73 1500 9280 14440 20850 69740.0 ▇▆▁▁▁
TIF 0 1 5.35 4.15 1 1 4 7 25.0 ▇▆▁▁▁
OLDCLAIM 0 1 4037.08 8777.14 0 0 0 4636 57037.0 ▇▁▁▁▁
MVR_PTS 0 1 1.70 2.15 0 0 1 3 13.0 ▇▂▁▁▁
CAR_AGE 0 1 8.34 5.53 0 4 8 12 28.0 ▆▇▃▂▁
##      Variable TARGET_AMT
## 1  TARGET_AMT       1.00
## 2     MVR_PTS       0.14
## 3    OLDCLAIM       0.07
## 4    TRAVTIME       0.03
## 5    BLUEBOOK       0.00
## 6         YOJ      -0.02
## 7         AGE      -0.04
## 8         TIF      -0.05
## 9      INCOME      -0.06
## 10    CAR_AGE      -0.06
## 11   HOME_VAL      -0.08

Testing

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(pROC)
library(caret)

train$MVR_CLM<-train$MVR_PTS*as.numeric(train$CLM_FREQ)
train$INC_CAR<-train$INCOME/as.numeric(train$BLUEBOOK)
if (train$AGE <30){train$YOUNG<-1} else {train$YOUNG<-0}
## Warning in if (train$AGE < 30) {: the condition has length > 1 and only the
## first element will be used
if (train$AGE >55){train$RETIRE<-1} else {train$RETIRE<-0}
## Warning in if (train$AGE > 55) {: the condition has length > 1 and only the
## first element will be used
if (train$REVOKED =="Yes"){train$REVOKED<-1} else {train$REVOKED<-0}
## Warning in if (train$REVOKED == "Yes") {: the condition has length > 1 and only
## the first element will be used
train$MVR_REV<-train$REVOKED*train$MVR_PTS

if (train$RED_CAR == "yes" & train$CAR_TYPE == "Sports Car"){train$RED_SPORT<-1} else {train$RED_SPORT<-0}
## Warning in if (train$RED_CAR == "yes" & train$CAR_TYPE == "Sports Car") {: the
## condition has length > 1 and only the first element will be used
#train$RED_CAR <- replace(train$RED_CAR, "no", 0)

train$TARGET_FLAG<-as.factor(train$TARGET_FLAG)

train$KIDSDRIV<-as.numeric(train$KIDSDRIV)
train$HOMEKIDS<-as.numeric(train$HOMEKIDS)
train$CLM_FREQ<-as.numeric(train$CLM_FREQ)

set.seed(1005)
training_partition <- createDataPartition(train$TARGET_FLAG, p=0.75, list = FALSE, times=1)
train <- train[training_partition, ]
test <- train[-training_partition, ]


step <- glm(TARGET_FLAG~KIDSDRIV+AGE+HOMEKIDS+YOJ+INCOME+PARENT1+HOME_VAL+MSTATUS+SEX+EDUCATION+TRAVTIME+CAR_USE+BLUEBOOK+TIF+CAR_TYPE+RED_CAR+OLDCLAIM+CLM_FREQ+REVOKED+MVR_PTS+CAR_AGE+URBANICITY+MVR_CLM+INC_CAR#+YOUNG+RETIRE+MVR_REV+RED_SPORT
            , data = train, family="binomial") %>%
  stepAIC(trace = FALSE)
summary(step)
## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + 
##     MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + TIF + 
##     CAR_TYPE + CLM_FREQ + MVR_PTS + URBANICITY + MVR_CLM, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3245  -0.7329  -0.4136   0.7014   3.1564  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -1.488e+00  2.080e-01  -7.154 8.43e-13 ***
## KIDSDRIV                         4.322e-01  6.371e-02   6.783 1.18e-11 ***
## INCOME                          -4.906e-06  1.105e-06  -4.442 8.92e-06 ***
## PARENT1Yes                       4.842e-01  1.076e-01   4.498 6.85e-06 ***
## HOME_VAL                        -1.113e-06  3.563e-07  -3.122  0.00180 ** 
## MSTATUSz_No                      5.138e-01  8.792e-02   5.844 5.11e-09 ***
## EDUCATIONBachelors              -6.173e-01  1.113e-01  -5.548 2.89e-08 ***
## EDUCATIONMasters                -6.114e-01  1.255e-01  -4.872 1.11e-06 ***
## EDUCATIONPhD                    -4.687e-01  1.665e-01  -2.815  0.00488 ** 
## EDUCATIONz_High School          -6.413e-02  1.042e-01  -0.615  0.53846    
## TRAVTIME                         1.528e-02  2.131e-03   7.168 7.61e-13 ***
## CAR_USEPrivate                  -8.059e-01  8.343e-02  -9.659  < 2e-16 ***
## BLUEBOOK                        -1.657e-05  5.311e-06  -3.120  0.00181 ** 
## TIF                             -5.982e-02  8.399e-03  -7.123 1.06e-12 ***
## CAR_TYPEPanel Truck              4.066e-01  1.664e-01   2.444  0.01453 *  
## CAR_TYPEPickup                   5.749e-01  1.126e-01   5.108 3.26e-07 ***
## CAR_TYPESports Car               1.058e+00  1.211e-01   8.737  < 2e-16 ***
## CAR_TYPEVan                      6.039e-01  1.369e-01   4.412 1.02e-05 ***
## CAR_TYPEz_SUV                    7.965e-01  9.754e-02   8.166 3.19e-16 ***
## CLM_FREQ                         2.532e-01  3.849e-02   6.579 4.74e-11 ***
## MVR_PTS                          2.391e-01  3.480e-02   6.870 6.42e-12 ***
## URBANICITYz_Highly Rural/ Rural -2.358e+00  1.297e-01 -18.183  < 2e-16 ***
## MVR_CLM                         -5.358e-02  1.274e-02  -4.205 2.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7064.1  on 6120  degrees of freedom
## Residual deviance: 5593.8  on 6098  degrees of freedom
## AIC: 5639.8
## 
## Number of Fisher Scoring iterations: 5
test$predictions<-predict(step, test, type="response")
test$predicted =  as.factor(ifelse(test$predictions >= 0.5, 1, 0))

confusionMatrix(test$predicted, test$TARGET_FLAG, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1054  244
##          1   86  155
##                                           
##                Accuracy : 0.7856          
##                  95% CI : (0.7642, 0.8058)
##     No Information Rate : 0.7407          
##     P-Value [Acc > NIR] : 2.442e-05       
##                                           
##                   Kappa : 0.3593          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.3885          
##             Specificity : 0.9246          
##          Pos Pred Value : 0.6432          
##          Neg Pred Value : 0.8120          
##              Prevalence : 0.2593          
##          Detection Rate : 0.1007          
##    Detection Prevalence : 0.1566          
##       Balanced Accuracy : 0.6565          
##                                           
##        'Positive' Class : 1               
## 
proc = roc(test$TARGET_FLAG, test$predictions)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(proc)

print(proc$auc)
## Area under the curve: 0.8067
#strain_amt

stepl <- lm(TARGET_AMT~KIDSDRIV+AGE+HOMEKIDS+YOJ+INCOME+PARENT1+HOME_VAL+MSTATUS+SEX+EDUCATION+TRAVTIME+CAR_USE+BLUEBOOK+TIF+CAR_TYPE+RED_CAR+OLDCLAIM+CLM_FREQ+REVOKED+MVR_PTS+CAR_AGE+URBANICITY+MVR_CLM+INC_CAR#+YOUNG+RETIRE+MVR_REV+RED_SPORT
            , data = train) %>%
  stepAIC(trace = FALSE)
summary(stepl)
## 
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + 
##     MSTATUS + SEX + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + CLM_FREQ + MVR_PTS + URBANICITY + MVR_CLM, 
##     data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5210  -1674   -763    330  83355 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.049e+03  3.578e+02   2.932 0.003378 ** 
## KIDSDRIV                         3.563e+02  1.153e+02   3.092 0.001998 ** 
## INCOME                          -5.454e-03  1.864e-03  -2.925 0.003455 ** 
## PARENT1Yes                       6.732e+02  1.996e+02   3.372 0.000750 ***
## HOME_VAL                        -1.315e-03  6.085e-04  -2.162 0.030669 *  
## MSTATUSz_No                      4.665e+02  1.512e+02   3.085 0.002047 ** 
## SEXz_F                          -3.972e+02  1.795e+02  -2.212 0.026971 *  
## EDUCATIONBachelors              -5.935e+02  1.925e+02  -3.083 0.002058 ** 
## EDUCATIONMasters                -5.722e+02  2.155e+02  -2.655 0.007954 ** 
## EDUCATIONPhD                    -3.272e+02  2.840e+02  -1.152 0.249412    
## EDUCATIONz_High School          -2.059e+02  1.853e+02  -1.111 0.266633    
## TRAVTIME                         1.041e+01  3.588e+00   2.901 0.003730 ** 
## CAR_USEPrivate                  -7.319e+02  1.457e+02  -5.022 5.25e-07 ***
## BLUEBOOK                         2.835e-02  9.563e-03   2.965 0.003043 ** 
## TIF                             -4.916e+01  1.360e+01  -3.613 0.000305 ***
## CAR_TYPEPanel Truck              1.039e+02  2.966e+02   0.350 0.726030    
## CAR_TYPEPickup                   3.107e+02  1.865e+02   1.666 0.095771 .  
## CAR_TYPESports Car               1.224e+03  2.428e+02   5.039 4.81e-07 ***
## CAR_TYPEVan                      3.292e+02  2.350e+02   1.401 0.161381    
## CAR_TYPEz_SUV                    8.868e+02  2.003e+02   4.428 9.66e-06 ***
## CLM_FREQ                         2.112e+02  7.164e+01   2.949 0.003202 ** 
## MVR_PTS                          2.731e+02  6.198e+01   4.405 1.07e-05 ***
## URBANICITYz_Highly Rural/ Rural -1.596e+03  1.526e+02 -10.453  < 2e-16 ***
## MVR_CLM                         -5.205e+01  2.355e+01  -2.210 0.027160 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4414 on 6097 degrees of freedom
## Multiple R-squared:  0.06665,    Adjusted R-squared:  0.06313 
## F-statistic: 18.93 on 23 and 6097 DF,  p-value: < 2.2e-16