Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided).

1. DATA EXPLORATION

Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job.

Data acquisition

train = read.csv("https://raw.githubusercontent.com/miachen410/DATA621/master/HW%234/insurance_training_data.csv")

Data structure

There are 8161 observations and 26 variables in the training dataset.

dim(train)
## [1] 8161   26

We want to get rid of the $ and , in the numerical data and z_ in the categorical data:

currencyconv = function(input) {
  out = sub("\\$", "", input)
  out = as.numeric(sub(",", "", out))
  return(out)
}
# Replace spaces with underscores
underscore = function(input) {
  out = sub(" ", "_", input)
  return(out)
}
train = as.tbl(train) %>% 
  mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM"),
            currencyconv) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            underscore) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            as.factor) %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG))

Let’s look at the data structure again:

summary(train) %>% kable() %>% kable_styling()
INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE URBANICITY
Min. : 1 0:6008 Min. : 0 Min. :0.0000 Min. :16.00 Min. :0.0000 Min. : 0.0 Min. : 0 No :7084 Min. : 0 Yes :4894 M :3786 <High_School :1203 z_Blue_Collar:1825 Min. : 5.00 Commercial:3029 Min. : 1500 Min. : 1.000 Minivan :2145 no :5783 Min. : 0 Min. :0.0000 No :7161 Min. : 0.000 Min. :-3.000 Highly_Urban/ Urban :6492
1st Qu.: 2559 1:2153 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28097 Yes:1077 1st Qu.: 0 z_No:3267 z_F:4375 Bachelors :2242 Clerical :1271 1st Qu.: 22.00 Private :5132 1st Qu.: 9280 1st Qu.: 1.000 Panel_Truck: 676 yes:2378 1st Qu.: 0 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.000 z_Highly_Rural/ Rural:1669
Median : 5133 NA Median : 0 Median :0.0000 Median :45.00 Median :0.0000 Median :11.0 Median : 54028 NA Median :161160 NA NA Masters :1658 Professional :1117 Median : 33.00 NA Median :14440 Median : 4.000 Pickup :1389 NA Median : 0 Median :0.0000 NA Median : 1.000 Median : 8.000 NA
Mean : 5152 NA Mean : 1504 Mean :0.1711 Mean :44.79 Mean :0.7212 Mean :10.5 Mean : 61898 NA Mean :154867 NA NA PhD : 728 Manager : 988 Mean : 33.49 NA Mean :15710 Mean : 5.351 Sports_Car : 907 NA Mean : 4037 Mean :0.7986 NA Mean : 1.696 Mean : 8.328 NA
3rd Qu.: 7745 NA 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85986 NA 3rd Qu.:238724 NA NA z_High_School:2330 Lawyer : 835 3rd Qu.: 44.00 NA 3rd Qu.:20850 3rd Qu.: 7.000 Van : 750 NA 3rd Qu.: 4636 3rd Qu.:2.0000 NA 3rd Qu.: 3.000 3rd Qu.:12.000 NA
Max. :10302 NA Max. :107586 Max. :4.0000 Max. :81.00 Max. :5.0000 Max. :23.0 Max. :367030 NA Max. :885282 NA NA NA Student : 712 Max. :142.00 NA Max. :69740 Max. :25.000 z_SUV :2294 NA Max. :57037 Max. :5.0000 NA Max. :13.000 Max. :28.000 NA
NA NA NA NA NA’s :6 NA NA’s :454 NA’s :445 NA NA’s :464 NA NA NA (Other) :1413 NA NA NA NA NA NA NA NA NA NA NA’s :510 NA
sapply(train, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
x
INDEX 0
TARGET_FLAG 0
TARGET_AMT 0
KIDSDRIV 0
AGE 6
HOMEKIDS 0
YOJ 454
INCOME 445
PARENT1 0
HOME_VAL 464
MSTATUS 0
SEX 0
EDUCATION 0
JOB 0
TRAVTIME 0
CAR_USE 0
BLUEBOOK 0
TIF 0
CAR_TYPE 0
RED_CAR 0
OLDCLAIM 0
CLM_FREQ 0
REVOKED 0
MVR_PTS 0
CAR_AGE 510
URBANICITY 0

Visulization of the data set

Let’s first look at the density plots of the numerical variables to view their shapes and distributions:

ntrain<-select_if(train, is.numeric)
ntrain %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_density()  
## Warning: Removed 1879 rows containing non-finite values (stat_density).

2. DATA PREPARATION

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this.

Missing values

There are 970 rows of data with NA values. We are going to replacing them with their median values.

# impute data for missing values
# use column mean for calculation
train$AGE[is.na(train$AGE)] <- mean(train$AGE, na.rm=TRUE)
train$YOJ[is.na(train$YOJ)] <- mean(train$YOJ, na.rm=TRUE)
train$HOME_VAL[is.na(train$HOME_VAL)] <- mean(train$HOME_VAL, na.rm=TRUE)
train$CAR_AGE[is.na(train$CAR_AGE)] <- mean(train$CAR_AGE, na.rm=TRUE)
train$INCOME[is.na(train$INCOME)] <- mean(train$INCOME, na.rm=TRUE)
#get complete cases
train <- train[complete.cases(train),]
train2<-train
train <- train[, !(colnames(train) %in% c("INDEX"))]
# 
# #create variable
# train$new <- train$tax / (train$medv*10)
# 
trainnum <- dplyr::select_if(train, is.numeric)
rcorr(as.matrix(trainnum))
##            TARGET_AMT KIDSDRIV   AGE HOMEKIDS   YOJ INCOME HOME_VAL TRAVTIME
## TARGET_AMT       1.00     0.06 -0.04     0.07 -0.02  -0.06    -0.08     0.03
## KIDSDRIV         0.06     1.00 -0.08     0.49  0.05  -0.05    -0.02     0.01
## AGE             -0.04    -0.08  1.00    -0.47  0.13   0.18     0.20     0.01
## HOMEKIDS         0.07     0.49 -0.47     1.00  0.08  -0.16    -0.11    -0.01
## YOJ             -0.02     0.05  0.13     0.08  1.00   0.27     0.26    -0.02
## INCOME          -0.06    -0.05  0.18    -0.16  0.27   1.00     0.54    -0.05
## HOME_VAL        -0.08    -0.02  0.20    -0.11  0.26   0.54     1.00    -0.03
## TRAVTIME         0.03     0.01  0.01    -0.01 -0.02  -0.05    -0.03     1.00
## BLUEBOOK         0.00    -0.02  0.16    -0.11  0.14   0.42     0.25    -0.02
## TIF             -0.05    -0.01  0.00     0.00  0.02  -0.01     0.00    -0.01
## OLDCLAIM         0.13     0.05 -0.04     0.05 -0.02  -0.07    -0.11    -0.01
## CLM_FREQ         0.13     0.04 -0.03     0.04 -0.02  -0.05    -0.10     0.00
## MVR_PTS          0.13     0.05 -0.07     0.06 -0.03  -0.05    -0.07     0.01
## CAR_AGE         -0.06    -0.05  0.17    -0.15  0.06   0.39     0.20    -0.04
##            BLUEBOOK   TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_AMT     0.00 -0.05     0.13     0.13    0.13   -0.06
## KIDSDRIV      -0.02 -0.01     0.05     0.04    0.05   -0.05
## AGE            0.16  0.00    -0.04    -0.03   -0.07    0.17
## HOMEKIDS      -0.11  0.00     0.05     0.04    0.06   -0.15
## YOJ            0.14  0.02    -0.02    -0.02   -0.03    0.06
## INCOME         0.42 -0.01    -0.07    -0.05   -0.05    0.39
## HOME_VAL       0.25  0.00    -0.11    -0.10   -0.07    0.20
## TRAVTIME      -0.02 -0.01    -0.01     0.00    0.01   -0.04
## BLUEBOOK       1.00 -0.01    -0.04    -0.04   -0.04    0.18
## TIF           -0.01  1.00    -0.03    -0.02   -0.04    0.00
## OLDCLAIM      -0.04 -0.03     1.00     0.93    0.44   -0.02
## CLM_FREQ      -0.04 -0.02     0.93     1.00    0.41   -0.01
## MVR_PTS       -0.04 -0.04     0.44     0.41    1.00   -0.01
## CAR_AGE        0.18  0.00    -0.02    -0.01   -0.01    1.00
## 
## n= 8161 
## 
## 
## P
##            TARGET_AMT KIDSDRIV AGE    HOMEKIDS YOJ    INCOME HOME_VAL TRAVTIME
## TARGET_AMT            0.0000   0.0002 0.0000   0.0585 0.0000 0.0000   0.0115  
## KIDSDRIV   0.0000              0.0000 0.0000   0.0000 0.0000 0.0577   0.5499  
## AGE        0.0002     0.0000          0.0000   0.0000 0.0000 0.0000   0.6342  
## HOMEKIDS   0.0000     0.0000   0.0000          0.0000 0.0000 0.0000   0.4230  
## YOJ        0.0585     0.0000   0.0000 0.0000          0.0000 0.0000   0.1362  
## INCOME     0.0000     0.0000   0.0000 0.0000   0.0000        0.0000   0.0000  
## HOME_VAL   0.0000     0.0577   0.0000 0.0000   0.0000 0.0000          0.0018  
## TRAVTIME   0.0115     0.5499   0.6342 0.4230   0.1362 0.0000 0.0018           
## BLUEBOOK   0.6712     0.0415   0.0000 0.0000   0.0000 0.0000 0.0000   0.1246  
## TIF        0.0000     0.3832   0.9404 0.6725   0.0498 0.4889 0.7280   0.2945  
## OLDCLAIM   0.0000     0.0000   0.0004 0.0000   0.0987 0.0000 0.0000   0.6009  
## CLM_FREQ   0.0000     0.0000   0.0054 0.0002   0.0272 0.0000 0.0000   0.7501  
## MVR_PTS    0.0000     0.0000   0.0000 0.0000   0.0033 0.0000 0.0000   0.5405  
## CAR_AGE    0.0000     0.0000   0.0000 0.0000   0.0000 0.0000 0.0000   0.0009  
##            BLUEBOOK TIF    OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_AMT 0.6712   0.0000 0.0000   0.0000   0.0000  0.0000 
## KIDSDRIV   0.0415   0.3832 0.0000   0.0000   0.0000  0.0000 
## AGE        0.0000   0.9404 0.0004   0.0054   0.0000  0.0000 
## HOMEKIDS   0.0000   0.6725 0.0000   0.0002   0.0000  0.0000 
## YOJ        0.0000   0.0498 0.0987   0.0272   0.0033  0.0000 
## INCOME     0.0000   0.4889 0.0000   0.0000   0.0000  0.0000 
## HOME_VAL   0.0000   0.7280 0.0000   0.0000   0.0000  0.0000 
## TRAVTIME   0.1246   0.2945 0.6009   0.7501   0.5405  0.0009 
## BLUEBOOK            0.5420 0.0001   0.0003   0.0007  0.0000 
## TIF        0.5420          0.0147   0.0408   0.0006  0.9927 
## OLDCLAIM   0.0001   0.0147          0.0000   0.0000  0.0787 
## CLM_FREQ   0.0003   0.0408 0.0000            0.0000  0.2247 
## MVR_PTS    0.0007   0.0006 0.0000   0.0000           0.4250 
## CAR_AGE    0.0000   0.9927 0.0787   0.2247   0.4250
corrplot(cor(trainnum), method="square")

cor.test(trainnum$HOMEKIDS,trainnum$AGE,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  trainnum$HOMEKIDS and trainnum$AGE
## t = -48.338, df = 8159, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4885252 -0.4547891
## sample estimates:
##        cor 
## -0.4718298
train2<-train

3. BUILD MODELS

Using the training data set, build at least two different multiple linear regression models and three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.

Discuss the coefficients in the models, do they make sense? For example, if a person has a lot of traffic tickets, you would reasonably expect that person to have more car crashes. If the coefficient is negative (suggesting that the person is a safer driver), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

Binary Logistic Regression Models

We first create a full model by including all the variables.

Coefficients (+ or -) of variables with significant p-values:

  • KIDSDRIV (+): When teenagers drive your car, the car is more likely to get into crashes

  • INCOME (-): Rich people are less likely to get into crashes

  • PARENT1/yes (+): Single parent is more likely to get into crashes

  • HOME_VAL (-): Home owners tend to drive more responsibly

  • MSTATUS/yes (-): Married people tend to drive more safely

  • EDUCATION/bachelor, master, phd (-): More educated people tend to drive more safely

  • JOB/blue collar, clerical (+): Blue collar and clerical workers are more likely to get into crashes

  • JOB/manager (-): Manegements tend to drive more safely

  • TRAVTIME (+): Long drives to work suggest greater risk

  • CAR_USE/private (-): Private cars are being driving less than commericial cars, thus the probability of collision is lower

  • BLUEBOOK (-): Unknown effect on probability of collision, but probably effect the payout if there is a crash

  • TIF (-): People who have been customers for a long time are usually more safe

  • CAR_TYPE/panel truck, pickup, sports car, suv, van (+): Sports car has the highest coefficient, more likely to get into a car crash

  • CLM_FREQ (+): The more claims you filed in the past 5 years, the more you are likely to file in the future

  • REVOKED/yes (+): If your license was revoked in the past 7 years, you probably are a more risky driver

  • MVR_PTS (+): If you get lots of traffic tickets, you tend to get into more crashes

  • URBANICITY/highly urban, urban (+): If you live in the city, you are more likely to get into a crash

#MODEL 1
logit <- glm(formula = TARGET_FLAG ~ . - TARGET_AMT, data=train, family = "binomial" (link="logit"))
summary(logit)
## 
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5262  -0.7180  -0.3983   0.6545   3.1455  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -7.942e-01  3.293e-01  -2.412 0.015880 *  
## KIDSDRIV                         6.821e-01  1.103e-01   6.185 6.21e-10 ***
## AGE                              4.736e-05  4.078e-03   0.012 0.990734    
## HOMEKIDS                         1.513e-01  8.300e-02   1.823 0.068320 .  
## YOJ                             -1.353e-02  8.578e-03  -1.577 0.114756    
## INCOME                          -3.457e-06  1.076e-06  -3.212 0.001317 ** 
## PARENT1Yes                       3.295e-01  1.144e-01   2.881 0.003970 ** 
## HOME_VAL                        -1.323e-06  3.419e-07  -3.871 0.000109 ***
## MSTATUSz_No                      5.146e-01  8.493e-02   6.059 1.37e-09 ***
## SEXz_F                          -8.929e-02  1.120e-01  -0.797 0.425327    
## EDUCATIONBachelors              -3.720e-01  1.154e-01  -3.223 0.001267 ** 
## EDUCATIONMasters                -2.803e-01  1.785e-01  -1.570 0.116405    
## EDUCATIONPhD                    -1.496e-01  2.135e-01  -0.701 0.483401    
## EDUCATIONz_High_School           2.111e-02  9.487e-02   0.222 0.823945    
## JOBClerical                      3.986e-01  1.963e-01   2.030 0.042359 *  
## JOBDoctor                       -4.227e-01  2.662e-01  -1.588 0.112286    
## JOBHome_Maker                    2.049e-01  2.099e-01   0.976 0.328988    
## JOBLawyer                        1.172e-01  1.693e-01   0.692 0.488652    
## JOBManager                      -5.616e-01  1.712e-01  -3.280 0.001038 ** 
## JOBProfessional                  1.673e-01  1.782e-01   0.939 0.347724    
## JOBStudent                       2.038e-01  2.140e-01   0.953 0.340799    
## JOBz_Blue_Collar                 3.101e-01  1.853e-01   1.674 0.094190 .  
## TRAVTIME                         1.483e-02  1.880e-03   7.890 3.02e-15 ***
## CAR_USEPrivate                  -7.604e-01  9.172e-02  -8.291  < 2e-16 ***
## BLUEBOOK                        -2.079e-05  5.255e-06  -3.956 7.63e-05 ***
## TIF                             -3.257e-01  4.138e-02  -7.869 3.56e-15 ***
## CAR_TYPEPanel_Truck              5.701e-01  1.613e-01   3.533 0.000410 ***
## CAR_TYPEPickup                   5.578e-01  1.007e-01   5.540 3.03e-08 ***
## CAR_TYPESports_Car               1.031e+00  1.298e-01   7.942 2.00e-15 ***
## CAR_TYPEVan                      6.158e-01  1.264e-01   4.872 1.10e-06 ***
## CAR_TYPEz_SUV                    7.787e-01  1.111e-01   7.007 2.43e-12 ***
## RED_CARyes                      -5.766e-03  8.631e-02  -0.067 0.946741    
## OLDCLAIM                         6.763e-03  1.697e-02   0.398 0.690300    
## CLM_FREQ                         3.160e-01  1.277e-01   2.474 0.013363 *  
## REVOKEDYes                       7.242e-01  8.184e-02   8.850  < 2e-16 ***
## MVR_PTS                          2.808e-01  4.202e-02   6.682 2.35e-11 ***
## CAR_AGE                         -1.807e-03  7.530e-03  -0.240 0.810372    
## URBANICITYz_Highly_Rural/ Rural -2.371e+00  1.130e-01 -20.989  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7308.4  on 8123  degrees of freedom
## AIC: 7384.4
## 
## Number of Fisher Scoring iterations: 5
exp(logit$coefficients)
##                     (Intercept)                        KIDSDRIV 
##                      0.45194170                      1.97796637 
##                             AGE                        HOMEKIDS 
##                      1.00004736                      1.16334008 
##                             YOJ                          INCOME 
##                      0.98656280                      0.99999654 
##                      PARENT1Yes                        HOME_VAL 
##                      1.39023185                      0.99999868 
##                     MSTATUSz_No                          SEXz_F 
##                      1.67300058                      0.91457630 
##              EDUCATIONBachelors                EDUCATIONMasters 
##                      0.68933598                      0.75559308 
##                    EDUCATIONPhD          EDUCATIONz_High_School 
##                      0.86104231                      1.02133113 
##                     JOBClerical                       JOBDoctor 
##                      1.48969494                      0.65526320 
##                   JOBHome_Maker                       JOBLawyer 
##                      1.22734502                      1.12438495 
##                      JOBManager                 JOBProfessional 
##                      0.57030294                      1.18213072 
##                      JOBStudent                JOBz_Blue_Collar 
##                      1.22609195                      1.36361825 
##                        TRAVTIME                  CAR_USEPrivate 
##                      1.01494543                      0.46747121 
##                        BLUEBOOK                             TIF 
##                      0.99997921                      0.72204512 
##             CAR_TYPEPanel_Truck                  CAR_TYPEPickup 
##                      1.76835639                      1.74677458 
##              CAR_TYPESports_Car                     CAR_TYPEVan 
##                      2.80272210                      1.85105526 
##                   CAR_TYPEz_SUV                      RED_CARyes 
##                      2.17863803                      0.99425108 
##                        OLDCLAIM                        CLM_FREQ 
##                      1.00678601                      1.37166999 
##                      REVOKEDYes                         MVR_PTS 
##                      2.06310705                      1.32420339 
##                         CAR_AGE URBANICITYz_Highly_Rural/ Rural 
##                      0.99819493                      0.09334636
logitscalar <- mean(dlogis(predict(logit, type = "link")))
logitscalar * coef(logit)
##                     (Intercept)                        KIDSDRIV 
##                   -1.158016e-01                    9.945167e-02 
##                             AGE                        HOMEKIDS 
##                    6.904809e-06                    2.206017e-02 
##                             YOJ                          INCOME 
##                   -1.972543e-03                   -5.040064e-07 
##                      PARENT1Yes                        HOME_VAL 
##                    4.803969e-02                   -1.929523e-07 
##                     MSTATUSz_No                          SEXz_F 
##                    7.503592e-02                   -1.301990e-02 
##              EDUCATIONBachelors                EDUCATIONMasters 
##                   -5.424472e-02                   -4.086324e-02 
##                    EDUCATIONPhD          EDUCATIONz_High_School 
##                   -2.181469e-02                    3.077558e-03 
##                     JOBClerical                       JOBDoctor 
##                    5.811519e-02                   -6.163603e-02 
##                   JOBHome_Maker                       JOBLawyer 
##                    2.986941e-02                    1.709406e-02 
##                      JOBManager                 JOBProfessional 
##                   -8.188439e-02                    2.439650e-02 
##                      JOBStudent                JOBz_Blue_Collar 
##                    2.972047e-02                    4.522137e-02 
##                        TRAVTIME                  CAR_USEPrivate 
##                    2.163050e-03                   -1.108755e-01 
##                        BLUEBOOK                             TIF 
##                   -3.030737e-06                   -4.748519e-02 
##             CAR_TYPEPanel_Truck                  CAR_TYPEPickup 
##                    8.311836e-02                    8.132789e-02 
##              CAR_TYPESports_Car                     CAR_TYPEVan 
##                    1.502692e-01                    8.978260e-02 
##                   CAR_TYPEz_SUV                      RED_CARyes 
##                    1.135413e-01                   -8.406609e-04 
##                        OLDCLAIM                        CLM_FREQ 
##                    9.861172e-04                    4.607979e-02 
##                      REVOKEDYes                         MVR_PTS 
##                    1.055966e-01                    4.094471e-02 
##                         CAR_AGE URBANICITYz_Highly_Rural/ Rural 
##                   -2.634331e-04                   -3.457765e-01
confint.default(logit)
##                                         2.5 %        97.5 %
## (Intercept)                     -1.439652e+00 -1.487526e-01
## KIDSDRIV                         4.659328e-01  8.982056e-01
## AGE                             -7.944409e-03  8.039120e-03
## HOMEKIDS                        -1.137668e-02  3.139672e-01
## YOJ                             -3.034002e-02  3.283437e-03
## INCOME                          -5.565742e-06 -1.347510e-06
## PARENT1Yes                       1.052923e-01  5.536488e-01
## HOME_VAL                        -1.993393e-06 -6.532553e-07
## MSTATUSz_No                      3.481541e-01  6.810835e-01
## SEXz_F                          -3.088261e-01  1.302374e-01
## EDUCATIONBachelors              -5.982429e-01 -1.458101e-01
## EDUCATIONMasters                -6.301049e-01  6.960026e-02
## EDUCATIONPhD                    -5.680126e-01  2.687893e-01
## EDUCATIONz_High_School          -1.648411e-01  2.070547e-01
## JOBClerical                      1.374718e-02  7.833955e-01
## JOBDoctor                       -9.444515e-01  9.901495e-02
## JOBHome_Maker                   -2.064601e-01  6.161667e-01
## JOBLawyer                       -2.145962e-01  4.490685e-01
## JOBManager                      -8.971700e-01 -2.260052e-01
## JOBProfessional                 -1.819185e-01  5.165555e-01
## JOBStudent                      -2.155551e-01  6.232188e-01
## JOBz_Blue_Collar                -5.304572e-02  6.733290e-01
## TRAVTIME                         1.114967e-02  1.852002e-02
## CAR_USEPrivate                  -9.401775e-01 -5.806575e-01
## BLUEBOOK                        -3.108426e-05 -1.048713e-05
## TIF                             -4.067786e-01 -2.445566e-01
## CAR_TYPEPanel_Truck              2.538386e-01  8.862624e-01
## CAR_TYPEPickup                   3.604241e-01  7.551178e-01
## CAR_TYPESports_Car               7.762439e-01  1.284938e+00
## CAR_TYPEVan                      3.680626e-01  8.634492e-01
## CAR_TYPEz_SUV                    5.608898e-01  9.965101e-01
## RED_CARyes                      -1.749296e-01  1.633986e-01
## OLDCLAIM                        -2.650452e-02  4.003069e-02
## CLM_FREQ                         6.565860e-02  5.663993e-01
## REVOKEDYes                       5.638194e-01  8.846068e-01
## MVR_PTS                          1.984459e-01  3.631762e-01
## CAR_AGE                         -1.656445e-02  1.295105e-02
## URBANICITYz_Highly_Rural/ Rural -2.592883e+00 -2.149994e+00
predlogit <- predict(logit, type="response")
train2$pred1 <- predict(logit, type="response")
summary(predlogit)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002449 0.077438 0.201727 0.263816 0.403524 0.958860
table(true = train$TARGET_FLAG, pred = round(fitted(logit)))
##     pred
## true    0    1
##    0 5532  476
##    1 1251  902
#plots for Model 1
par(mfrow=c(2,2))
plot(logit)

data.frame(train2$pred1) %>%
    ggplot(aes(x = train2.pred1)) + 
    geom_histogram(bins = 50, fill = 'grey50') +
    labs(title = 'Histogram of Predictions') +
    theme_bw()

plot.roc(train$TARGET_FLAG, train2$pred1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
#extract variables that are significant and rerun model
sigvars <- data.frame(summary(logit)$coef[summary(logit)$coef[,4] <= .05, 4])
sigvars <- add_rownames(sigvars, "vars")
## Warning: Deprecated, use tibble::rownames_to_column() instead.
colist<-dplyr::pull(sigvars, vars)
# colist<-colist[2:11]
colist<-c("KIDSDRIV","INCOME","PARENT1","HOME_VAL","MSTATUS","EDUCATION","JOB","TRAVTIME","CAR_USE","BLUEBOOK","TIF","CAR_TYPE","CLM_FREQ","REVOKED","MVR_PTS","URBANICITY")
idx <- match(colist, names(train))
trainmod2 <- cbind(train[,idx], train2['TARGET_FLAG'])
#MODEL 2
logit2 <- glm(TARGET_FLAG ~ ., data=trainmod2, family = "binomial" (link="logit"))
summary(logit2)
## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"), 
##     data = trainmod2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5523  -0.7190  -0.3985   0.6497   3.1365  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -8.728e-01  2.620e-01  -3.332 0.000863 ***
## KIDSDRIV                         7.664e-01  9.775e-02   7.841 4.48e-15 ***
## INCOME                          -3.552e-06  1.071e-06  -3.317 0.000910 ***
## PARENT1Yes                       4.476e-01  9.451e-02   4.736 2.18e-06 ***
## HOME_VAL                        -1.367e-06  3.407e-07  -4.012 6.03e-05 ***
## MSTATUSz_No                      4.766e-01  7.969e-02   5.981 2.22e-09 ***
## EDUCATIONBachelors              -3.839e-01  1.086e-01  -3.534 0.000409 ***
## EDUCATIONMasters                -3.062e-01  1.612e-01  -1.899 0.057514 .  
## EDUCATIONPhD                    -1.761e-01  1.997e-01  -0.882 0.377940    
## EDUCATIONz_High_School           1.682e-02  9.450e-02   0.178 0.858752    
## JOBClerical                      4.011e-01  1.962e-01   2.044 0.040930 *  
## JOBDoctor                       -4.251e-01  2.658e-01  -1.599 0.109770    
## JOBHome_Maker                    2.561e-01  2.038e-01   1.257 0.208790    
## JOBLawyer                        1.091e-01  1.690e-01   0.646 0.518557    
## JOBManager                      -5.704e-01  1.711e-01  -3.335 0.000854 ***
## JOBProfessional                  1.578e-01  1.781e-01   0.886 0.375433    
## JOBStudent                       2.732e-01  2.104e-01   1.299 0.194092    
## JOBz_Blue_Collar                 3.064e-01  1.852e-01   1.654 0.098047 .  
## TRAVTIME                         1.471e-02  1.877e-03   7.837 4.61e-15 ***
## CAR_USEPrivate                  -7.623e-01  9.158e-02  -8.324  < 2e-16 ***
## BLUEBOOK                        -2.321e-05  4.715e-06  -4.922 8.56e-07 ***
## TIF                             -3.257e-01  4.135e-02  -7.875 3.41e-15 ***
## CAR_TYPEPanel_Truck              6.226e-01  1.505e-01   4.137 3.53e-05 ***
## CAR_TYPEPickup                   5.528e-01  1.006e-01   5.497 3.86e-08 ***
## CAR_TYPESports_Car               9.746e-01  1.074e-01   9.077  < 2e-16 ***
## CAR_TYPEVan                      6.466e-01  1.220e-01   5.301 1.15e-07 ***
## CAR_TYPEz_SUV                    7.218e-01  8.585e-02   8.407  < 2e-16 ***
## CLM_FREQ                         3.624e-01  5.464e-02   6.631 3.33e-11 ***
## REVOKEDYes                       7.349e-01  8.022e-02   9.161  < 2e-16 ***
## MVR_PTS                          2.863e-01  4.138e-02   6.920 4.51e-12 ***
## URBANICITYz_Highly_Rural/ Rural -2.373e+00  1.129e-01 -21.024  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7314.8  on 8130  degrees of freedom
## AIC: 7376.8
## 
## Number of Fisher Scoring iterations: 5
exp(logit2$coefficients)
##                     (Intercept)                        KIDSDRIV 
##                      0.41776744                      2.15209411 
##                          INCOME                      PARENT1Yes 
##                      0.99999645                      1.56455092 
##                        HOME_VAL                     MSTATUSz_No 
##                      0.99999863                      1.61063279 
##              EDUCATIONBachelors                EDUCATIONMasters 
##                      0.68119924                      0.73624403 
##                    EDUCATIONPhD          EDUCATIONz_High_School 
##                      0.83855457                      1.01695952 
##                     JOBClerical                       JOBDoctor 
##                      1.49341342                      0.65370747 
##                   JOBHome_Maker                       JOBLawyer 
##                      1.29192357                      1.11527301 
##                      JOBManager                 JOBProfessional 
##                      0.56527588                      1.17097642 
##                      JOBStudent                JOBz_Blue_Collar 
##                      1.31419704                      1.35848308 
##                        TRAVTIME                  CAR_USEPrivate 
##                      1.01482211                      0.46658415 
##                        BLUEBOOK                             TIF 
##                      0.99997679                      0.72205023 
##             CAR_TYPEPanel_Truck                  CAR_TYPEPickup 
##                      1.86373779                      1.73806537 
##              CAR_TYPESports_Car                     CAR_TYPEVan 
##                      2.65019303                      1.90901065 
##                   CAR_TYPEz_SUV                        CLM_FREQ 
##                      2.05810528                      1.43671851 
##                      REVOKEDYes                         MVR_PTS 
##                      2.08518827                      1.33154581 
## URBANICITYz_Highly_Rural/ Rural 
##                      0.09320986
logit2scalar <- mean(dlogis(predict(logit2, type = "link")))
logit2scalar * coef(logit2)
##                     (Intercept)                        KIDSDRIV 
##                   -1.274002e-01                    1.118714e-01 
##                          INCOME                      PARENT1Yes 
##                   -5.185070e-07                    6.533249e-02 
##                        HOME_VAL                     MSTATUSz_No 
##                   -1.994968e-07                    6.956953e-02 
##              EDUCATIONBachelors                EDUCATIONMasters 
##                   -5.603494e-02                   -4.469269e-02 
##                    EDUCATIONPhD          EDUCATIONz_High_School 
##                   -2.570038e-02                    2.454692e-03 
##                     JOBClerical                       JOBDoctor 
##                    5.854023e-02                   -6.204783e-02 
##                   JOBHome_Maker                       JOBLawyer 
##                    3.738562e-02                    1.592436e-02 
##                      JOBManager                 JOBProfessional 
##                   -8.326286e-02                    2.303837e-02 
##                      JOBStudent                JOBz_Blue_Collar 
##                    3.988064e-02                    4.471824e-02 
##                        TRAVTIME                  CAR_USEPrivate 
##                    2.147590e-03                   -1.112694e-01 
##                        BLUEBOOK                             TIF 
##                   -3.387609e-06                   -4.753412e-02 
##             CAR_TYPEPanel_Truck                  CAR_TYPEPickup 
##                    9.087371e-02                    8.068389e-02 
##              CAR_TYPESports_Car                     CAR_TYPEVan 
##                    1.422595e-01                    9.437697e-02 
##                   CAR_TYPEz_SUV                        CLM_FREQ 
##                    1.053534e-01                    5.289110e-02 
##                      REVOKEDYes                         MVR_PTS 
##                    1.072616e-01                    4.179488e-02 
## URBANICITYz_Highly_Rural/ Rural 
##                   -3.463539e-01
predlogit2 <- predict(logit2, type="response")
train2$pred2 <- predict(logit2, type="response")
summary(predlogit2)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002282 0.077191 0.202256 0.263816 0.403691 0.961502
table(true = train$TARGET_FLAG, pred = round(fitted(logit2)))
##     pred
## true    0    1
##    0 5541  467
##    1 1247  906
#plots for Model 2
par(mfrow=c(2,2))

plot(logit2)

data.frame(train2$pred2) %>%
    ggplot(aes(x = train2.pred2)) + 
    geom_histogram(bins = 50, fill = 'grey50') +
    labs(title = 'Histogram of Predictions') +
    theme_bw()

plot.roc(train$TARGET_FLAG, train2$pred2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
#MODEL 3
#PC Model no racial bias
logit3 <- glm(TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, data=train, family = "binomial" (link="logit"))
summary(logit3)
## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, 
##     family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5299  -0.8217  -0.6749   1.2315   2.8090  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.876e-01  7.305e-02  -9.412  < 2e-16 ***
## KIDSDRIV     7.266e-01  8.115e-02   8.953  < 2e-16 ***
## INCOME      -3.497e-06  6.826e-07  -5.123 3.01e-07 ***
## HOME_VAL    -2.972e-06  2.499e-07 -11.895  < 2e-16 ***
## TRAVTIME     5.880e-03  1.598e-03   3.679 0.000234 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 9021.1  on 8156  degrees of freedom
## AIC: 9031.1
## 
## Number of Fisher Scoring iterations: 4
exp(logit3$coefficients)
## (Intercept)    KIDSDRIV      INCOME    HOME_VAL    TRAVTIME 
##   0.5028055   2.0679778   0.9999965   0.9999970   1.0058969
predlogit3 <- predict(logit3, type="response")
train2$pred3 <- predict(logit3, type="response")
summary(predlogit3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01176 0.19679 0.25557 0.26382 0.32927 0.68970
table(true = train$TARGET_FLAG, pred = round(fitted(logit3)))
##     pred
## true    0    1
##    0 5937   71
##    1 2086   67
#plots for Model 3
par(mfrow=c(2,2))

plot(logit3)

data.frame(train2$pred3) %>%
    ggplot(aes(x = train2.pred3)) + 
    geom_histogram(bins = 50, fill = 'grey50') +
    labs(title = 'Histogram of Predictions') +
    theme_bw()

plot.roc(train$TARGET_FLAG, train2$pred3)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
logit3scalar <- mean(dlogis(predict(logit3, type = "link")))
logit3scalar * coef(logit3)
##   (Intercept)      KIDSDRIV        INCOME      HOME_VAL      TRAVTIME 
## -1.271908e-01  1.344090e-01 -6.468917e-07 -5.498016e-07  1.087668e-03
round(logitscalar * coef(logit),2)
##                     (Intercept)                        KIDSDRIV 
##                           -0.12                            0.10 
##                             AGE                        HOMEKIDS 
##                            0.00                            0.02 
##                             YOJ                          INCOME 
##                            0.00                            0.00 
##                      PARENT1Yes                        HOME_VAL 
##                            0.05                            0.00 
##                     MSTATUSz_No                          SEXz_F 
##                            0.08                           -0.01 
##              EDUCATIONBachelors                EDUCATIONMasters 
##                           -0.05                           -0.04 
##                    EDUCATIONPhD          EDUCATIONz_High_School 
##                           -0.02                            0.00 
##                     JOBClerical                       JOBDoctor 
##                            0.06                           -0.06 
##                   JOBHome_Maker                       JOBLawyer 
##                            0.03                            0.02 
##                      JOBManager                 JOBProfessional 
##                           -0.08                            0.02 
##                      JOBStudent                JOBz_Blue_Collar 
##                            0.03                            0.05 
##                        TRAVTIME                  CAR_USEPrivate 
##                            0.00                           -0.11 
##                        BLUEBOOK                             TIF 
##                            0.00                           -0.05 
##             CAR_TYPEPanel_Truck                  CAR_TYPEPickup 
##                            0.08                            0.08 
##              CAR_TYPESports_Car                     CAR_TYPEVan 
##                            0.15                            0.09 
##                   CAR_TYPEz_SUV                      RED_CARyes 
##                            0.11                            0.00 
##                        OLDCLAIM                        CLM_FREQ 
##                            0.00                            0.05 
##                      REVOKEDYes                         MVR_PTS 
##                            0.11                            0.04 
##                         CAR_AGE URBANICITYz_Highly_Rural/ Rural 
##                            0.00                           -0.35
round(logit2scalar * coef(logit2),2)
##                     (Intercept)                        KIDSDRIV 
##                           -0.13                            0.11 
##                          INCOME                      PARENT1Yes 
##                            0.00                            0.07 
##                        HOME_VAL                     MSTATUSz_No 
##                            0.00                            0.07 
##              EDUCATIONBachelors                EDUCATIONMasters 
##                           -0.06                           -0.04 
##                    EDUCATIONPhD          EDUCATIONz_High_School 
##                           -0.03                            0.00 
##                     JOBClerical                       JOBDoctor 
##                            0.06                           -0.06 
##                   JOBHome_Maker                       JOBLawyer 
##                            0.04                            0.02 
##                      JOBManager                 JOBProfessional 
##                           -0.08                            0.02 
##                      JOBStudent                JOBz_Blue_Collar 
##                            0.04                            0.04 
##                        TRAVTIME                  CAR_USEPrivate 
##                            0.00                           -0.11 
##                        BLUEBOOK                             TIF 
##                            0.00                           -0.05 
##             CAR_TYPEPanel_Truck                  CAR_TYPEPickup 
##                            0.09                            0.08 
##              CAR_TYPESports_Car                     CAR_TYPEVan 
##                            0.14                            0.09 
##                   CAR_TYPEz_SUV                        CLM_FREQ 
##                            0.11                            0.05 
##                      REVOKEDYes                         MVR_PTS 
##                            0.11                            0.04 
## URBANICITYz_Highly_Rural/ Rural 
##                           -0.35
round(logit3scalar * coef(logit3),2)
## (Intercept)    KIDSDRIV      INCOME    HOME_VAL    TRAVTIME 
##       -0.13        0.13        0.00        0.00        0.00

Build Models GENERAL TARGET_AMT

#MODEL 1
model <- lm(TARGET_AMT ~ ., data=train)
summary(model)
## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6234   -465    -58    243 101178 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -5.975e+02  5.010e+02  -1.193   0.2331    
## TARGET_FLAG1                     5.707e+03  1.134e+02  50.329  < 2e-16 ***
## KIDSDRIV                        -2.216e+01  1.781e+02  -0.124   0.9010    
## AGE                              6.145e+00  6.271e+00   0.980   0.3272    
## HOMEKIDS                         9.215e+01  1.256e+02   0.733   0.4633    
## YOJ                              7.685e+00  1.319e+01   0.583   0.5601    
## INCOME                          -2.258e-03  1.577e-03  -1.431   0.1524    
## PARENT1Yes                       1.209e+02  1.830e+02   0.661   0.5088    
## HOME_VAL                         3.864e-04  5.165e-04   0.748   0.4545    
## MSTATUSz_No                      1.770e+02  1.282e+02   1.381   0.1673    
## SEXz_F                          -2.896e+02  1.606e+02  -1.804   0.0713 .  
## EDUCATIONBachelors               6.823e+01  1.790e+02   0.381   0.7031    
## EDUCATIONMasters                 2.235e+02  2.620e+02   0.853   0.3937    
## EDUCATIONPhD                     4.283e+02  3.110e+02   1.377   0.1685    
## EDUCATIONz_High_School          -1.243e+02  1.502e+02  -0.828   0.4077    
## JOBClerical                     -8.406e+00  2.984e+02  -0.028   0.9775    
## JOBDoctor                       -2.812e+02  3.571e+02  -0.788   0.4310    
## JOBHome_Maker                   -7.045e+01  3.185e+02  -0.221   0.8249    
## JOBLawyer                        7.660e+01  2.582e+02   0.297   0.7667    
## JOBManager                      -1.265e+02  2.521e+02  -0.502   0.6158    
## JOBProfessional                  1.733e+02  2.698e+02   0.642   0.5206    
## JOBStudent                      -1.306e+02  3.266e+02  -0.400   0.6892    
## JOBz_Blue_Collar                 5.187e+01  2.813e+02   0.184   0.8537    
## TRAVTIME                         5.682e-01  2.824e+00   0.201   0.8405    
## CAR_USEPrivate                  -9.993e+01  1.443e+02  -0.693   0.4886    
## BLUEBOOK                         2.944e-02  7.536e-03   3.906 9.45e-05 ***
## TIF                             -1.653e+01  6.277e+01  -0.263   0.7922    
## CAR_TYPEPanel_Truck             -5.880e+01  2.430e+02  -0.242   0.8088    
## CAR_TYPEPickup                  -3.318e+01  1.493e+02  -0.222   0.8241    
## CAR_TYPESports_Car               2.098e+02  1.910e+02   1.099   0.2720    
## CAR_TYPEVan                      9.709e+01  1.865e+02   0.521   0.6026    
## CAR_TYPEz_SUV                    1.621e+02  1.571e+02   1.032   0.3021    
## RED_CARyes                      -2.696e+01  1.302e+02  -0.207   0.8360    
## OLDCLAIM                         4.079e+00  2.908e+01   0.140   0.8884    
## CLM_FREQ                        -8.551e+01  2.210e+02  -0.387   0.6989    
## REVOKEDYes                      -2.991e+02  1.385e+02  -2.160   0.0308 *  
## MVR_PTS                          1.396e+02  6.716e+01   2.079   0.0376 *  
## CAR_AGE                         -2.520e+01  1.118e+01  -2.254   0.0242 *  
## URBANICITYz_Highly_Rural/ Rural  2.987e+01  1.272e+02   0.235   0.8143    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3970 on 8122 degrees of freedom
## Multiple R-squared:  0.2912, Adjusted R-squared:  0.2879 
## F-statistic:  87.8 on 38 and 8122 DF,  p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(model$residuals ~ model$fitted.values)
plot(model$fitted.values,train$TARGET_AMT)

par(mfrow=c(2,2))
plot(model)

#extract variables that are significant and rerun model
sigvars <- data.frame(summary(model)$coef[summary(model)$coef[,4] <= .05, 4])
sigvars <- add_rownames(sigvars, "vars")
## Warning: Deprecated, use tibble::rownames_to_column() instead.
colist<-dplyr::pull(sigvars, vars)
colist<-c("TARGET_FLAG","BLUEBOOK","REVOKED","MVR_PTS","CAR_AGE")
idx <- match(colist, names(train))
trainmod2 <- cbind(train[,idx], train['TARGET_AMT'])
#MODEL 2
model2<-lm(TARGET_AMT ~ ., data=trainmod2)
summary(model2)
## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = trainmod2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6269   -378    -34    192 101505 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.315e+02  1.206e+02  -3.579 0.000347 ***
## TARGET_FLAG1  5.735e+03  1.036e+02  55.334  < 2e-16 ***
## BLUEBOOK      3.010e-02  5.328e-03   5.649 1.67e-08 ***
## REVOKEDYes   -2.874e+02  1.356e+02  -2.120 0.034021 *  
## MVR_PTS       1.309e+02  6.101e+01   2.145 0.031986 *  
## CAR_AGE      -1.291e+01  8.122e+00  -1.590 0.111894    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3968 on 8155 degrees of freedom
## Multiple R-squared:  0.289,  Adjusted R-squared:  0.2886 
## F-statistic: 662.9 on 5 and 8155 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model2$residuals ~ model2$fitted.values)
plot(model2$fitted.values,train$TARGET_AMT)
par(mfrow=c(2,2))

plot(model2)

par(mfrow=c(1,2))
plot(model2$residuals ~ model2$fitted.values, main="New Reduced Var Model")
abline(h = 0)
plot(model$residuals ~ model$fitted.values, main="Orignal Model All Vars")
abline(h = 0)

#MODEL 3
#remove variables with opposite coefficients
model3<-lm(TARGET_AMT ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, data=train)
summary(model3)
## 
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, 
##     data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3610  -1652  -1239   -318 106277 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.680e+03  1.470e+02  11.426  < 2e-16 ***
## KIDSDRIV     9.172e+02  1.789e+02   5.126 3.03e-07 ***
## INCOME      -1.242e-03  1.336e-03  -0.930   0.3522    
## HOME_VAL    -2.809e-03  4.920e-04  -5.710 1.17e-08 ***
## TRAVTIME     7.234e+00  3.260e+00   2.219   0.0265 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4679 on 8156 degrees of freedom
## Multiple R-squared:  0.01096,    Adjusted R-squared:  0.01047 
## F-statistic: 22.59 on 4 and 8156 DF,  p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(model3$residuals ~ model3$fitted.values)
plot(model3$fitted.values,train$TARGET_AMT)

par(mfrow=c(2,2))
plot(model3)

4. SELECT MODELS

Decide on the criteria for selecting the best multiple linear regression model and the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.

For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots. For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.

test = read.csv("https://raw.githubusercontent.com/miachen410/DATA621/master/HW%234/insurance-evaluation-data.csv")
test2<- test
dim(test)
## [1] 2141   26
test$TARGET_AMT <- 0
test$TARGET_FLAG <- 0
test = as.tbl(test) %>% 
  mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM"),
            currencyconv) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            underscore) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            as.factor) %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG))
# impute data for missing values
# use column mean for calculation
test$HOMEKIDS <- log(test$HOMEKIDS+1)
test$MVR_PTS <- log(test$MVR_PTS+1)
test$OLDCLAIM <- log(test$OLDCLAIM+1)
test$TIF <- log(test$TIF+1)
test$KIDSDRIV <- log(test$KIDSDRIV+1)
test$CLM_FREQ <- log(test$CLM_FREQ+1)
# use column mean for calculation
test$AGE[is.na(test$AGE)] <- mean(test$AGE, na.rm=TRUE)
test$YOJ[is.na(test$YOJ)] <- mean(test$YOJ, na.rm=TRUE)
test$HOME_VAL[is.na(test$HOME_VAL)] <- mean(test$HOME_VAL, na.rm=TRUE)
test$CAR_AGE[is.na(test$CAR_AGE)] <- mean(test$CAR_AGE, na.rm=TRUE)
test$INCOME[is.na(test$INCOME)] <- mean(test$INCOME, na.rm=TRUE)
#get complete cases
#remove rad per correlation in prior section
test <- test[, !(colnames(test) %in% c("INDEX"))]
TARGET_FLAG <- predict(logit, newdata = test, type="response")
y_pred_num <- ifelse(TARGET_FLAG > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels=c(0, 1))
summary(y_pred)
##    0    1 
## 1776  365
rbind(round(summary(predlogit),4), round(summary(TARGET_FLAG),4)) %>% kable()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0024 0.0774 0.2017 0.2638 0.4035 0.9589
0.0031 0.0777 0.2183 0.2708 0.4102 0.9464
test$TARGET_FLAG <- as.factor(test$TARGET_FLAG)
test2 <- test[, !(colnames(test) %in% c("TARGET_FLAG"))]
TARGET_AMT<- predict(model, newdata = test, interval='confidence') #data from scaling originally to get to actual wins
summary(TARGET_AMT)
##       fit                 lwr               upr        
##  Min.   :-1206.170   Min.   :-1870.4   Min.   :-542.0  
##  1st Qu.: -255.615   1st Qu.: -782.6   1st Qu.: 256.4  
##  Median :  -22.708   Median : -538.1   Median : 478.1  
##  Mean   :   -8.173   Mean   : -540.5   Mean   : 524.1  
##  3rd Qu.:  223.762   3rd Qu.: -303.8   3rd Qu.: 774.3  
##  Max.   : 1251.287   Max.   :  521.4   Max.   :1998.7
summary(model)
## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6234   -465    -58    243 101178 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -5.975e+02  5.010e+02  -1.193   0.2331    
## TARGET_FLAG1                     5.707e+03  1.134e+02  50.329  < 2e-16 ***
## KIDSDRIV                        -2.216e+01  1.781e+02  -0.124   0.9010    
## AGE                              6.145e+00  6.271e+00   0.980   0.3272    
## HOMEKIDS                         9.215e+01  1.256e+02   0.733   0.4633    
## YOJ                              7.685e+00  1.319e+01   0.583   0.5601    
## INCOME                          -2.258e-03  1.577e-03  -1.431   0.1524    
## PARENT1Yes                       1.209e+02  1.830e+02   0.661   0.5088    
## HOME_VAL                         3.864e-04  5.165e-04   0.748   0.4545    
## MSTATUSz_No                      1.770e+02  1.282e+02   1.381   0.1673    
## SEXz_F                          -2.896e+02  1.606e+02  -1.804   0.0713 .  
## EDUCATIONBachelors               6.823e+01  1.790e+02   0.381   0.7031    
## EDUCATIONMasters                 2.235e+02  2.620e+02   0.853   0.3937    
## EDUCATIONPhD                     4.283e+02  3.110e+02   1.377   0.1685    
## EDUCATIONz_High_School          -1.243e+02  1.502e+02  -0.828   0.4077    
## JOBClerical                     -8.406e+00  2.984e+02  -0.028   0.9775    
## JOBDoctor                       -2.812e+02  3.571e+02  -0.788   0.4310    
## JOBHome_Maker                   -7.045e+01  3.185e+02  -0.221   0.8249    
## JOBLawyer                        7.660e+01  2.582e+02   0.297   0.7667    
## JOBManager                      -1.265e+02  2.521e+02  -0.502   0.6158    
## JOBProfessional                  1.733e+02  2.698e+02   0.642   0.5206    
## JOBStudent                      -1.306e+02  3.266e+02  -0.400   0.6892    
## JOBz_Blue_Collar                 5.187e+01  2.813e+02   0.184   0.8537    
## TRAVTIME                         5.682e-01  2.824e+00   0.201   0.8405    
## CAR_USEPrivate                  -9.993e+01  1.443e+02  -0.693   0.4886    
## BLUEBOOK                         2.944e-02  7.536e-03   3.906 9.45e-05 ***
## TIF                             -1.653e+01  6.277e+01  -0.263   0.7922    
## CAR_TYPEPanel_Truck             -5.880e+01  2.430e+02  -0.242   0.8088    
## CAR_TYPEPickup                  -3.318e+01  1.493e+02  -0.222   0.8241    
## CAR_TYPESports_Car               2.098e+02  1.910e+02   1.099   0.2720    
## CAR_TYPEVan                      9.709e+01  1.865e+02   0.521   0.6026    
## CAR_TYPEz_SUV                    1.621e+02  1.571e+02   1.032   0.3021    
## RED_CARyes                      -2.696e+01  1.302e+02  -0.207   0.8360    
## OLDCLAIM                         4.079e+00  2.908e+01   0.140   0.8884    
## CLM_FREQ                        -8.551e+01  2.210e+02  -0.387   0.6989    
## REVOKEDYes                      -2.991e+02  1.385e+02  -2.160   0.0308 *  
## MVR_PTS                          1.396e+02  6.716e+01   2.079   0.0376 *  
## CAR_AGE                         -2.520e+01  1.118e+01  -2.254   0.0242 *  
## URBANICITYz_Highly_Rural/ Rural  2.987e+01  1.272e+02   0.235   0.8143    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3970 on 8122 degrees of freedom
## Multiple R-squared:  0.2912, Adjusted R-squared:  0.2879 
## F-statistic:  87.8 on 38 and 8122 DF,  p-value: < 2.2e-16