Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.

Overview In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero. Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

  1. DATA EXPLORATION (25 Points)

Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas. a. Mean / Standard Deviation / Median b. Bar Chart or Box Plot of the data c. Is the data correlated to the target variable (or to other variables?) d. Are any of the variables missing and need to be imputed “fixed”?

train <- read.csv("https://raw.githubusercontent.com/jjohn81/DATA621_Assignment4/master/insurance_training.csv")

check the dimension of the dataset

dim(train)
## [1] 8161   26

Databases were loaded from github repository: https://github.com/jjohn81/DATA621_Assignment4 In train dataset, there were 8161 observers and 26 variables. In evaluation dataset, there were 2141 observers and 26 variables.

Summarize dataset into median, quantile, min, max

summary(train)
##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ            INCOME    
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   $0      : 615  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0           : 445  
##  Median :45.00   Median :0.0000   Median :11.0   $26,840 :   4  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5   $48,509 :   4  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0   $61,790 :   4  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0   $107,375:   3  
##  NA's   :6                        NA's   :454    (Other) :7086  
##  PARENT1        HOME_VAL    MSTATUS      SEX               EDUCATION   
##  No :7084   $0      :2294   Yes :4894   M  :3786   <High School :1203  
##  Yes:1077           : 464   z_No:3267   z_F:4375   Bachelors    :2242  
##             $111,129:   3                          Masters      :1658  
##             $115,249:   3                          PhD          : 728  
##             $123,109:   3                          z_High School:2330  
##             $153,061:   3                                              
##             (Other) :5391                                              
##             JOB          TRAVTIME            CAR_USE        BLUEBOOK   
##  z_Blue Collar:1825   Min.   :  5.00   Commercial:3029   $1,500 : 157  
##  Clerical     :1271   1st Qu.: 22.00   Private   :5132   $6,000 :  34  
##  Professional :1117   Median : 33.00                     $5,800 :  33  
##  Manager      : 988   Mean   : 33.49                     $6,200 :  33  
##  Lawyer       : 835   3rd Qu.: 44.00                     $6,400 :  31  
##  Student      : 712   Max.   :142.00                     $5,900 :  30  
##  (Other)      :1413                                      (Other):7843  
##       TIF                CAR_TYPE    RED_CAR       OLDCLAIM   
##  Min.   : 1.000   Minivan    :2145   no :5783   $0     :5009  
##  1st Qu.: 1.000   Panel Truck: 676   yes:2378   $1,310 :   4  
##  Median : 4.000   Pickup     :1389              $1,391 :   4  
##  Mean   : 5.351   Sports Car : 907              $4,263 :   4  
##  3rd Qu.: 7.000   Van        : 750              $1,105 :   3  
##  Max.   :25.000   z_SUV      :2294              $1,332 :   3  
##                                                 (Other):3134  
##     CLM_FREQ      REVOKED       MVR_PTS          CAR_AGE      
##  Min.   :0.0000   No :7161   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000              Median : 1.000   Median : 8.000  
##  Mean   :0.7986              Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000              3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000              Max.   :13.000   Max.   :28.000  
##                                               NA's   :510     
##                  URBANICITY  
##  Highly Urban/ Urban  :6492  
##  z_Highly Rural/ Rural:1669  
##                              
##                              
##                              
##                              
## 

Check the structure of dataset

str(train)
## 'data.frame':    8161 obs. of  26 variables:
##  $ INDEX      : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : Factor w/ 6613 levels "","$0","$1,007",..: 5033 6292 1250 1 509 746 1488 315 4765 282 ...
##  $ PARENT1    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ HOME_VAL   : Factor w/ 5107 levels "","$0","$100,093",..: 2 3259 348 3917 3034 2 1 4167 2 2 ...
##  $ MSTATUS    : Factor w/ 2 levels "Yes","z_No": 2 2 1 1 1 2 1 1 2 2 ...
##  $ SEX        : Factor w/ 2 levels "M","z_F": 1 1 2 1 2 2 2 1 2 1 ...
##  $ EDUCATION  : Factor w/ 5 levels "<High School",..: 4 5 5 1 4 2 1 2 2 2 ...
##  $ JOB        : Factor w/ 9 levels "","Clerical",..: 7 9 2 9 3 9 9 9 2 7 ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
##  $ BLUEBOOK   : Factor w/ 2789 levels "$1,500","$1,520",..: 434 503 2212 553 802 746 2672 701 135 852 ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
##  $ RED_CAR    : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 2 1 1 ...
##  $ OLDCLAIM   : Factor w/ 2857 levels "$0","$1,000",..: 1449 1 1311 1 432 1 1 510 1 1 ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 1 1 1 1 1 1 1 2 ...

visualize missing data

vis_miss(train)

Detailed missing data information

sapply(train, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
x
INDEX 0
TARGET_FLAG 0
TARGET_AMT 0
KIDSDRIV 0
AGE 6
HOMEKIDS 0
YOJ 454
INCOME 0
PARENT1 0
HOME_VAL 0
MSTATUS 0
SEX 0
EDUCATION 0
JOB 0
TRAVTIME 0
CAR_USE 0
BLUEBOOK 0
TIF 0
CAR_TYPE 0
RED_CAR 0
OLDCLAIM 0
CLM_FREQ 0
REVOKED 0
MVR_PTS 0
CAR_AGE 510
URBANICITY 0

So there are 6 missing data in Age, 454 missing data of YOK, 445 missing data of income, 464 missing data of home value.

Data transformation

#remove "$"
money = function(input) {
  out = sub("\\$", "", input)
  out = as.numeric(sub(",", "", out))
  return(out)
}

# Remove " ", replace with "_"
underscore = function(input) {
  out = sub(" ", "_", input)
  return(out)
}

train = as.tbl(train) %>% 
  mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM"),
            money) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            underscore) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            as.factor) %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG))

1: remove $ sign from INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM; 2: replace " " with underscore “_" of variables : EDUCATION, JOB, CAR_TYPE, URBANICITY 3: change it as factors for above variabls plus TARGET_FLA

`

  1. DATA PREPARATION (25 Points)

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.

  1. Fix missing values (maybe with a Mean or Median value) b. Create flags to suggest if a variable was missing c. Transform data by putting it into buckets d. Mathematical transforms such as log or square root (or use Box-Cox) e. Combine variables (such as ratios or adding or multiplying) to create new variables

Impute missing data with column means

train$AGE[is.na(train$AGE)] <- mean(train$AGE, na.rm=TRUE)
train$YOJ[is.na(train$YOJ)] <- mean(train$YOJ, na.rm=TRUE)
train$HOME_VAL[is.na(train$HOME_VAL)] <- mean(train$HOME_VAL, na.rm=TRUE)
train$CAR_AGE[is.na(train$CAR_AGE)] <- mean(train$CAR_AGE, na.rm=TRUE)
train$INCOME[is.na(train$INCOME)] <- mean(train$INCOME, na.rm=TRUE)
train <- train[complete.cases(train),]
vis_miss(train)

So far, missing data have been fixed.

Save the completed data to train_clean

train_clean <- train

Remove INDEX columns

train_clean$INDEX <- NULL
#Convert indicator variables to 0s and 1s; 1 = Yes, Male for Sex, Commercial for Car Use, Red for RED_CAR, and Highly Urban for URBANICITY
train_clean$PARENT1 <- ifelse(train_clean$PARENT1=="Yes", 1, 0)
train_clean$MSTATUS <- ifelse(train_clean$MSTATUS=="Yes", 1, 0)
train_clean$SEX <- ifelse(train_clean$SEX=="M", 1, 0)
train_clean$CAR_USE <- ifelse(train_clean$CAR_USE=="Commercial", 1, 0)
train_clean$RED_CAR <- ifelse(train_clean$RED_CAR=="Yes", 1, 0)
train_clean$REVOKED <- ifelse(train_clean$REVOKED=="Yes", 1, 0)
train_clean$URBANICITY <- ifelse(train_clean$URBANICITY == "Highly Urban/ Urban", 1, 0)

#Convert categorical predictor values to indicator variables - EDUCATION, CAR_TYPE, JOB

#EDUCATION, High school graduate is base case
train_clean$HSDropout <- ifelse(train_clean$EDUCATION=="<High School", 1, 0)
train_clean$Bachelors <- ifelse(train_clean$EDUCATION=="Bachelors", 1, 0)
train_clean$Masters <- ifelse(train_clean$EDUCATION=="Masters", 1, 0)
train_clean$PhD <- ifelse(train_clean$EDUCATION=="PhD", 1, 0)

#CAR_TYPE, base case is minivan
train_clean$Panel_Truck <- ifelse(train_clean$CAR_TYPE=="Panel Truck", 1, 0)
train_clean$Pickup <- ifelse(train_clean$CAR_TYPE=="Pickup", 1, 0)
train_clean$Sports_Car <- ifelse(train_clean$CAR_TYPE=="Sports Car", 1, 0)
train_clean$Van <- ifelse(train_clean$CAR_TYPE=="Van", 1, 0)
train_clean$SUV <- ifelse(train_clean$CAR_TYPE=="z_SUV", 1, 0)

#JOB, base case is ""
train_clean$Professional <- ifelse(train_clean$JOB == "Professional", 1, 0)
train_clean$Blue_Collar <- ifelse(train_clean$JOB == "Professional", 1, 0)
train_clean$Clerical <- ifelse(train_clean$JOB == "Clerical", 1, 0)
train_clean$Doctor <- ifelse(train_clean$JOB == "Doctor", 1, 0)
train_clean$Lawyer <- ifelse(train_clean$JOB == "Lawyer", 1, 0)
train_clean$Manager <- ifelse(train_clean$JOB == "Manager", 1, 0)
train_clean$Home_Maker <- ifelse(train_clean$JOB == "Home Maker", 1, 0)
train_clean$Student <- ifelse(train_clean$JOB == "Student", 1, 0)

INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM are represented as strings. So we will be extracting the numeric values for these.

train_clean$INCOME <- as.numeric(train_clean$INCOME)
train_clean$HOME_VAL <- as.numeric(train_clean$HOME_VAL)
train_clean$BLUEBOOK <- as.numeric(train_clean$BLUEBOOK)
train_clean$OLDCLAIM <- as.numeric(train_clean$OLDCLAIM)
str(train_clean)
## Classes 'tbl_df', 'tbl' and 'data.frame':    8161 obs. of  42 variables:
##  $ TARGET_FLAG : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
##  $ TARGET_AMT  : num  0 0 0 0 0 ...
##  $ KIDSDRIV    : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE         : num  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS    : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ         : num  11 11 10 14 10.5 ...
##  $ INCOME      : num  67349 91449 16039 61898 114986 ...
##  $ PARENT1     : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ HOME_VAL    : num  0 257252 124191 306251 243925 ...
##  $ MSTATUS     : num  0 0 1 1 1 0 1 1 0 0 ...
##  $ SEX         : num  1 1 0 1 0 0 0 1 0 1 ...
##  $ EDUCATION   : Factor w/ 5 levels "<High_School",..: 4 5 5 1 4 2 1 2 2 2 ...
##  $ JOB         : Factor w/ 9 levels "","Clerical",..: 7 9 2 9 3 9 9 9 2 7 ...
##  $ TRAVTIME    : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE     : num  0 1 0 0 0 1 0 1 0 1 ...
##  $ BLUEBOOK    : num  14230 14940 4010 15440 18000 ...
##  $ TIF         : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE    : Factor w/ 6 levels "Minivan","Panel_Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
##  $ RED_CAR     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OLDCLAIM    : num  4461 0 38690 0 19217 ...
##  $ CLM_FREQ    : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED     : num  0 0 0 0 1 0 0 1 0 0 ...
##  $ MVR_PTS     : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE     : num  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HSDropout   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bachelors   : num  0 0 0 0 0 1 0 1 1 1 ...
##  $ Masters     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PhD         : num  1 0 0 0 1 0 0 0 0 0 ...
##  $ Panel_Truck : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pickup      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Sports_Car  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Van         : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ SUV         : num  0 0 1 0 1 0 1 0 1 0 ...
##  $ Professional: num  1 0 0 0 0 0 0 0 0 1 ...
##  $ Blue_Collar : num  1 0 0 0 0 0 0 0 0 1 ...
##  $ Clerical    : num  0 0 1 0 0 0 0 0 1 0 ...
##  $ Doctor      : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Lawyer      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Manager     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Home_Maker  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Student     : num  0 0 0 0 0 0 0 0 0 0 ...
write.csv(train_clean, file="train_clean.csv")
getwd()
## [1] "C:/Users/tbao/Desktop/CUNY MSDS notes/DATA 621/HW4"

After this step, we got train_clean.csv file, this file has been transformed to a dataset which is easy to be used for further analysis. So I am going to saved for build a linear model to predict TARGET_AMT

Select numberical variable togeter

ntrain<-select_if(train_clean, is.numeric)

Density plot of variables

ntrain <- as.data.frame((ntrain))

par(mfrow=c(3, 3))
colnames <- dimnames(ntrain)[[2]]

  for(col in 2:ncol(train)) {

    d <- density(na.omit(ntrain[,col]))
   #d <- qqnorm(na.omit(train[,col]))
    plot(d, type="n", main=colnames[col])
    polygon(d, col="blue", border="gray")
  }

From the density plot, we can find INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM are left skewed. We are going to do transformation.

boxcoxfit(train_clean$INCOME[train_clean$INCOME >0])
## Fitted parameters:
##       lambda         beta      sigmasq 
##    0.4436021  291.9569344 8541.2268213 
## 
## Convergence code returned by optim: 0
train_clean$INCOME_MOD <- train_clean$INCOME ^0.443
boxcoxfit(train_clean$HOME_VAL[train_clean$HOME_VAL > 0])
## Fitted parameters:
##     lambda       beta    sigmasq 
##  0.1016119 24.1545983  2.1701992 
## 
## Convergence code returned by optim: 0
train_clean$HOME_VAL_MOD <- train_clean$HOME_VAL^0.102
boxcoxfit(train_clean$BLUEBOOK)
## Fitted parameters:
##       lambda         beta      sigmasq 
##    0.4610754  177.4257712 2217.4825612 
## 
## Convergence code returned by optim: 0
train_clean$BLUEBOOK_MOD <- train_clean$BLUEBOOK^0.461
boxcoxfit(train_clean$OLDCLAIM[train_clean$OLDCLAIM>0])
## Fitted parameters:
##      lambda        beta     sigmasq 
## -0.04511237  7.22517933  0.44041250 
## 
## Convergence code returned by optim: 0
train_clean$OLD_CLAIM_MOD <- log(train_clean$OLDCLAIM + 1)   
str(train_clean)
## Classes 'tbl_df', 'tbl' and 'data.frame':    8161 obs. of  46 variables:
##  $ TARGET_FLAG  : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
##  $ TARGET_AMT   : num  0 0 0 0 0 ...
##  $ KIDSDRIV     : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE          : num  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS     : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ          : num  11 11 10 14 10.5 ...
##  $ INCOME       : num  67349 91449 16039 61898 114986 ...
##  $ PARENT1      : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ HOME_VAL     : num  0 257252 124191 306251 243925 ...
##  $ MSTATUS      : num  0 0 1 1 1 0 1 1 0 0 ...
##  $ SEX          : num  1 1 0 1 0 0 0 1 0 1 ...
##  $ EDUCATION    : Factor w/ 5 levels "<High_School",..: 4 5 5 1 4 2 1 2 2 2 ...
##  $ JOB          : Factor w/ 9 levels "","Clerical",..: 7 9 2 9 3 9 9 9 2 7 ...
##  $ TRAVTIME     : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE      : num  0 1 0 0 0 1 0 1 0 1 ...
##  $ BLUEBOOK     : num  14230 14940 4010 15440 18000 ...
##  $ TIF          : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE     : Factor w/ 6 levels "Minivan","Panel_Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
##  $ RED_CAR      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OLDCLAIM     : num  4461 0 38690 0 19217 ...
##  $ CLM_FREQ     : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED      : num  0 0 0 0 1 0 0 1 0 0 ...
##  $ MVR_PTS      : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE      : num  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HSDropout    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bachelors    : num  0 0 0 0 0 1 0 1 1 1 ...
##  $ Masters      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PhD          : num  1 0 0 0 1 0 0 0 0 0 ...
##  $ Panel_Truck  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pickup       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Sports_Car   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Van          : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ SUV          : num  0 0 1 0 1 0 1 0 1 0 ...
##  $ Professional : num  1 0 0 0 0 0 0 0 0 1 ...
##  $ Blue_Collar  : num  1 0 0 0 0 0 0 0 0 1 ...
##  $ Clerical     : num  0 0 1 0 0 0 0 0 1 0 ...
##  $ Doctor       : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Lawyer       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Manager      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Home_Maker   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Student      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INCOME_MOD   : num  137.7 157.7 72.9 132.7 174.5 ...
##  $ HOME_VAL_MOD : num  0 3.56 3.31 3.63 3.54 ...
##  $ BLUEBOOK_MOD : num  82.2 84 45.8 85.3 91.6 ...
##  $ OLD_CLAIM_MOD: num  8.4 0 10.56 0 9.86 ...
pairs(~MVR_PTS+CLM_FREQ+URBANICITY+HOME_VAL+PARENT1+CAR_USE+OLDCLAIM, data=train_clean, main="Predictors with High Correlattions to Targets", col="slategrey")

train_mod <- train_clean
write.csv(train_mod, file="train_mod.csv")

Write down a train_mod file with modification of income, home value, old claim and bluebook value.

  1. BUILD MODELS (25 Points)

Using the training data set, build at least two different multiple linear regression models and three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.

Discuss the coefficients in the models, do they make sense? For example, if a person has a lot of traffic tickets, you would reasonably expect that person to have more car crashes. If the coefficient is negative (suggesting that the person is a safer driver), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

head(train_clean)
## # A tibble: 6 x 46
##   TARGET_FLAG TARGET_AMT KIDSDRIV   AGE HOMEKIDS   YOJ INCOME PARENT1
##   <fct>            <dbl>    <int> <dbl>    <int> <dbl>  <dbl>   <dbl>
## 1 0                    0        0    60        0  11   6.73e4       0
## 2 0                    0        0    43        0  11   9.14e4       0
## 3 0                    0        0    35        1  10   1.60e4       0
## 4 0                    0        0    51        0  14   6.19e4       0
## 5 0                    0        0    50        0  10.5 1.15e5       0
## 6 1                 2946        0    34        1  12   1.25e5       1
## # ... with 38 more variables: HOME_VAL <dbl>, MSTATUS <dbl>, SEX <dbl>,
## #   EDUCATION <fct>, JOB <fct>, TRAVTIME <int>, CAR_USE <dbl>,
## #   BLUEBOOK <dbl>, TIF <int>, CAR_TYPE <fct>, RED_CAR <dbl>,
## #   OLDCLAIM <dbl>, CLM_FREQ <int>, REVOKED <dbl>, MVR_PTS <int>,
## #   CAR_AGE <dbl>, URBANICITY <dbl>, HSDropout <dbl>, Bachelors <dbl>,
## #   Masters <dbl>, PhD <dbl>, Panel_Truck <dbl>, Pickup <dbl>,
## #   Sports_Car <dbl>, Van <dbl>, SUV <dbl>, Professional <dbl>,
## #   Blue_Collar <dbl>, Clerical <dbl>, Doctor <dbl>, Lawyer <dbl>,
## #   Manager <dbl>, Home_Maker <dbl>, Student <dbl>, INCOME_MOD <dbl>,
## #   HOME_VAL_MOD <dbl>, BLUEBOOK_MOD <dbl>, OLD_CLAIM_MOD <dbl>

First model, we call it original_full_model, to seek the correlation between TARGET_FLAG wiht original viarables but not including the derived variables

train_flag <- train_clean[,-c(2)] 
original_full_model <- glm(TARGET_FLAG ~.-INCOME_MOD-HOME_VAL_MOD-BLUEBOOK_MOD-OLD_CLAIM_MOD, data = train_flag, family = binomial(link='logit'))
summary(original_full_model)
## 
## Call:
## glm(formula = TARGET_FLAG ~ . - INCOME_MOD - HOME_VAL_MOD - BLUEBOOK_MOD - 
##     OLD_CLAIM_MOD, family = binomial(link = "logit"), data = train_flag)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2562  -0.7497  -0.5050   0.7314   2.6763  
## 
## Coefficients: (19 not defined because of singularities)
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.308e+00  3.015e-01  -4.338 1.44e-05 ***
## KIDSDRIV                3.016e-01  5.746e-02   5.249 1.53e-07 ***
## AGE                    -3.570e-03  3.843e-03  -0.929 0.352925    
## HOMEKIDS                3.792e-02  3.527e-02   1.075 0.282307    
## YOJ                    -6.714e-03  8.111e-03  -0.828 0.407792    
## INCOME                 -3.076e-06  1.051e-06  -2.927 0.003423 ** 
## PARENT1                 3.208e-01  1.039e-01   3.087 0.002022 ** 
## HOME_VAL               -1.312e-06  3.318e-07  -3.952 7.75e-05 ***
## MSTATUS                -3.933e-01  7.994e-02  -4.920 8.64e-07 ***
## SEX                     9.232e-02  9.697e-02   0.952 0.341098    
## EDUCATIONBachelors     -2.923e-01  1.091e-01  -2.679 0.007385 ** 
## EDUCATIONMasters       -1.709e-01  1.717e-01  -0.995 0.319626    
## EDUCATIONPhD           -8.169e-02  2.062e-01  -0.396 0.691982    
## EDUCATIONz_High_School  2.895e-02  8.851e-02   0.327 0.743612    
## JOBClerical            -2.363e-03  1.914e-01  -0.012 0.990150    
## JOBDoctor              -4.431e-01  2.671e-01  -1.659 0.097109 .  
## JOBHome_Maker          -1.280e-01  2.037e-01  -0.629 0.529619    
## JOBLawyer               3.370e-02  1.672e-01   0.202 0.840275    
## JOBManager             -5.048e-01  1.703e-01  -2.963 0.003044 ** 
## JOBProfessional         2.300e-02  1.757e-01   0.131 0.895878    
## JOBStudent             -1.899e-01  2.080e-01  -0.913 0.361214    
## JOBz_Blue_Collar        1.168e-01  1.816e-01   0.643 0.520368    
## TRAVTIME                6.586e-03  1.733e-03   3.801 0.000144 ***
## CAR_USE                 6.247e-01  8.685e-02   7.192 6.37e-13 ***
## BLUEBOOK               -1.755e-05  5.055e-06  -3.473 0.000515 ***
## TIF                    -5.014e-02  7.046e-03  -7.116 1.11e-12 ***
## CAR_TYPEPanel_Truck     5.257e-01  1.557e-01   3.376 0.000736 ***
## CAR_TYPEPickup          5.150e-01  9.663e-02   5.330 9.84e-08 ***
## CAR_TYPESports_Car      8.987e-01  1.249e-01   7.193 6.36e-13 ***
## CAR_TYPEVan             5.612e-01  1.220e-01   4.598 4.26e-06 ***
## CAR_TYPEz_SUV           6.972e-01  1.073e-01   6.499 8.10e-11 ***
## RED_CAR                        NA         NA      NA       NA    
## OLDCLAIM               -1.327e-05  3.789e-06  -3.502 0.000461 ***
## CLM_FREQ                3.127e-01  2.730e-02  11.456  < 2e-16 ***
## REVOKED                 9.795e-01  8.742e-02  11.204  < 2e-16 ***
## MVR_PTS                 1.354e-01  1.308e-02  10.352  < 2e-16 ***
## CAR_AGE                -1.086e-03  7.232e-03  -0.150 0.880688    
## URBANICITY                     NA         NA      NA       NA    
## HSDropout                      NA         NA      NA       NA    
## Bachelors                      NA         NA      NA       NA    
## Masters                        NA         NA      NA       NA    
## PhD                            NA         NA      NA       NA    
## Panel_Truck                    NA         NA      NA       NA    
## Pickup                         NA         NA      NA       NA    
## Sports_Car                     NA         NA      NA       NA    
## Van                            NA         NA      NA       NA    
## SUV                            NA         NA      NA       NA    
## Professional                   NA         NA      NA       NA    
## Blue_Collar                    NA         NA      NA       NA    
## Clerical                       NA         NA      NA       NA    
## Doctor                         NA         NA      NA       NA    
## Lawyer                         NA         NA      NA       NA    
## Manager                        NA         NA      NA       NA    
## Home_Maker                     NA         NA      NA       NA    
## Student                        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7946.3  on 8125  degrees of freedom
## AIC: 8018.3
## 
## Number of Fisher Scoring iterations: 5

transform_model: including the transformed variables

train_flag <- train_clean[,-c(2)] 
transform_model <- glm(TARGET_FLAG ~., data = train_flag, family = binomial(link='logit'))
summary(transform_model)
## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"), 
##     data = train_flag)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3309  -0.7444  -0.4913   0.7536   2.7137  
## 
## Coefficients: (19 not defined because of singularities)
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.061e-01  4.148e-01  -0.497  0.61924    
## KIDSDRIV                3.124e-01  5.794e-02   5.391 7.00e-08 ***
## AGE                    -4.433e-03  3.873e-03  -1.145  0.25236    
## HOMEKIDS                2.162e-02  3.579e-02   0.604  0.54587    
## YOJ                     7.510e-03  9.406e-03   0.798  0.42463    
## INCOME                  2.271e-06  2.356e-06   0.964  0.33513    
## PARENT1                 3.108e-01  1.047e-01   2.968  0.00300 ** 
## HOME_VAL               -5.229e-07  6.301e-07  -0.830  0.40657    
## MSTATUS                -3.816e-01  8.312e-02  -4.591 4.42e-06 ***
## SEX                     1.248e-01  9.772e-02   1.277  0.20159    
## EDUCATIONBachelors     -2.448e-01  1.110e-01  -2.206  0.02735 *  
## EDUCATIONMasters       -1.121e-01  1.733e-01  -0.647  0.51771    
## EDUCATIONPhD           -1.183e-01  2.083e-01  -0.568  0.56994    
## EDUCATIONz_High_School  6.719e-02  8.987e-02   0.748  0.45467    
## JOBClerical             2.629e-02  1.921e-01   0.137  0.89115    
## JOBDoctor              -3.884e-01  2.656e-01  -1.462  0.14362    
## JOBHome_Maker          -3.070e-01  2.158e-01  -1.422  0.15495    
## JOBLawyer               4.006e-02  1.677e-01   0.239  0.81121    
## JOBManager             -4.830e-01  1.702e-01  -2.838  0.00454 ** 
## JOBProfessional         6.366e-02  1.760e-01   0.362  0.71763    
## JOBStudent             -4.582e-01  2.242e-01  -2.044  0.04094 *  
## JOBz_Blue_Collar        1.861e-01  1.824e-01   1.020  0.30757    
## TRAVTIME                7.001e-03  1.747e-03   4.008 6.12e-05 ***
## CAR_USE                 6.170e-01  8.748e-02   7.053 1.75e-12 ***
## BLUEBOOK                3.753e-05  1.900e-05   1.975  0.04825 *  
## TIF                    -4.941e-02  7.078e-03  -6.981 2.93e-12 ***
## CAR_TYPEPanel_Truck     3.919e-01  1.601e-01   2.448  0.01436 *  
## CAR_TYPEPickup          5.215e-01  9.710e-02   5.371 7.84e-08 ***
## CAR_TYPESports_Car      8.933e-01  1.257e-01   7.109 1.17e-12 ***
## CAR_TYPEVan             5.659e-01  1.224e-01   4.625 3.74e-06 ***
## CAR_TYPEz_SUV           7.178e-01  1.085e-01   6.617 3.66e-11 ***
## RED_CAR                        NA         NA      NA       NA    
## OLDCLAIM               -3.373e-05  4.655e-06  -7.245 4.32e-13 ***
## CLM_FREQ                5.348e-02  4.274e-02   1.251  0.21077    
## REVOKED                 1.087e+00  8.931e-02  12.176  < 2e-16 ***
## MVR_PTS                 1.064e-01  1.356e-02   7.847 4.25e-15 ***
## CAR_AGE                -7.413e-04  7.264e-03  -0.102  0.91872    
## URBANICITY                     NA         NA      NA       NA    
## HSDropout                      NA         NA      NA       NA    
## Bachelors                      NA         NA      NA       NA    
## Masters                        NA         NA      NA       NA    
## PhD                            NA         NA      NA       NA    
## Panel_Truck                    NA         NA      NA       NA    
## Pickup                         NA         NA      NA       NA    
## Sports_Car                     NA         NA      NA       NA    
## Van                            NA         NA      NA       NA    
## SUV                            NA         NA      NA       NA    
## Professional                   NA         NA      NA       NA    
## Blue_Collar                    NA         NA      NA       NA    
## Clerical                       NA         NA      NA       NA    
## Doctor                         NA         NA      NA       NA    
## Lawyer                         NA         NA      NA       NA    
## Manager                        NA         NA      NA       NA    
## Home_Maker                     NA         NA      NA       NA    
## Student                        NA         NA      NA       NA    
## INCOME_MOD             -6.794e-03  2.218e-03  -3.063  0.00219 ** 
## HOME_VAL_MOD           -6.199e-02  4.413e-02  -1.405  0.16004    
## BLUEBOOK_MOD           -2.052e-02  6.847e-03  -2.996  0.00273 ** 
## OLD_CLAIM_MOD           1.164e-01  1.459e-02   7.978 1.49e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7863.3  on 8121  degrees of freedom
## AIC: 7943.3
## 
## Number of Fisher Scoring iterations: 5

reduced_model: only keep the variables with p <0.05

train_flag <- train_clean[,-c(2)] 
reduced_model <- glm(TARGET_FLAG ~.-AGE-HOMEKIDS-YOJ-INCOME-PARENT1-HOME_VAL-MSTATUS-SEX-RED_CAR-CLM_FREQ-CAR_AGE-HSDropout-Professional-Blue_Collar-Clerical-Lawyer-Home_Maker-HOME_VAL_MOD-Student-Doctor-CAR_USE-REVOKED-URBANICITY-Bachelors-Masters-PhD-Panel_Truck-Pickup-Sports_Car-Van-SUV-Manager, data = train_flag, family = binomial(link='logit'))
summary(reduced_model)
## 
## Call:
## glm(formula = TARGET_FLAG ~ . - AGE - HOMEKIDS - YOJ - INCOME - 
##     PARENT1 - HOME_VAL - MSTATUS - SEX - RED_CAR - CLM_FREQ - 
##     CAR_AGE - HSDropout - Professional - Blue_Collar - Clerical - 
##     Lawyer - Home_Maker - HOME_VAL_MOD - Student - Doctor - CAR_USE - 
##     REVOKED - URBANICITY - Bachelors - Masters - PhD - Panel_Truck - 
##     Pickup - Sports_Car - Van - SUV - Manager, family = binomial(link = "logit"), 
##     data = train_flag)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9313  -0.7729  -0.5421   0.8900   2.4980  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -4.193e-01  3.601e-01  -1.164 0.244243    
## KIDSDRIV                3.607e-01  4.850e-02   7.438 1.02e-13 ***
## EDUCATIONBachelors     -1.028e-02  9.662e-02  -0.106 0.915242    
## EDUCATIONMasters        5.759e-02  1.495e-01   0.385 0.700140    
## EDUCATIONPhD            1.153e-01  1.827e-01   0.631 0.527973    
## EDUCATIONz_High_School  2.454e-01  8.257e-02   2.972 0.002958 ** 
## JOBClerical            -1.242e-01  1.855e-01  -0.669 0.503205    
## JOBDoctor              -6.858e-01  2.553e-01  -2.686 0.007230 ** 
## JOBHome_Maker          -5.750e-01  2.034e-01  -2.827 0.004697 ** 
## JOBLawyer              -2.453e-01  1.577e-01  -1.556 0.119767    
## JOBManager             -7.001e-01  1.648e-01  -4.247 2.17e-05 ***
## JOBProfessional        -1.546e-01  1.697e-01  -0.911 0.362176    
## JOBStudent             -2.315e-01  2.096e-01  -1.104 0.269477    
## JOBz_Blue_Collar        3.197e-01  1.760e-01   1.816 0.069394 .  
## TRAVTIME                6.131e-03  1.695e-03   3.617 0.000299 ***
## BLUEBOOK                3.374e-05  1.782e-05   1.893 0.058347 .  
## TIF                    -4.774e-02  6.858e-03  -6.961 3.39e-12 ***
## CAR_TYPEPanel_Truck     8.404e-01  1.359e-01   6.185 6.19e-10 ***
## CAR_TYPEPickup          7.250e-01  8.877e-02   8.167 3.17e-16 ***
## CAR_TYPESports_Car      7.555e-01  1.006e-01   7.512 5.82e-14 ***
## CAR_TYPEVan             8.006e-01  1.114e-01   7.188 6.59e-13 ***
## CAR_TYPEz_SUV           6.330e-01  8.070e-02   7.844 4.35e-15 ***
## OLDCLAIM               -7.858e-06  3.724e-06  -2.110 0.034837 *  
## MVR_PTS                 1.143e-01  1.314e-02   8.699  < 2e-16 ***
## INCOME_MOD             -5.370e-03  8.929e-04  -6.014 1.80e-09 ***
## BLUEBOOK_MOD           -2.003e-02  6.540e-03  -3.063 0.002189 ** 
## OLD_CLAIM_MOD           1.038e-01  8.744e-03  11.868  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418  on 8160  degrees of freedom
## Residual deviance: 8263  on 8134  degrees of freedom
## AIC: 8317
## 
## Number of Fisher Scoring iterations: 4
  1. SELECT MODELS (25 Points)

Decide on the criteria for selecting the best multiple linear regression model and the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.

For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots. For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.

We would like to select Reduced Model for Binary Logistic Regression models. The AIC and residual deviance for this model seemed to give the best values that would be suited for the prediction. Below is the ROC curve for model5 and to me it looks good. So i would like to proceed with Reduced Model.

train_flag$predict <- predict(reduced_model, train_flag, type='response')

roc_reduced_model <- roc(train_flag$TARGET_FLAG, train_flag$predict, plot=T, asp=NA,
                legacy.axes=T, main = "ROC Curve Reduced Model", col="red")

roc_reduced_model["auc"]
## $auc
## Area under the curve: 0.7364

Build confusion matrix

train_flag$predict_target <- ifelse(train_flag$predict >=0.5, 1, 0)
train_flag$predict_target <- as.integer(train_flag$predict_target)
myvars <- c("TARGET_FLAG", "predict_target")
train_flag_cm <- train_flag[myvars]
cm <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
knitr:: kable(cm)
0 1
0 5635 1614
1 373 539
Accuracy <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TN=tb[1,1]
TP=tb[2,2]
FN=tb[2,1]
FP=tb[1,2]
return((TP+TN)/(TP+FP+TN+FN))
}
Accuracy(data)
## [1] 0.7565249
CER <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TN=tb[1,1]
TP=tb[2,2]
FN=tb[2,1]
FP=tb[1,2]
return((FP+FN)/(TP+FP+TN+FN))
}
CER(data)
## [1] 0.2434751
Precision <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TP=tb[2,2]
FP=tb[1,2]
return((TP)/(TP+FP))
}
Precision(data)
## [1] 0.2503484
Sensitivity <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TP=tb[2,2]
FN=tb[2,1]
return((TP)/(TP+FN))
}
Sensitivity(data)
## [1] 0.5910088
Specificity <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TN=tb[1,1]
TP=tb[2,2]
FN=tb[2,1]
FP=tb[1,2]
return((TN)/(TN+FP))
}
Specificity(data)
## [1] 0.7773486
F1_score <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TN=tb[1,1]
TP=tb[2,2]
FN=tb[2,1]
FP=tb[1,2]
Precision = (TP)/(TP+FP)
Sensitivity = (TP)/(TP+FN)
Precision =(TP)/(TP+FP)
return((2*Precision*Sensitivity)/(Precision+Sensitivity))
}
F1_score(data)
## [1] 0.3517129

Test the reduced model on evaluation data

evaluation <- read.csv("https://raw.githubusercontent.com/jjohn81/DATA621_Assignment4/master/insurance-evaluation.csv")
evaluation2 <- evaluation
dim(evaluation)
## [1] 2141   26
str(evaluation)
## 'data.frame':    2141 obs. of  26 variables:
##  $ INDEX      : int  3 9 10 18 21 30 31 37 39 47 ...
##  $ TARGET_FLAG: logi  NA NA NA NA NA NA ...
##  $ TARGET_AMT : logi  NA NA NA NA NA NA ...
##  $ KIDSDRIV   : int  0 1 0 0 0 0 0 0 2 0 ...
##  $ AGE        : int  48 40 44 35 59 46 60 54 36 50 ...
##  $ HOMEKIDS   : int  0 1 2 2 0 0 0 0 2 0 ...
##  $ YOJ        : int  11 11 12 NA 12 14 12 12 12 8 ...
##  $ INCOME     : Factor w/ 1804 levels "","$0","$1,249",..: 1154 1119 974 513 1686 1 863 766 233 387 ...
##  $ PARENT1    : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 1 ...
##  $ HOME_VAL   : Factor w/ 1398 levels "","$0","$100,046",..: 2 2 2 2 2 636 468 297 1161 2 ...
##  $ MSTATUS    : Factor w/ 2 levels "Yes","z_No": 2 2 2 2 2 1 1 1 2 2 ...
##  $ SEX        : Factor w/ 2 levels "M","z_F": 1 1 2 1 1 1 2 1 2 2 ...
##  $ EDUCATION  : Factor w/ 5 levels "<High School",..: 2 5 5 5 5 2 5 1 2 4 ...
##  $ JOB        : Factor w/ 9 levels "","Clerical",..: 6 6 9 2 6 7 9 9 9 3 ...
##  $ TRAVTIME   : int  26 21 30 74 45 7 16 27 5 22 ...
##  $ CAR_USE    : Factor w/ 2 levels "Commercial","Private": 2 2 1 2 2 1 1 1 1 2 ...
##  $ BLUEBOOK   : Factor w/ 1417 levels "$1,500","$1,530",..: 703 540 1189 1373 345 864 95 799 913 1075 ...
##  $ TIF        : int  1 6 10 6 1 1 1 4 4 4 ...
##  $ CAR_TYPE   : Factor w/ 6 levels "Minivan","Panel Truck",..: 5 1 6 3 1 2 4 2 1 4 ...
##  $ RED_CAR    : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ OLDCLAIM   : Factor w/ 834 levels "$0","$1,001",..: 1 272 1 1 494 137 1 1 1 1 ...
##  $ CLM_FREQ   : int  0 1 0 0 2 1 0 0 0 0 ...
##  $ REVOKED    : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ MVR_PTS    : int  2 2 0 0 4 2 0 5 0 3 ...
##  $ CAR_AGE    : int  10 1 10 4 1 12 1 NA 9 1 ...
##  $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 2 2 1 1 1 1 2 1 ...
evaluation$INDEX <- NULL
evaluation$TARGET_AMT <- 0
evaluation$TARGET_FLAG <- 0
str(evaluation)
## 'data.frame':    2141 obs. of  25 variables:
##  $ TARGET_FLAG: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 1 0 0 0 0 0 0 2 0 ...
##  $ AGE        : int  48 40 44 35 59 46 60 54 36 50 ...
##  $ HOMEKIDS   : int  0 1 2 2 0 0 0 0 2 0 ...
##  $ YOJ        : int  11 11 12 NA 12 14 12 12 12 8 ...
##  $ INCOME     : Factor w/ 1804 levels "","$0","$1,249",..: 1154 1119 974 513 1686 1 863 766 233 387 ...
##  $ PARENT1    : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 1 ...
##  $ HOME_VAL   : Factor w/ 1398 levels "","$0","$100,046",..: 2 2 2 2 2 636 468 297 1161 2 ...
##  $ MSTATUS    : Factor w/ 2 levels "Yes","z_No": 2 2 2 2 2 1 1 1 2 2 ...
##  $ SEX        : Factor w/ 2 levels "M","z_F": 1 1 2 1 1 1 2 1 2 2 ...
##  $ EDUCATION  : Factor w/ 5 levels "<High School",..: 2 5 5 5 5 2 5 1 2 4 ...
##  $ JOB        : Factor w/ 9 levels "","Clerical",..: 6 6 9 2 6 7 9 9 9 3 ...
##  $ TRAVTIME   : int  26 21 30 74 45 7 16 27 5 22 ...
##  $ CAR_USE    : Factor w/ 2 levels "Commercial","Private": 2 2 1 2 2 1 1 1 1 2 ...
##  $ BLUEBOOK   : Factor w/ 1417 levels "$1,500","$1,530",..: 703 540 1189 1373 345 864 95 799 913 1075 ...
##  $ TIF        : int  1 6 10 6 1 1 1 4 4 4 ...
##  $ CAR_TYPE   : Factor w/ 6 levels "Minivan","Panel Truck",..: 5 1 6 3 1 2 4 2 1 4 ...
##  $ RED_CAR    : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ OLDCLAIM   : Factor w/ 834 levels "$0","$1,001",..: 1 272 1 1 494 137 1 1 1 1 ...
##  $ CLM_FREQ   : int  0 1 0 0 2 1 0 0 0 0 ...
##  $ REVOKED    : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ MVR_PTS    : int  2 2 0 0 4 2 0 5 0 3 ...
##  $ CAR_AGE    : int  10 1 10 4 1 12 1 NA 9 1 ...
##  $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 2 2 1 1 1 1 2 1 ...
#remove "$"
money = function(input) {
  out = sub("\\$", "", input)
  out = as.numeric(sub(",", "", out))
  return(out)
}

# Remove " ", replace with "_"
underscore = function(input) {
  out = sub(" ", "_", input)
  return(out)
}

evaluation = as.tbl(evaluation) %>% 
  mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM"),
            money) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            underscore) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            as.factor) %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG))
evaluation$AGE[is.na(evaluation$AGE)] <- mean(evaluation$AGE, na.rm=TRUE)
evaluation$YOJ[is.na(evaluation$YOJ)] <- mean(evaluation$YOJ, na.rm=TRUE)
evaluation$HOME_VAL[is.na(evaluation$HOME_VAL)] <- mean(evaluation$HOME_VAL, na.rm=TRUE)
evaluation$CAR_AGE[is.na(evaluation$CAR_AGE)] <- mean(evaluation$CAR_AGE, na.rm=TRUE)
evaluation$INCOME[is.na(evaluation$INCOME)] <- mean(evaluation$INCOME, na.rm=TRUE)
evaluation <- evaluation[complete.cases(evaluation),]
evaluation$INCOME <- as.numeric(evaluation$INCOME)
evaluation$HOME_VAL <- as.numeric(evaluation$HOME_VAL)
evaluation$BLUEBOOK <- as.numeric(evaluation$BLUEBOOK)
evaluation$OLDCLAIM <- as.numeric(evaluation$OLDCLAIM)
evaluation$INCOME_MOD <- evaluation$INCOME ^0.433
evaluation$HOME_VAL_MOD <- evaluation$HOME_VAL^0.102
evaluation$BLUEBOOK_MOD <- evaluation$BLUEBOOK^0.461
evaluation$OLD_CLAIM_MOD <- log(evaluation$OLDCLAIM + 1) 
evaluation$PARENT1 <- ifelse(evaluation$PARENT1=="Yes", 1, 0)
evaluation$MSTATUS <- ifelse(evaluation$MSTATUS=="Yes", 1, 0)
evaluation$SEX <- ifelse(evaluation$SEX=="M", 1, 0)
evaluation$CAR_USE <- ifelse(evaluation$CAR_USE=="Commercial", 1, 0)
evaluation$RED_CAR <- ifelse(evaluation$RED_CAR=="Yes", 1, 0)
evaluation$REVOKED <- ifelse(evaluation$REVOKED=="Yes", 1, 0)
evaluation$URBANICITY <- ifelse(evaluation$URBANICITY == "Highly Urban/ Urban", 1, 0)


evaluation$HSDropout <- ifelse(evaluation$EDUCATION=="<High School", 1, 0)
evaluation$Bachelors <- ifelse(evaluation$EDUCATION=="Bachelors", 1, 0)
evaluation$Masters <- ifelse(evaluation$EDUCATION=="Masters", 1, 0)
evaluation$PhD <- ifelse(evaluation$EDUCATION=="PhD", 1, 0)


evaluation$Panel_Truck <- ifelse(evaluation$CAR_TYPE=="Panel Truck", 1, 0)
evaluation$Pickup <- ifelse(evaluation$CAR_TYPE=="Pickup", 1, 0)
evaluation$Sports_Car <- ifelse(evaluation$CAR_TYPE=="Sports Car", 1, 0)
evaluation$Van <- ifelse(evaluation$CAR_TYPE=="Van", 1, 0)
evaluation$SUV <- ifelse(evaluation$CAR_TYPE=="z_SUV", 1, 0)


evaluation$Professional <- ifelse(evaluation$JOB == "Professional", 1, 0)
evaluation$Blue_Collar <- ifelse(evaluation$JOB == "Professional", 1, 0)
evaluation$Clerical <- ifelse(evaluation$JOB == "Clerical", 1, 0)
evaluation$Doctor <- ifelse(evaluation$JOB == "Doctor", 1, 0)
evaluation$Lawyer <- ifelse(evaluation$JOB == "Lawyer", 1, 0)
evaluation$Manager <- ifelse(evaluation$JOB == "Manager", 1, 0)
evaluation$Home_Maker <- ifelse(evaluation$JOB == "Home Maker", 1, 0)
evaluation$Student <- ifelse(evaluation$JOB == "Student", 1, 0)
evaluation_mod <- evaluation
write.csv(evaluation_mod, file="evaluation_mod.csv")

this step is to save the transformed dataset of evluation for TARGET_ATM analysis use.

TARGET_FLAG <- predict(reduced_model, newdata = evaluation, type="response")

y_pred_num <- ifelse(TARGET_FLAG > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels=c(0, 1))
summary(y_pred)
##    0    1 
## 1880  261