Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.
Overview In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero. Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas. a. Mean / Standard Deviation / Median b. Bar Chart or Box Plot of the data c. Is the data correlated to the target variable (or to other variables?) d. Are any of the variables missing and need to be imputed “fixed”?
train <- read.csv("https://raw.githubusercontent.com/jjohn81/DATA621_Assignment4/master/insurance_training.csv")
check the dimension of the dataset
dim(train)
## [1] 8161 26
Databases were loaded from github repository: https://github.com/jjohn81/DATA621_Assignment4 In train dataset, there were 8161 observers and 26 variables. In evaluation dataset, there were 2141 observers and 26 variables.
Summarize dataset into median, quantile, min, max
summary(train)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 $0 : 615
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 : 445
## Median :45.00 Median :0.0000 Median :11.0 $26,840 : 4
## Mean :44.79 Mean :0.7212 Mean :10.5 $48,509 : 4
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 $61,790 : 4
## Max. :81.00 Max. :5.0000 Max. :23.0 $107,375: 3
## NA's :6 NA's :454 (Other) :7086
## PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## No :7084 $0 :2294 Yes :4894 M :3786 <High School :1203
## Yes:1077 : 464 z_No:3267 z_F:4375 Bachelors :2242
## $111,129: 3 Masters :1658
## $115,249: 3 PhD : 728
## $123,109: 3 z_High School:2330
## $153,061: 3
## (Other) :5391
## JOB TRAVTIME CAR_USE BLUEBOOK
## z_Blue Collar:1825 Min. : 5.00 Commercial:3029 $1,500 : 157
## Clerical :1271 1st Qu.: 22.00 Private :5132 $6,000 : 34
## Professional :1117 Median : 33.00 $5,800 : 33
## Manager : 988 Mean : 33.49 $6,200 : 33
## Lawyer : 835 3rd Qu.: 44.00 $6,400 : 31
## Student : 712 Max. :142.00 $5,900 : 30
## (Other) :1413 (Other):7843
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 no :5783 $0 :5009
## 1st Qu.: 1.000 Panel Truck: 676 yes:2378 $1,310 : 4
## Median : 4.000 Pickup :1389 $1,391 : 4
## Mean : 5.351 Sports Car : 907 $4,263 : 4
## 3rd Qu.: 7.000 Van : 750 $1,105 : 3
## Max. :25.000 z_SUV :2294 $1,332 : 3
## (Other):3134
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 No :7161 Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.328
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly Urban/ Urban :6492
## z_Highly Rural/ Rural:1669
##
##
##
##
##
Check the structure of dataset
str(train)
## 'data.frame': 8161 obs. of 26 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : Factor w/ 6613 levels "","$0","$1,007",..: 5033 6292 1250 1 509 746 1488 315 4765 282 ...
## $ PARENT1 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ HOME_VAL : Factor w/ 5107 levels "","$0","$100,093",..: 2 3259 348 3917 3034 2 1 4167 2 2 ...
## $ MSTATUS : Factor w/ 2 levels "Yes","z_No": 2 2 1 1 1 2 1 1 2 2 ...
## $ SEX : Factor w/ 2 levels "M","z_F": 1 1 2 1 2 2 2 1 2 1 ...
## $ EDUCATION : Factor w/ 5 levels "<High School",..: 4 5 5 1 4 2 1 2 2 2 ...
## $ JOB : Factor w/ 9 levels "","Clerical",..: 7 9 2 9 3 9 9 9 2 7 ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
## $ BLUEBOOK : Factor w/ 2789 levels "$1,500","$1,520",..: 434 503 2212 553 802 746 2672 701 135 852 ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
## $ RED_CAR : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 2 1 1 ...
## $ OLDCLAIM : Factor w/ 2857 levels "$0","$1,000",..: 1449 1 1311 1 432 1 1 510 1 1 ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 1 1 1 1 1 1 1 2 ...
visualize missing data
vis_miss(train)
Detailed missing data information
sapply(train, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
x | |
---|---|
INDEX | 0 |
TARGET_FLAG | 0 |
TARGET_AMT | 0 |
KIDSDRIV | 0 |
AGE | 6 |
HOMEKIDS | 0 |
YOJ | 454 |
INCOME | 0 |
PARENT1 | 0 |
HOME_VAL | 0 |
MSTATUS | 0 |
SEX | 0 |
EDUCATION | 0 |
JOB | 0 |
TRAVTIME | 0 |
CAR_USE | 0 |
BLUEBOOK | 0 |
TIF | 0 |
CAR_TYPE | 0 |
RED_CAR | 0 |
OLDCLAIM | 0 |
CLM_FREQ | 0 |
REVOKED | 0 |
MVR_PTS | 0 |
CAR_AGE | 510 |
URBANICITY | 0 |
So there are 6 missing data in Age, 454 missing data of YOK, 445 missing data of income, 464 missing data of home value.
Data transformation
#remove "$"
money = function(input) {
out = sub("\\$", "", input)
out = as.numeric(sub(",", "", out))
return(out)
}
# Remove " ", replace with "_"
underscore = function(input) {
out = sub(" ", "_", input)
return(out)
}
train = as.tbl(train) %>%
mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM"),
money) %>%
mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
underscore) %>%
mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
as.factor) %>%
mutate(TARGET_FLAG = as.factor(TARGET_FLAG))
1: remove $ sign from INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM; 2: replace " " with underscore “_" of variables : EDUCATION, JOB, CAR_TYPE, URBANICITY 3: change it as factors for above variabls plus TARGET_FLA
`
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
Impute missing data with column means
train$AGE[is.na(train$AGE)] <- mean(train$AGE, na.rm=TRUE)
train$YOJ[is.na(train$YOJ)] <- mean(train$YOJ, na.rm=TRUE)
train$HOME_VAL[is.na(train$HOME_VAL)] <- mean(train$HOME_VAL, na.rm=TRUE)
train$CAR_AGE[is.na(train$CAR_AGE)] <- mean(train$CAR_AGE, na.rm=TRUE)
train$INCOME[is.na(train$INCOME)] <- mean(train$INCOME, na.rm=TRUE)
train <- train[complete.cases(train),]
vis_miss(train)
So far, missing data have been fixed.
Save the completed data to train_clean
train_clean <- train
Remove INDEX columns
train_clean$INDEX <- NULL
#Convert indicator variables to 0s and 1s; 1 = Yes, Male for Sex, Commercial for Car Use, Red for RED_CAR, and Highly Urban for URBANICITY
train_clean$PARENT1 <- ifelse(train_clean$PARENT1=="Yes", 1, 0)
train_clean$MSTATUS <- ifelse(train_clean$MSTATUS=="Yes", 1, 0)
train_clean$SEX <- ifelse(train_clean$SEX=="M", 1, 0)
train_clean$CAR_USE <- ifelse(train_clean$CAR_USE=="Commercial", 1, 0)
train_clean$RED_CAR <- ifelse(train_clean$RED_CAR=="Yes", 1, 0)
train_clean$REVOKED <- ifelse(train_clean$REVOKED=="Yes", 1, 0)
train_clean$URBANICITY <- ifelse(train_clean$URBANICITY == "Highly Urban/ Urban", 1, 0)
#Convert categorical predictor values to indicator variables - EDUCATION, CAR_TYPE, JOB
#EDUCATION, High school graduate is base case
train_clean$HSDropout <- ifelse(train_clean$EDUCATION=="<High School", 1, 0)
train_clean$Bachelors <- ifelse(train_clean$EDUCATION=="Bachelors", 1, 0)
train_clean$Masters <- ifelse(train_clean$EDUCATION=="Masters", 1, 0)
train_clean$PhD <- ifelse(train_clean$EDUCATION=="PhD", 1, 0)
#CAR_TYPE, base case is minivan
train_clean$Panel_Truck <- ifelse(train_clean$CAR_TYPE=="Panel Truck", 1, 0)
train_clean$Pickup <- ifelse(train_clean$CAR_TYPE=="Pickup", 1, 0)
train_clean$Sports_Car <- ifelse(train_clean$CAR_TYPE=="Sports Car", 1, 0)
train_clean$Van <- ifelse(train_clean$CAR_TYPE=="Van", 1, 0)
train_clean$SUV <- ifelse(train_clean$CAR_TYPE=="z_SUV", 1, 0)
#JOB, base case is ""
train_clean$Professional <- ifelse(train_clean$JOB == "Professional", 1, 0)
train_clean$Blue_Collar <- ifelse(train_clean$JOB == "Professional", 1, 0)
train_clean$Clerical <- ifelse(train_clean$JOB == "Clerical", 1, 0)
train_clean$Doctor <- ifelse(train_clean$JOB == "Doctor", 1, 0)
train_clean$Lawyer <- ifelse(train_clean$JOB == "Lawyer", 1, 0)
train_clean$Manager <- ifelse(train_clean$JOB == "Manager", 1, 0)
train_clean$Home_Maker <- ifelse(train_clean$JOB == "Home Maker", 1, 0)
train_clean$Student <- ifelse(train_clean$JOB == "Student", 1, 0)
INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM are represented as strings. So we will be extracting the numeric values for these.
train_clean$INCOME <- as.numeric(train_clean$INCOME)
train_clean$HOME_VAL <- as.numeric(train_clean$HOME_VAL)
train_clean$BLUEBOOK <- as.numeric(train_clean$BLUEBOOK)
train_clean$OLDCLAIM <- as.numeric(train_clean$OLDCLAIM)
str(train_clean)
## Classes 'tbl_df', 'tbl' and 'data.frame': 8161 obs. of 42 variables:
## $ TARGET_FLAG : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : num 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : num 11 11 10 14 10.5 ...
## $ INCOME : num 67349 91449 16039 61898 114986 ...
## $ PARENT1 : num 0 0 0 0 0 1 0 0 0 0 ...
## $ HOME_VAL : num 0 257252 124191 306251 243925 ...
## $ MSTATUS : num 0 0 1 1 1 0 1 1 0 0 ...
## $ SEX : num 1 1 0 1 0 0 0 1 0 1 ...
## $ EDUCATION : Factor w/ 5 levels "<High_School",..: 4 5 5 1 4 2 1 2 2 2 ...
## $ JOB : Factor w/ 9 levels "","Clerical",..: 7 9 2 9 3 9 9 9 2 7 ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : num 0 1 0 0 0 1 0 1 0 1 ...
## $ BLUEBOOK : num 14230 14940 4010 15440 18000 ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : Factor w/ 6 levels "Minivan","Panel_Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
## $ RED_CAR : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OLDCLAIM : num 4461 0 38690 0 19217 ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : num 0 0 0 0 1 0 0 1 0 0 ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : num 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HSDropout : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Bachelors : num 0 0 0 0 0 1 0 1 1 1 ...
## $ Masters : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PhD : num 1 0 0 0 1 0 0 0 0 0 ...
## $ Panel_Truck : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Pickup : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sports_Car : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Van : num 0 0 0 0 0 0 0 1 0 1 ...
## $ SUV : num 0 0 1 0 1 0 1 0 1 0 ...
## $ Professional: num 1 0 0 0 0 0 0 0 0 1 ...
## $ Blue_Collar : num 1 0 0 0 0 0 0 0 0 1 ...
## $ Clerical : num 0 0 1 0 0 0 0 0 1 0 ...
## $ Doctor : num 0 0 0 0 1 0 0 0 0 0 ...
## $ Lawyer : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Manager : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Home_Maker : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Student : num 0 0 0 0 0 0 0 0 0 0 ...
write.csv(train_clean, file="train_clean.csv")
getwd()
## [1] "C:/Users/tbao/Desktop/CUNY MSDS notes/DATA 621/HW4"
Select numberical variable togeter
ntrain<-select_if(train_clean, is.numeric)
Density plot of variables
ntrain <- as.data.frame((ntrain))
par(mfrow=c(3, 3))
colnames <- dimnames(ntrain)[[2]]
for(col in 2:ncol(train)) {
d <- density(na.omit(ntrain[,col]))
#d <- qqnorm(na.omit(train[,col]))
plot(d, type="n", main=colnames[col])
polygon(d, col="blue", border="gray")
}
From the density plot, we can find INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM are left skewed. We are going to do transformation.
boxcoxfit(train_clean$INCOME[train_clean$INCOME >0])
## Fitted parameters:
## lambda beta sigmasq
## 0.4436021 291.9569344 8541.2268213
##
## Convergence code returned by optim: 0
train_clean$INCOME_MOD <- train_clean$INCOME ^0.443
boxcoxfit(train_clean$HOME_VAL[train_clean$HOME_VAL > 0])
## Fitted parameters:
## lambda beta sigmasq
## 0.1016119 24.1545983 2.1701992
##
## Convergence code returned by optim: 0
train_clean$HOME_VAL_MOD <- train_clean$HOME_VAL^0.102
boxcoxfit(train_clean$BLUEBOOK)
## Fitted parameters:
## lambda beta sigmasq
## 0.4610754 177.4257712 2217.4825612
##
## Convergence code returned by optim: 0
train_clean$BLUEBOOK_MOD <- train_clean$BLUEBOOK^0.461
boxcoxfit(train_clean$OLDCLAIM[train_clean$OLDCLAIM>0])
## Fitted parameters:
## lambda beta sigmasq
## -0.04511237 7.22517933 0.44041250
##
## Convergence code returned by optim: 0
train_clean$OLD_CLAIM_MOD <- log(train_clean$OLDCLAIM + 1)
str(train_clean)
## Classes 'tbl_df', 'tbl' and 'data.frame': 8161 obs. of 46 variables:
## $ TARGET_FLAG : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : num 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : num 11 11 10 14 10.5 ...
## $ INCOME : num 67349 91449 16039 61898 114986 ...
## $ PARENT1 : num 0 0 0 0 0 1 0 0 0 0 ...
## $ HOME_VAL : num 0 257252 124191 306251 243925 ...
## $ MSTATUS : num 0 0 1 1 1 0 1 1 0 0 ...
## $ SEX : num 1 1 0 1 0 0 0 1 0 1 ...
## $ EDUCATION : Factor w/ 5 levels "<High_School",..: 4 5 5 1 4 2 1 2 2 2 ...
## $ JOB : Factor w/ 9 levels "","Clerical",..: 7 9 2 9 3 9 9 9 2 7 ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : num 0 1 0 0 0 1 0 1 0 1 ...
## $ BLUEBOOK : num 14230 14940 4010 15440 18000 ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : Factor w/ 6 levels "Minivan","Panel_Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
## $ RED_CAR : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OLDCLAIM : num 4461 0 38690 0 19217 ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : num 0 0 0 0 1 0 0 1 0 0 ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : num 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HSDropout : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Bachelors : num 0 0 0 0 0 1 0 1 1 1 ...
## $ Masters : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PhD : num 1 0 0 0 1 0 0 0 0 0 ...
## $ Panel_Truck : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Pickup : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sports_Car : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Van : num 0 0 0 0 0 0 0 1 0 1 ...
## $ SUV : num 0 0 1 0 1 0 1 0 1 0 ...
## $ Professional : num 1 0 0 0 0 0 0 0 0 1 ...
## $ Blue_Collar : num 1 0 0 0 0 0 0 0 0 1 ...
## $ Clerical : num 0 0 1 0 0 0 0 0 1 0 ...
## $ Doctor : num 0 0 0 0 1 0 0 0 0 0 ...
## $ Lawyer : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Manager : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Home_Maker : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Student : num 0 0 0 0 0 0 0 0 0 0 ...
## $ INCOME_MOD : num 137.7 157.7 72.9 132.7 174.5 ...
## $ HOME_VAL_MOD : num 0 3.56 3.31 3.63 3.54 ...
## $ BLUEBOOK_MOD : num 82.2 84 45.8 85.3 91.6 ...
## $ OLD_CLAIM_MOD: num 8.4 0 10.56 0 9.86 ...
pairs(~MVR_PTS+CLM_FREQ+URBANICITY+HOME_VAL+PARENT1+CAR_USE+OLDCLAIM, data=train_clean, main="Predictors with High Correlattions to Targets", col="slategrey")
train_mod <- train_clean
write.csv(train_mod, file="train_mod.csv")
Using the training data set, build at least two different multiple linear regression models and three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.
Discuss the coefficients in the models, do they make sense? For example, if a person has a lot of traffic tickets, you would reasonably expect that person to have more car crashes. If the coefficient is negative (suggesting that the person is a safer driver), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.
head(train_clean)
## # A tibble: 6 x 46
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## <fct> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0 0 0 60 0 11 6.73e4 0
## 2 0 0 0 43 0 11 9.14e4 0
## 3 0 0 0 35 1 10 1.60e4 0
## 4 0 0 0 51 0 14 6.19e4 0
## 5 0 0 0 50 0 10.5 1.15e5 0
## 6 1 2946 0 34 1 12 1.25e5 1
## # ... with 38 more variables: HOME_VAL <dbl>, MSTATUS <dbl>, SEX <dbl>,
## # EDUCATION <fct>, JOB <fct>, TRAVTIME <int>, CAR_USE <dbl>,
## # BLUEBOOK <dbl>, TIF <int>, CAR_TYPE <fct>, RED_CAR <dbl>,
## # OLDCLAIM <dbl>, CLM_FREQ <int>, REVOKED <dbl>, MVR_PTS <int>,
## # CAR_AGE <dbl>, URBANICITY <dbl>, HSDropout <dbl>, Bachelors <dbl>,
## # Masters <dbl>, PhD <dbl>, Panel_Truck <dbl>, Pickup <dbl>,
## # Sports_Car <dbl>, Van <dbl>, SUV <dbl>, Professional <dbl>,
## # Blue_Collar <dbl>, Clerical <dbl>, Doctor <dbl>, Lawyer <dbl>,
## # Manager <dbl>, Home_Maker <dbl>, Student <dbl>, INCOME_MOD <dbl>,
## # HOME_VAL_MOD <dbl>, BLUEBOOK_MOD <dbl>, OLD_CLAIM_MOD <dbl>
train_flag <- train_clean[,-c(2)]
original_full_model <- glm(TARGET_FLAG ~.-INCOME_MOD-HOME_VAL_MOD-BLUEBOOK_MOD-OLD_CLAIM_MOD, data = train_flag, family = binomial(link='logit'))
summary(original_full_model)
##
## Call:
## glm(formula = TARGET_FLAG ~ . - INCOME_MOD - HOME_VAL_MOD - BLUEBOOK_MOD -
## OLD_CLAIM_MOD, family = binomial(link = "logit"), data = train_flag)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2562 -0.7497 -0.5050 0.7314 2.6763
##
## Coefficients: (19 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.308e+00 3.015e-01 -4.338 1.44e-05 ***
## KIDSDRIV 3.016e-01 5.746e-02 5.249 1.53e-07 ***
## AGE -3.570e-03 3.843e-03 -0.929 0.352925
## HOMEKIDS 3.792e-02 3.527e-02 1.075 0.282307
## YOJ -6.714e-03 8.111e-03 -0.828 0.407792
## INCOME -3.076e-06 1.051e-06 -2.927 0.003423 **
## PARENT1 3.208e-01 1.039e-01 3.087 0.002022 **
## HOME_VAL -1.312e-06 3.318e-07 -3.952 7.75e-05 ***
## MSTATUS -3.933e-01 7.994e-02 -4.920 8.64e-07 ***
## SEX 9.232e-02 9.697e-02 0.952 0.341098
## EDUCATIONBachelors -2.923e-01 1.091e-01 -2.679 0.007385 **
## EDUCATIONMasters -1.709e-01 1.717e-01 -0.995 0.319626
## EDUCATIONPhD -8.169e-02 2.062e-01 -0.396 0.691982
## EDUCATIONz_High_School 2.895e-02 8.851e-02 0.327 0.743612
## JOBClerical -2.363e-03 1.914e-01 -0.012 0.990150
## JOBDoctor -4.431e-01 2.671e-01 -1.659 0.097109 .
## JOBHome_Maker -1.280e-01 2.037e-01 -0.629 0.529619
## JOBLawyer 3.370e-02 1.672e-01 0.202 0.840275
## JOBManager -5.048e-01 1.703e-01 -2.963 0.003044 **
## JOBProfessional 2.300e-02 1.757e-01 0.131 0.895878
## JOBStudent -1.899e-01 2.080e-01 -0.913 0.361214
## JOBz_Blue_Collar 1.168e-01 1.816e-01 0.643 0.520368
## TRAVTIME 6.586e-03 1.733e-03 3.801 0.000144 ***
## CAR_USE 6.247e-01 8.685e-02 7.192 6.37e-13 ***
## BLUEBOOK -1.755e-05 5.055e-06 -3.473 0.000515 ***
## TIF -5.014e-02 7.046e-03 -7.116 1.11e-12 ***
## CAR_TYPEPanel_Truck 5.257e-01 1.557e-01 3.376 0.000736 ***
## CAR_TYPEPickup 5.150e-01 9.663e-02 5.330 9.84e-08 ***
## CAR_TYPESports_Car 8.987e-01 1.249e-01 7.193 6.36e-13 ***
## CAR_TYPEVan 5.612e-01 1.220e-01 4.598 4.26e-06 ***
## CAR_TYPEz_SUV 6.972e-01 1.073e-01 6.499 8.10e-11 ***
## RED_CAR NA NA NA NA
## OLDCLAIM -1.327e-05 3.789e-06 -3.502 0.000461 ***
## CLM_FREQ 3.127e-01 2.730e-02 11.456 < 2e-16 ***
## REVOKED 9.795e-01 8.742e-02 11.204 < 2e-16 ***
## MVR_PTS 1.354e-01 1.308e-02 10.352 < 2e-16 ***
## CAR_AGE -1.086e-03 7.232e-03 -0.150 0.880688
## URBANICITY NA NA NA NA
## HSDropout NA NA NA NA
## Bachelors NA NA NA NA
## Masters NA NA NA NA
## PhD NA NA NA NA
## Panel_Truck NA NA NA NA
## Pickup NA NA NA NA
## Sports_Car NA NA NA NA
## Van NA NA NA NA
## SUV NA NA NA NA
## Professional NA NA NA NA
## Blue_Collar NA NA NA NA
## Clerical NA NA NA NA
## Doctor NA NA NA NA
## Lawyer NA NA NA NA
## Manager NA NA NA NA
## Home_Maker NA NA NA NA
## Student NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7946.3 on 8125 degrees of freedom
## AIC: 8018.3
##
## Number of Fisher Scoring iterations: 5
train_flag <- train_clean[,-c(2)]
transform_model <- glm(TARGET_FLAG ~., data = train_flag, family = binomial(link='logit'))
summary(transform_model)
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"),
## data = train_flag)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3309 -0.7444 -0.4913 0.7536 2.7137
##
## Coefficients: (19 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.061e-01 4.148e-01 -0.497 0.61924
## KIDSDRIV 3.124e-01 5.794e-02 5.391 7.00e-08 ***
## AGE -4.433e-03 3.873e-03 -1.145 0.25236
## HOMEKIDS 2.162e-02 3.579e-02 0.604 0.54587
## YOJ 7.510e-03 9.406e-03 0.798 0.42463
## INCOME 2.271e-06 2.356e-06 0.964 0.33513
## PARENT1 3.108e-01 1.047e-01 2.968 0.00300 **
## HOME_VAL -5.229e-07 6.301e-07 -0.830 0.40657
## MSTATUS -3.816e-01 8.312e-02 -4.591 4.42e-06 ***
## SEX 1.248e-01 9.772e-02 1.277 0.20159
## EDUCATIONBachelors -2.448e-01 1.110e-01 -2.206 0.02735 *
## EDUCATIONMasters -1.121e-01 1.733e-01 -0.647 0.51771
## EDUCATIONPhD -1.183e-01 2.083e-01 -0.568 0.56994
## EDUCATIONz_High_School 6.719e-02 8.987e-02 0.748 0.45467
## JOBClerical 2.629e-02 1.921e-01 0.137 0.89115
## JOBDoctor -3.884e-01 2.656e-01 -1.462 0.14362
## JOBHome_Maker -3.070e-01 2.158e-01 -1.422 0.15495
## JOBLawyer 4.006e-02 1.677e-01 0.239 0.81121
## JOBManager -4.830e-01 1.702e-01 -2.838 0.00454 **
## JOBProfessional 6.366e-02 1.760e-01 0.362 0.71763
## JOBStudent -4.582e-01 2.242e-01 -2.044 0.04094 *
## JOBz_Blue_Collar 1.861e-01 1.824e-01 1.020 0.30757
## TRAVTIME 7.001e-03 1.747e-03 4.008 6.12e-05 ***
## CAR_USE 6.170e-01 8.748e-02 7.053 1.75e-12 ***
## BLUEBOOK 3.753e-05 1.900e-05 1.975 0.04825 *
## TIF -4.941e-02 7.078e-03 -6.981 2.93e-12 ***
## CAR_TYPEPanel_Truck 3.919e-01 1.601e-01 2.448 0.01436 *
## CAR_TYPEPickup 5.215e-01 9.710e-02 5.371 7.84e-08 ***
## CAR_TYPESports_Car 8.933e-01 1.257e-01 7.109 1.17e-12 ***
## CAR_TYPEVan 5.659e-01 1.224e-01 4.625 3.74e-06 ***
## CAR_TYPEz_SUV 7.178e-01 1.085e-01 6.617 3.66e-11 ***
## RED_CAR NA NA NA NA
## OLDCLAIM -3.373e-05 4.655e-06 -7.245 4.32e-13 ***
## CLM_FREQ 5.348e-02 4.274e-02 1.251 0.21077
## REVOKED 1.087e+00 8.931e-02 12.176 < 2e-16 ***
## MVR_PTS 1.064e-01 1.356e-02 7.847 4.25e-15 ***
## CAR_AGE -7.413e-04 7.264e-03 -0.102 0.91872
## URBANICITY NA NA NA NA
## HSDropout NA NA NA NA
## Bachelors NA NA NA NA
## Masters NA NA NA NA
## PhD NA NA NA NA
## Panel_Truck NA NA NA NA
## Pickup NA NA NA NA
## Sports_Car NA NA NA NA
## Van NA NA NA NA
## SUV NA NA NA NA
## Professional NA NA NA NA
## Blue_Collar NA NA NA NA
## Clerical NA NA NA NA
## Doctor NA NA NA NA
## Lawyer NA NA NA NA
## Manager NA NA NA NA
## Home_Maker NA NA NA NA
## Student NA NA NA NA
## INCOME_MOD -6.794e-03 2.218e-03 -3.063 0.00219 **
## HOME_VAL_MOD -6.199e-02 4.413e-02 -1.405 0.16004
## BLUEBOOK_MOD -2.052e-02 6.847e-03 -2.996 0.00273 **
## OLD_CLAIM_MOD 1.164e-01 1.459e-02 7.978 1.49e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7863.3 on 8121 degrees of freedom
## AIC: 7943.3
##
## Number of Fisher Scoring iterations: 5
train_flag <- train_clean[,-c(2)]
reduced_model <- glm(TARGET_FLAG ~.-AGE-HOMEKIDS-YOJ-INCOME-PARENT1-HOME_VAL-MSTATUS-SEX-RED_CAR-CLM_FREQ-CAR_AGE-HSDropout-Professional-Blue_Collar-Clerical-Lawyer-Home_Maker-HOME_VAL_MOD-Student-Doctor-CAR_USE-REVOKED-URBANICITY-Bachelors-Masters-PhD-Panel_Truck-Pickup-Sports_Car-Van-SUV-Manager, data = train_flag, family = binomial(link='logit'))
summary(reduced_model)
##
## Call:
## glm(formula = TARGET_FLAG ~ . - AGE - HOMEKIDS - YOJ - INCOME -
## PARENT1 - HOME_VAL - MSTATUS - SEX - RED_CAR - CLM_FREQ -
## CAR_AGE - HSDropout - Professional - Blue_Collar - Clerical -
## Lawyer - Home_Maker - HOME_VAL_MOD - Student - Doctor - CAR_USE -
## REVOKED - URBANICITY - Bachelors - Masters - PhD - Panel_Truck -
## Pickup - Sports_Car - Van - SUV - Manager, family = binomial(link = "logit"),
## data = train_flag)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9313 -0.7729 -0.5421 0.8900 2.4980
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.193e-01 3.601e-01 -1.164 0.244243
## KIDSDRIV 3.607e-01 4.850e-02 7.438 1.02e-13 ***
## EDUCATIONBachelors -1.028e-02 9.662e-02 -0.106 0.915242
## EDUCATIONMasters 5.759e-02 1.495e-01 0.385 0.700140
## EDUCATIONPhD 1.153e-01 1.827e-01 0.631 0.527973
## EDUCATIONz_High_School 2.454e-01 8.257e-02 2.972 0.002958 **
## JOBClerical -1.242e-01 1.855e-01 -0.669 0.503205
## JOBDoctor -6.858e-01 2.553e-01 -2.686 0.007230 **
## JOBHome_Maker -5.750e-01 2.034e-01 -2.827 0.004697 **
## JOBLawyer -2.453e-01 1.577e-01 -1.556 0.119767
## JOBManager -7.001e-01 1.648e-01 -4.247 2.17e-05 ***
## JOBProfessional -1.546e-01 1.697e-01 -0.911 0.362176
## JOBStudent -2.315e-01 2.096e-01 -1.104 0.269477
## JOBz_Blue_Collar 3.197e-01 1.760e-01 1.816 0.069394 .
## TRAVTIME 6.131e-03 1.695e-03 3.617 0.000299 ***
## BLUEBOOK 3.374e-05 1.782e-05 1.893 0.058347 .
## TIF -4.774e-02 6.858e-03 -6.961 3.39e-12 ***
## CAR_TYPEPanel_Truck 8.404e-01 1.359e-01 6.185 6.19e-10 ***
## CAR_TYPEPickup 7.250e-01 8.877e-02 8.167 3.17e-16 ***
## CAR_TYPESports_Car 7.555e-01 1.006e-01 7.512 5.82e-14 ***
## CAR_TYPEVan 8.006e-01 1.114e-01 7.188 6.59e-13 ***
## CAR_TYPEz_SUV 6.330e-01 8.070e-02 7.844 4.35e-15 ***
## OLDCLAIM -7.858e-06 3.724e-06 -2.110 0.034837 *
## MVR_PTS 1.143e-01 1.314e-02 8.699 < 2e-16 ***
## INCOME_MOD -5.370e-03 8.929e-04 -6.014 1.80e-09 ***
## BLUEBOOK_MOD -2.003e-02 6.540e-03 -3.063 0.002189 **
## OLD_CLAIM_MOD 1.038e-01 8.744e-03 11.868 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418 on 8160 degrees of freedom
## Residual deviance: 8263 on 8134 degrees of freedom
## AIC: 8317
##
## Number of Fisher Scoring iterations: 4
Decide on the criteria for selecting the best multiple linear regression model and the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.
For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots. For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.
We would like to select Reduced Model for Binary Logistic Regression models. The AIC and residual deviance for this model seemed to give the best values that would be suited for the prediction. Below is the ROC curve for model5 and to me it looks good. So i would like to proceed with Reduced Model.
train_flag$predict <- predict(reduced_model, train_flag, type='response')
roc_reduced_model <- roc(train_flag$TARGET_FLAG, train_flag$predict, plot=T, asp=NA,
legacy.axes=T, main = "ROC Curve Reduced Model", col="red")
roc_reduced_model["auc"]
## $auc
## Area under the curve: 0.7364
Build confusion matrix
train_flag$predict_target <- ifelse(train_flag$predict >=0.5, 1, 0)
train_flag$predict_target <- as.integer(train_flag$predict_target)
myvars <- c("TARGET_FLAG", "predict_target")
train_flag_cm <- train_flag[myvars]
cm <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
knitr:: kable(cm)
0 | 1 | |
---|---|---|
0 | 5635 | 1614 |
1 | 373 | 539 |
Accuracy <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TN=tb[1,1]
TP=tb[2,2]
FN=tb[2,1]
FP=tb[1,2]
return((TP+TN)/(TP+FP+TN+FN))
}
Accuracy(data)
## [1] 0.7565249
CER <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TN=tb[1,1]
TP=tb[2,2]
FN=tb[2,1]
FP=tb[1,2]
return((FP+FN)/(TP+FP+TN+FN))
}
CER(data)
## [1] 0.2434751
Precision <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TP=tb[2,2]
FP=tb[1,2]
return((TP)/(TP+FP))
}
Precision(data)
## [1] 0.2503484
Sensitivity <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TP=tb[2,2]
FN=tb[2,1]
return((TP)/(TP+FN))
}
Sensitivity(data)
## [1] 0.5910088
Specificity <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TN=tb[1,1]
TP=tb[2,2]
FN=tb[2,1]
FP=tb[1,2]
return((TN)/(TN+FP))
}
Specificity(data)
## [1] 0.7773486
F1_score <- function(data) {
tb <- table(train_flag_cm$predict_target,train_flag_cm$TARGET_FLAG)
TN=tb[1,1]
TP=tb[2,2]
FN=tb[2,1]
FP=tb[1,2]
Precision = (TP)/(TP+FP)
Sensitivity = (TP)/(TP+FN)
Precision =(TP)/(TP+FP)
return((2*Precision*Sensitivity)/(Precision+Sensitivity))
}
F1_score(data)
## [1] 0.3517129
evaluation <- read.csv("https://raw.githubusercontent.com/jjohn81/DATA621_Assignment4/master/insurance-evaluation.csv")
evaluation2 <- evaluation
dim(evaluation)
## [1] 2141 26
str(evaluation)
## 'data.frame': 2141 obs. of 26 variables:
## $ INDEX : int 3 9 10 18 21 30 31 37 39 47 ...
## $ TARGET_FLAG: logi NA NA NA NA NA NA ...
## $ TARGET_AMT : logi NA NA NA NA NA NA ...
## $ KIDSDRIV : int 0 1 0 0 0 0 0 0 2 0 ...
## $ AGE : int 48 40 44 35 59 46 60 54 36 50 ...
## $ HOMEKIDS : int 0 1 2 2 0 0 0 0 2 0 ...
## $ YOJ : int 11 11 12 NA 12 14 12 12 12 8 ...
## $ INCOME : Factor w/ 1804 levels "","$0","$1,249",..: 1154 1119 974 513 1686 1 863 766 233 387 ...
## $ PARENT1 : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 1 ...
## $ HOME_VAL : Factor w/ 1398 levels "","$0","$100,046",..: 2 2 2 2 2 636 468 297 1161 2 ...
## $ MSTATUS : Factor w/ 2 levels "Yes","z_No": 2 2 2 2 2 1 1 1 2 2 ...
## $ SEX : Factor w/ 2 levels "M","z_F": 1 1 2 1 1 1 2 1 2 2 ...
## $ EDUCATION : Factor w/ 5 levels "<High School",..: 2 5 5 5 5 2 5 1 2 4 ...
## $ JOB : Factor w/ 9 levels "","Clerical",..: 6 6 9 2 6 7 9 9 9 3 ...
## $ TRAVTIME : int 26 21 30 74 45 7 16 27 5 22 ...
## $ CAR_USE : Factor w/ 2 levels "Commercial","Private": 2 2 1 2 2 1 1 1 1 2 ...
## $ BLUEBOOK : Factor w/ 1417 levels "$1,500","$1,530",..: 703 540 1189 1373 345 864 95 799 913 1075 ...
## $ TIF : int 1 6 10 6 1 1 1 4 4 4 ...
## $ CAR_TYPE : Factor w/ 6 levels "Minivan","Panel Truck",..: 5 1 6 3 1 2 4 2 1 4 ...
## $ RED_CAR : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ OLDCLAIM : Factor w/ 834 levels "$0","$1,001",..: 1 272 1 1 494 137 1 1 1 1 ...
## $ CLM_FREQ : int 0 1 0 0 2 1 0 0 0 0 ...
## $ REVOKED : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ MVR_PTS : int 2 2 0 0 4 2 0 5 0 3 ...
## $ CAR_AGE : int 10 1 10 4 1 12 1 NA 9 1 ...
## $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 2 2 1 1 1 1 2 1 ...
evaluation$INDEX <- NULL
evaluation$TARGET_AMT <- 0
evaluation$TARGET_FLAG <- 0
str(evaluation)
## 'data.frame': 2141 obs. of 25 variables:
## $ TARGET_FLAG: num 0 0 0 0 0 0 0 0 0 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 1 0 0 0 0 0 0 2 0 ...
## $ AGE : int 48 40 44 35 59 46 60 54 36 50 ...
## $ HOMEKIDS : int 0 1 2 2 0 0 0 0 2 0 ...
## $ YOJ : int 11 11 12 NA 12 14 12 12 12 8 ...
## $ INCOME : Factor w/ 1804 levels "","$0","$1,249",..: 1154 1119 974 513 1686 1 863 766 233 387 ...
## $ PARENT1 : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 1 ...
## $ HOME_VAL : Factor w/ 1398 levels "","$0","$100,046",..: 2 2 2 2 2 636 468 297 1161 2 ...
## $ MSTATUS : Factor w/ 2 levels "Yes","z_No": 2 2 2 2 2 1 1 1 2 2 ...
## $ SEX : Factor w/ 2 levels "M","z_F": 1 1 2 1 1 1 2 1 2 2 ...
## $ EDUCATION : Factor w/ 5 levels "<High School",..: 2 5 5 5 5 2 5 1 2 4 ...
## $ JOB : Factor w/ 9 levels "","Clerical",..: 6 6 9 2 6 7 9 9 9 3 ...
## $ TRAVTIME : int 26 21 30 74 45 7 16 27 5 22 ...
## $ CAR_USE : Factor w/ 2 levels "Commercial","Private": 2 2 1 2 2 1 1 1 1 2 ...
## $ BLUEBOOK : Factor w/ 1417 levels "$1,500","$1,530",..: 703 540 1189 1373 345 864 95 799 913 1075 ...
## $ TIF : int 1 6 10 6 1 1 1 4 4 4 ...
## $ CAR_TYPE : Factor w/ 6 levels "Minivan","Panel Truck",..: 5 1 6 3 1 2 4 2 1 4 ...
## $ RED_CAR : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ OLDCLAIM : Factor w/ 834 levels "$0","$1,001",..: 1 272 1 1 494 137 1 1 1 1 ...
## $ CLM_FREQ : int 0 1 0 0 2 1 0 0 0 0 ...
## $ REVOKED : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ MVR_PTS : int 2 2 0 0 4 2 0 5 0 3 ...
## $ CAR_AGE : int 10 1 10 4 1 12 1 NA 9 1 ...
## $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 2 2 1 1 1 1 2 1 ...
#remove "$"
money = function(input) {
out = sub("\\$", "", input)
out = as.numeric(sub(",", "", out))
return(out)
}
# Remove " ", replace with "_"
underscore = function(input) {
out = sub(" ", "_", input)
return(out)
}
evaluation = as.tbl(evaluation) %>%
mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM"),
money) %>%
mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
underscore) %>%
mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
as.factor) %>%
mutate(TARGET_FLAG = as.factor(TARGET_FLAG))
evaluation$AGE[is.na(evaluation$AGE)] <- mean(evaluation$AGE, na.rm=TRUE)
evaluation$YOJ[is.na(evaluation$YOJ)] <- mean(evaluation$YOJ, na.rm=TRUE)
evaluation$HOME_VAL[is.na(evaluation$HOME_VAL)] <- mean(evaluation$HOME_VAL, na.rm=TRUE)
evaluation$CAR_AGE[is.na(evaluation$CAR_AGE)] <- mean(evaluation$CAR_AGE, na.rm=TRUE)
evaluation$INCOME[is.na(evaluation$INCOME)] <- mean(evaluation$INCOME, na.rm=TRUE)
evaluation <- evaluation[complete.cases(evaluation),]
evaluation$INCOME <- as.numeric(evaluation$INCOME)
evaluation$HOME_VAL <- as.numeric(evaluation$HOME_VAL)
evaluation$BLUEBOOK <- as.numeric(evaluation$BLUEBOOK)
evaluation$OLDCLAIM <- as.numeric(evaluation$OLDCLAIM)
evaluation$INCOME_MOD <- evaluation$INCOME ^0.433
evaluation$HOME_VAL_MOD <- evaluation$HOME_VAL^0.102
evaluation$BLUEBOOK_MOD <- evaluation$BLUEBOOK^0.461
evaluation$OLD_CLAIM_MOD <- log(evaluation$OLDCLAIM + 1)
evaluation$PARENT1 <- ifelse(evaluation$PARENT1=="Yes", 1, 0)
evaluation$MSTATUS <- ifelse(evaluation$MSTATUS=="Yes", 1, 0)
evaluation$SEX <- ifelse(evaluation$SEX=="M", 1, 0)
evaluation$CAR_USE <- ifelse(evaluation$CAR_USE=="Commercial", 1, 0)
evaluation$RED_CAR <- ifelse(evaluation$RED_CAR=="Yes", 1, 0)
evaluation$REVOKED <- ifelse(evaluation$REVOKED=="Yes", 1, 0)
evaluation$URBANICITY <- ifelse(evaluation$URBANICITY == "Highly Urban/ Urban", 1, 0)
evaluation$HSDropout <- ifelse(evaluation$EDUCATION=="<High School", 1, 0)
evaluation$Bachelors <- ifelse(evaluation$EDUCATION=="Bachelors", 1, 0)
evaluation$Masters <- ifelse(evaluation$EDUCATION=="Masters", 1, 0)
evaluation$PhD <- ifelse(evaluation$EDUCATION=="PhD", 1, 0)
evaluation$Panel_Truck <- ifelse(evaluation$CAR_TYPE=="Panel Truck", 1, 0)
evaluation$Pickup <- ifelse(evaluation$CAR_TYPE=="Pickup", 1, 0)
evaluation$Sports_Car <- ifelse(evaluation$CAR_TYPE=="Sports Car", 1, 0)
evaluation$Van <- ifelse(evaluation$CAR_TYPE=="Van", 1, 0)
evaluation$SUV <- ifelse(evaluation$CAR_TYPE=="z_SUV", 1, 0)
evaluation$Professional <- ifelse(evaluation$JOB == "Professional", 1, 0)
evaluation$Blue_Collar <- ifelse(evaluation$JOB == "Professional", 1, 0)
evaluation$Clerical <- ifelse(evaluation$JOB == "Clerical", 1, 0)
evaluation$Doctor <- ifelse(evaluation$JOB == "Doctor", 1, 0)
evaluation$Lawyer <- ifelse(evaluation$JOB == "Lawyer", 1, 0)
evaluation$Manager <- ifelse(evaluation$JOB == "Manager", 1, 0)
evaluation$Home_Maker <- ifelse(evaluation$JOB == "Home Maker", 1, 0)
evaluation$Student <- ifelse(evaluation$JOB == "Student", 1, 0)
evaluation_mod <- evaluation
write.csv(evaluation_mod, file="evaluation_mod.csv")
TARGET_FLAG <- predict(reduced_model, newdata = evaluation, type="response")
y_pred_num <- ifelse(TARGET_FLAG > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels=c(0, 1))
summary(y_pred)
## 0 1
## 1880 261