In this homework assignment, you will explore, analyze and model a data set containing approximately 8,000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0: a 1 means that the person was in a car crash, and a 0 means that the person was not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did crash their car, it is a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
| VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
|---|---|---|
| INDEX | Identification Variable (do not use) | None |
| TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | None |
| TARGET_AMT | If car was in a crash, what was the cost | None |
| AGE | Age of Driver | Very young people tend to be risky. Maybe very old people also. |
| BLUEBOOK | Value of Vehicle | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
| CAR_AGE | Vehicle Age | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
| CAR_TYPE | Type of Car | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
| CAR_USE | Vehicle Use | Commercial vehicles are driven more, so might increase probability of collision |
| CLM_FREQ | # Claims (Past 5 Years) | The more claims you filed in the past, the more you are likely to file in the future |
| EDUCATION | Max Education Level | Unknown effect, but in theory more educated people tend to drive more safely |
| HOMEKIDS | # Children at Home | Unknown effect |
| HOME_VAL | Home Value | In theory, home owners tend to drive more responsibly |
| INCOME | Income | In theory, rich people tend to get into fewer crashes |
| JOB | Job Category | In theory, white collar jobs tend to be safer |
| KIDSDRIV | # Driving Children | When teenagers drive your car, you are more likely to get into crashes |
| MSTATUS | Marital Status | In theory, married people drive more safely |
| MVR_PTS | Motor Vehicle Record Points | If you get lots of traffic tickets, you tend to get into more crashes |
| OLDCLAIM | Total Claims (Past 5 Years) | If your total payout over the past five years was high, this suggests future payouts will be high |
| PARENT1 | Single Parent | Unknown effect |
| RED_CAR | A Red Car | Urban legend says that red cars (especially red sports cars) are riskier. Is that true? |
| REVOKED | License Revoked (Past 7 Years) | If your license was revoked in the past 7 years, you probably are a riskier driver. |
| SEX | Gender | Urban legend says that women have fewer crashes than men. Is that true? |
| TIF | Time in Force | People who have been customers for a long time are usually safer. |
| TRAVTIME | Distance to Work | Long drives to work usually suggest greater risk |
| URBANICITY | Home/Work Area | Unknown |
| YOJ | Years on Job | People who stay at a job for a long time are usually safer |
Deliverables:
A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details.
Assigned predictions (probabilities, classifications, cost) for the evaluation data set, using a 0.5 classification threshold.
Include your R statistical programming code in an Appendix.
Write Up:
Describe the size and the variables in the insurance training data set. Keep in mind that too much detail will cause a manager to lose interest, while too little detail will make the manager think you aren't doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.
Per guidance in the problem, I’ve included below both a numerical summary (mean, median, standard deviation) of numeric variables in the training dataset as well as histogram plots for those variables. We also observe that there are some variables such as INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM which should be numeric but have come through as characters. We convert such variables to numeric class to allow analysis.
We notice there are a few variables with NA’s - AGE, YOJ, INCOME, HOME_VAL and CAR_AGE. Additionally, several variables (TARGET_FLAG, TARGET_AMT, KIDSDRIV, HOMEKIDS, YOJ, INCOME, HOME_VAL, TIF, OLDCLAIM, CLM_FREQ, MVR_PTS) have minimum values of 0. Upon investigating the records in the file, the zero values in all of these columns appear to be legitimate and not entered in error. Similarly, looking at the maximum values doesn’t raise any alarm; the one exception at the low end is CAR_AGE, whose minimum of -3 is impossible and likely a data-entry error. Finally, a correlation plot shows the pairwise correlation values (-1 to +1) of the quantitative variables in the dataset.
A visual plot of the variables indicates most have skewed distributions. Only a few variables appear approximately normally distributed - age of driver (AGE) and years on job (YOJ).
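As a quick numeric check on that skew, here is a minimal sketch; the skew helper below is my own (the sample third standardized moment), not a library function:
#Sample skewness (third standardized moment); NA values are dropped first
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
#Values well above 0 indicate right skew; values near 0 suggest symmetry
sapply(training[, c("TARGET_AMT", "AGE", "YOJ", "TRAVTIME", "TIF", "MVR_PTS", "CAR_AGE")], skew)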
From the correlation matrix plot, we identify a few variables that are moderately or weakly positively correlated - HOME_VAL & INCOME (0.58), TARGET_FLAG & TARGET_AMT (0.54), CLM_FREQ & OLDCLAIM (0.49), BLUEBOOK & INCOME (0.43), CAR_AGE & INCOME (0.41), MVR_PTS & CLM_FREQ (0.40), INCOME & YOJ (0.28), HOME_VAL & YOJ (0.27), MVR_PTS & OLDCLAIM (0.27), BLUEBOOK & HOME_VAL (0.26). Similarly, there are some variables which are moderately negatively correlated - HOMEKIDS & AGE (-0.45).
This knowledge helps us narrow the size of the predictive model by including fewer variables. For example, HOME_VAL & INCOME have a positive correlation of 0.58, indicating that the higher the driver’s income, the higher the home value, which makes intuitive sense. Also, HOMEKIDS & AGE are inversely correlated, meaning the younger the driver, the more children at home, presumably because younger policyholders are more likely to still have children living at home.
We look at the target variable, which indicates whether a car was involved in a crash or not, and observe that it has a moderate positive correlation with TARGET_AMT (0.54), MVR_PTS (0.23), CLM_FREQ (0.22) and OLDCLAIM (0.14). This makes intuitive sense: if the car was involved in a crash, there will be a cost associated with it. Furthermore, drivers whose cars were involved in a crash are more likely to have been issued traffic tickets previously and to have filed claims in the past, with a monetary value attached to those past claims.
library(rmarkdown)
library(corrplot)
## corrplot 0.92 loaded
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Reading in the training and evaluation data files
training <- read.csv("/Users/tponnada/Downloads/insurance_training_data.csv")
eval <- read.csv("/Users/tponnada/Downloads/insurance-evaluation-data.csv")
#Checking the first 6 rows of the training data set, the dimensions of the data set and the usual univariate summary information.
head(training)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 1 0 0 0 60 0 11 $67,349 No
## 2 2 0 0 0 43 0 11 $91,449 No
## 3 4 0 0 0 35 1 10 $16,039 No
## 4 5 0 0 0 51 0 14 No
## 5 6 0 0 0 50 0 NA $114,986 No
## 6 7 1 2946 0 34 1 12 $125,301 Yes
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK
## 1 $0 z_No M PhD Professional 14 Private $14,230
## 2 $257,252 z_No M z_High School z_Blue Collar 22 Commercial $14,940
## 3 $124,191 Yes z_F z_High School Clerical 5 Private $4,010
## 4 $306,251 Yes M <High School z_Blue Collar 32 Private $15,440
## 5 $243,925 Yes z_F PhD Doctor 36 Private $18,000
## 6 $0 z_No z_F Bachelors z_Blue Collar 46 Commercial $17,430
## TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1 11 Minivan yes $4,461 2 No 3 18
## 2 1 Minivan yes $0 0 No 0 1
## 3 4 z_SUV no $38,690 2 No 3 10
## 4 7 Minivan yes $0 0 No 0 6
## 5 1 z_SUV no $19,217 2 Yes 3 17
## 6 1 Sports Car no $0 0 No 0 7
## URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
dim(training)
## [1] 8161 26
summary(training)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Length:8161 Length:8161
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
#Convert variables to numeric class
training$INCOME <- as.numeric(gsub('[$,]', '', training$INCOME))
training$HOME_VAL <- as.numeric(gsub('[$,]', '', training$HOME_VAL))
training$BLUEBOOK <- as.numeric(gsub('[$,]', '', training$BLUEBOOK))
training$OLDCLAIM <- as.numeric(gsub('[$,]', '', training$OLDCLAIM))
summary(training)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Min. : 0
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28097
## Median :45.00 Median :0.0000 Median :11.0 Median : 54028
## Mean :44.79 Mean :0.7212 Mean :10.5 Mean : 61898
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85986
## Max. :81.00 Max. :5.0000 Max. :23.0 Max. :367030
## NA's :6 NA's :454 NA's :445
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Min. : 0 Length:8161 Length:8161
## Class :character 1st Qu.: 0 Class :character Class :character
## Mode :character Median :161160 Mode :character Mode :character
## Mean :154867
## 3rd Qu.:238724
## Max. :885282
## NA's :464
## EDUCATION JOB TRAVTIME CAR_USE
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Min. : 1500 Min. : 1.000 Length:8161 Length:8161
## 1st Qu.: 9280 1st Qu.: 1.000 Class :character Class :character
## Median :14440 Median : 4.000 Mode :character Mode :character
## Mean :15710 Mean : 5.351
## 3rd Qu.:20850 3rd Qu.: 7.000
## Max. :69740 Max. :25.000
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Min. : 0 Min. :0.0000 Length:8161 Min. : 0.000
## 1st Qu.: 0 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Median : 0 Median :0.0000 Mode :character Median : 1.000
## Mean : 4037 Mean :0.7986 Mean : 1.696
## 3rd Qu.: 4636 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :57037 Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
#Standard deviations of variables (columns still containing NAs return NA here; pass na.rm = TRUE to sd to compute them on the non-missing values)
sapply(training[,c(3:8, 10, 15, 17:18, 21:22, 24:25)], sd)
## TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME
## 4704.0269298 0.5115341 NA 1.1163233 NA NA
## HOME_VAL TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ
## NA 15.9083334 8419.7340755 4.1466353 8777.1391044 1.1584527
## MVR_PTS CAR_AGE
## 2.1471117 NA
#Univariate plots using histograms, kernel density estimates and sorted data plotted against its index for the 14 numeric variables.
#par(mfrow = c(2, 2)) #optional: arrange the four plots per variable in one grid
#If car was in a crash, what was the cost
hist(training$TARGET_AMT, xlab = "Cost if car was in a crash", main = "")
plot(density(training$TARGET_AMT, na.rm = TRUE), main = "")
plot(sort(training$TARGET_AMT), ylab = "Cost if car was in a crash")
boxplot(TARGET_AMT ~ TARGET_FLAG, training)
#Number of driving children
hist(training$KIDSDRIV, xlab = "Number of driving children", main = "")
plot(density(training$KIDSDRIV, na.rm = TRUE), main = "")
plot(sort(training$KIDSDRIV), ylab = "Number of driving children")
boxplot(KIDSDRIV ~ TARGET_FLAG, training)
#Age of driver
hist(training$AGE, xlab = "Age of driver", main = "")
plot(density(training$AGE, na.rm = TRUE), main = "")
plot(sort(training$AGE), ylab = "Age of driver")
boxplot(AGE ~ TARGET_FLAG, training)
#Number of children at home
hist(training$HOMEKIDS, xlab = "Number of children at home", main = "")
plot(density(training$HOMEKIDS, na.rm = TRUE), main = "")
plot(sort(training$HOMEKIDS), ylab = "Number of children at home")
boxplot(HOMEKIDS ~ TARGET_FLAG, training)
#Years on job
hist(training$YOJ, xlab = "Years on job", main = "")
plot(density(training$YOJ, na.rm = TRUE), main = "")
plot(sort(training$YOJ), ylab = "Years on job")
boxplot(YOJ ~ TARGET_FLAG, training)
#Income
hist(training$INCOME, xlab = "Income", main = "")
plot(density(training$INCOME, na.rm = TRUE), main = "")
plot(sort(training$INCOME), ylab = "Income")
boxplot(INCOME ~ TARGET_FLAG, training)
#Home Value
hist(training$HOME_VAL, xlab = "Home Value", main = "")
plot(density(training$HOME_VAL, na.rm = TRUE), main = "")
plot(sort(training$HOME_VAL), ylab = "Home Value")
boxplot(HOME_VAL ~ TARGET_FLAG, training)
#Travel Time
hist(training$TRAVTIME, xlab = "Travel Time", main = "")
plot(density(training$TRAVTIME, na.rm = TRUE), main = "")
plot(sort(training$TRAVTIME), ylab = "Travel Time")
boxplot(TRAVTIME ~ TARGET_FLAG, training)
#Bluebook value of vehicle
hist(training$BLUEBOOK, xlab = "Bluebook value of vehicle", main = "")
plot(density(training$BLUEBOOK, na.rm = TRUE), main = "")
plot(sort(training$BLUEBOOK), ylab = "Bluebook value of vehicle")
boxplot(BLUEBOOK ~ TARGET_FLAG, training)
#Time in Force
hist(training$TIF, xlab = "Time in Force", main = "")
plot(density(training$TIF, na.rm = TRUE), main = "")
plot(sort(training$TIF), ylab = "Time in Force")
boxplot(TIF ~ TARGET_FLAG, training)
#Old Claims
hist(training$OLDCLAIM, xlab = "Total Claims", main = "")
plot(density(training$OLDCLAIM, na.rm = TRUE), main = "")
plot(sort(training$OLDCLAIM), ylab = "Total Claims")
boxplot(OLDCLAIM ~ TARGET_FLAG, training)
#Number of claims
hist(training$CLM_FREQ, xlab = "Number of claims", main = "")
plot(density(training$CLM_FREQ, na.rm = TRUE), main = "")
plot(sort(training$CLM_FREQ), ylab = "Number of claims")
boxplot(CLM_FREQ ~ TARGET_FLAG, training)
#Motor Vehicle Record Points
hist(training$MVR_PTS, xlab = "Motor Vehicle Record Points", main = "")
plot(density(training$MVR_PTS, na.rm = TRUE), main = "")
plot(sort(training$MVR_PTS), ylab = "Motor Vehicle Record Points")
boxplot(MVR_PTS ~ TARGET_FLAG, training)
#Vehicle age
hist(training$CAR_AGE, xlab = "Vehicle age", main = "")
plot(density(training$CAR_AGE, na.rm = TRUE), main = "")
plot(sort(training$CAR_AGE), ylab = "Vehicle age")
boxplot(CAR_AGE ~ TARGET_FLAG, training)
#Instead of using scatterplots for each of the 15 variables against each other, I used the correlation matrix.
df_new <- (training[,c(2, 3:8, 10, 15, 17:18, 21:22, 24:25)])
p.mat <- cor(df_new, use = "na.or.complete")
corrplot(p.mat, method = 'number', type = 'lower', diag = FALSE, number.cex = 0.6, tl.cex = 0.6, cl.cex = 0.6)
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations:
a. Fix missing values (maybe with a mean or median value)
b. Create flags to suggest if a variable was missing (a sketch follows this list)
c. Transform data by putting it into buckets
d. Mathematical transforms such as log or square root (or use Box-Cox)
e. Combine variables (such as ratios or adding or multiplying) to create new variables
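As a hedged illustration of option (b), a minimal sketch of missing-value indicator flags; the training_flagged copy and the *_NA column names are my own illustrative choices, and these flags are not used in the models below:
#Option (b) sketch: 0/1 flags marking which records had a missing value,
#built on a copy so the working data frame is unchanged
training_flagged <- training
for (v in c("AGE", "YOJ", "INCOME", "HOME_VAL", "CAR_AGE")) {
  training_flagged[[paste0(v, "_NA")]] <- as.integer(is.na(training_flagged[[v]]))
}
#Count of missing records per variable, recovered from the flags
colSums(training_flagged[paste0(c("AGE", "YOJ", "INCOME", "HOME_VAL", "CAR_AGE"), "_NA")])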
As observed in the data exploration section, there are some variables with missing NA values. AGE has 6 (0.1%) records that have NA values, YOJ has 454 (5.6%) records that have NA values, INCOME has 445 (5.5%) records that have NA values, HOME_VAL has 464 (5.7%) records that have NA values, CAR_AGE has 510 (6.2%) records that have NA values. We fill the missing values for these variables with the median values.
Additionally, in the data exploration section, we observed that there are some variables such as INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM which should be numeric but have come through as characters and we converted such variables to numeric class to allow analysis.
We also calculate the variance inflation factor (VIF) of each variable, which measures how strongly a given predictor is correlated with the other predictors in a regression model. A value > 5 indicates potentially severe collinearity between a given predictor variable and the other predictors in the model. Based on this threshold and the calculated VIF scores below, we decide to eliminate three predictors - EDUCATION (VIF score: 10.491987), JOB (VIF score: 23.948522) and CAR_TYPE (VIF score: 5.568248) - from the model. Finally, we consider a log(x + 1) transformation of the positively skewed variables - number of driving children (KIDSDRIV), number of children at home (HOMEKIDS), income (INCOME), home value (HOME_VAL), travel time (TRAVTIME), bluebook value of vehicle (BLUEBOOK), time in force (TIF), total claims (OLDCLAIM), motor vehicle record points (MVR_PTS) and vehicle age (CAR_AGE).
The target variable TARGET_FLAG is binary, so we cannot use a linear model: a linear model requires that the errors be approximately normally distributed, and furthermore the variance of a binary variable depends on its mean and is not constant, which violates another crucial assumption of the linear model. Hence, in the next phase, we experiment with various binomial (logistic) models.
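Concretely, the logit models below relate the crash probability to the predictors on the log-odds scale:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad p = P(\mathrm{TARGET\_FLAG} = 1)$$

so each coefficient is the change in the log-odds of a crash per unit change in its predictor, holding the others fixed.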
require(MASS)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
require(car)
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
require(kableExtra)
## Loading required package: kableExtra
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
require(dplyr)
sapply(training, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
| Variable | NA count |
|---|---|
| INDEX | 0 |
| TARGET_FLAG | 0 |
| TARGET_AMT | 0 |
| KIDSDRIV | 0 |
| AGE | 6 |
| HOMEKIDS | 0 |
| YOJ | 454 |
| INCOME | 445 |
| PARENT1 | 0 |
| HOME_VAL | 464 |
| MSTATUS | 0 |
| SEX | 0 |
| EDUCATION | 0 |
| JOB | 0 |
| TRAVTIME | 0 |
| CAR_USE | 0 |
| BLUEBOOK | 0 |
| TIF | 0 |
| CAR_TYPE | 0 |
| RED_CAR | 0 |
| OLDCLAIM | 0 |
| CLM_FREQ | 0 |
| REVOKED | 0 |
| MVR_PTS | 0 |
| CAR_AGE | 510 |
| URBANICITY | 0 |
#Fill missing values with the column median
training <- training %>% mutate_at(vars(c("AGE", "YOJ", "INCOME", "HOME_VAL", "CAR_AGE")), ~ifelse(is.na(.), median(., na.rm = TRUE), .))
#Fit a linear model on all predictors to obtain VIF scores for collinearity screening
lmod <- lm(TARGET_FLAG ~ ., training)
summary(lmod)
##
## Call:
## lm(formula = TARGET_FLAG ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8474 -0.2217 -0.0888 0.1519 1.0779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.452e-01 4.199e-02 5.840 5.42e-09 ***
## INDEX 3.646e-07 1.263e-06 0.289 0.772887
## TARGET_AMT 4.156e-05 8.273e-07 50.238 < 2e-16 ***
## KIDSDRIV 4.743e-02 8.442e-03 5.619 1.98e-08 ***
## AGE -3.442e-04 5.271e-04 -0.653 0.513708
## HOMEKIDS 3.418e-03 4.873e-03 0.701 0.483136
## YOJ -1.980e-03 1.125e-03 -1.760 0.078455 .
## INCOME -1.932e-07 1.346e-07 -1.435 0.151445
## PARENT1Yes 5.249e-02 1.507e-02 3.483 0.000499 ***
## HOME_VAL -1.456e-07 4.405e-08 -3.306 0.000952 ***
## MSTATUSz_No 4.626e-02 1.081e-02 4.279 1.90e-05 ***
## SEXz_F 1.074e-03 1.371e-02 0.078 0.937531
## EDUCATIONBachelors -4.644e-02 1.527e-02 -3.041 0.002364 **
## EDUCATIONMasters -3.584e-02 2.237e-02 -1.602 0.109112
## EDUCATIONPhD -3.681e-02 2.654e-02 -1.387 0.165556
## EDUCATIONz_High School 9.713e-03 1.282e-02 0.758 0.448709
## JOBClerical 7.181e-02 2.546e-02 2.820 0.004811 **
## JOBDoctor -1.804e-02 3.048e-02 -0.592 0.553944
## JOBHome Maker 5.856e-02 2.719e-02 2.154 0.031300 *
## JOBLawyer 1.731e-02 2.205e-02 0.785 0.432413
## JOBManager -4.284e-02 2.151e-02 -1.991 0.046468 *
## JOBProfessional 3.017e-02 2.302e-02 1.310 0.190118
## JOBStudent 6.065e-02 2.788e-02 2.175 0.029668 *
## JOBz_Blue Collar 5.786e-02 2.400e-02 2.411 0.015938 *
## TRAVTIME 1.500e-03 2.405e-04 6.236 4.72e-10 ***
## CAR_USEPrivate -8.738e-02 1.228e-02 -7.118 1.19e-12 ***
## BLUEBOOK -3.225e-06 6.430e-07 -5.016 5.38e-07 ***
## TIF -5.928e-03 9.089e-04 -6.522 7.36e-11 ***
## CAR_TYPEPanel Truck 4.463e-02 2.074e-02 2.152 0.031442 *
## CAR_TYPEPickup 5.556e-02 1.273e-02 4.364 1.29e-05 ***
## CAR_TYPESports Car 1.007e-01 1.626e-02 6.193 6.18e-10 ***
## CAR_TYPEVan 5.199e-02 1.591e-02 3.269 0.001084 **
## CAR_TYPEz_SUV 7.245e-02 1.338e-02 5.413 6.36e-08 ***
## RED_CARyes -1.443e-03 1.111e-02 -0.130 0.896718
## OLDCLAIM -1.946e-06 5.546e-07 -3.509 0.000451 ***
## CLM_FREQ 2.654e-02 4.106e-03 6.464 1.08e-10 ***
## REVOKEDYes 1.309e-01 1.295e-02 10.107 < 2e-16 ***
## MVR_PTS 1.397e-02 1.938e-03 7.208 6.18e-13 ***
## CAR_AGE 7.772e-04 9.538e-04 0.815 0.415213
## URBANICITYz_Highly Rural/ Rural -2.285e-01 1.048e-02 -21.798 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3388 on 8121 degrees of freedom
## Multiple R-squared: 0.4118, Adjusted R-squared: 0.409
## F-statistic: 145.8 on 39 and 8121 DF, p-value: < 2.2e-16
vif(lmod)
## GVIF Df GVIF^(1/(2*Df))
## INDEX 1.006790 1 1.003389
## TARGET_AMT 1.076470 1 1.037531
## KIDSDRIV 1.325523 1 1.151313
## AGE 1.468762 1 1.211925
## HOMEKIDS 2.103905 1 1.450484
## YOJ 1.423873 1 1.193261
## INCOME 2.761398 1 1.661746
## PARENT1 1.849681 1 1.360030
## HOME_VAL 2.169271 1 1.472845
## MSTATUS 1.994638 1 1.412317
## SEX 3.321252 1 1.822430
## EDUCATION 10.491987 4 1.341551
## JOB 23.948522 8 1.219565
## TRAVTIME 1.040740 1 1.020166
## CAR_USE 2.500501 1 1.581297
## BLUEBOOK 2.083593 1 1.443466
## TIF 1.009773 1 1.004875
## CAR_TYPE 5.568248 5 1.187331
## RED_CAR 1.813024 1 1.346486
## OLDCLAIM 1.684062 1 1.297714
## CLM_FREQ 1.608148 1 1.268128
## REVOKED 1.281171 1 1.131888
## MVR_PTS 1.230716 1 1.109376
## CAR_AGE 1.970736 1 1.403829
## URBANICITY 1.271210 1 1.127480
trainingdata <- training
#Log transformation on number of driving children
trainingdata$KIDSDRIV <- log(trainingdata$KIDSDRIV + 1)
#Log transformation on number of children at home
trainingdata$HOMEKIDS <- log(trainingdata$HOMEKIDS + 1)
#Log transformation on income
trainingdata$INCOME <- log(trainingdata$INCOME + 1)
#Log transformation on home value
trainingdata$HOME_VAL <- log(trainingdata$HOME_VAL + 1)
#Log transformation on travel time
trainingdata$TRAVTIME <- log(trainingdata$TRAVTIME + 1)
#Log transformation on bluebook value of vehicle
trainingdata$BLUEBOOK <- log(trainingdata$BLUEBOOK + 1)
#Log transformation on time in Force
trainingdata$TIF <- log(trainingdata$TIF + 1)
#Log transformation on old claims
trainingdata$OLDCLAIM <- log(trainingdata$OLDCLAIM + 1)
#Log transformation on motor vehicle record points
trainingdata$MVR_PTS <- log(trainingdata$MVR_PTS + 1)
#Log transformation on vehicle age (the record with CAR_AGE = -3 yields log(-2) = NaN, hence the warning below)
trainingdata$CAR_AGE <- log(trainingdata$CAR_AGE + 1)
## Warning in log(trainingdata$CAR_AGE + 1): NaNs produced
Using the training data set, build at least two different multiple linear regression models and three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion in the model or exclusion from the model, indicate why this was done.
Discuss the coefficients in the models, do they make sense? For example, if a person has a lot of traffic tickets, you would reasonably expect that person to have more car crashes. If the coefficient is negative (suggesting that the person is a safer driver), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.
For multiple linear regression model 1, we use all the predictor variables except INDEX and TARGET_FLAG to predict TARGET_AMT. We notice that KIDSDRIV, INCOME, PARENT1Yes, MSTATUSz_No, SEXz_F, TRAVTIME, CAR_USEPrivate, TIF, CAR_TYPE, CLM_FREQ, REVOKEDYes, MVR_PTS, CAR_AGE and URBANICITYz_Highly Rural have significant p-values. The adjusted R-squared of this basic multiple linear regression model is 6.681%.
For multiple linear regression model 2, we use the following predictor variables - INCOME, YOJ, CLM_FREQ, MVR_PTS and BLUEBOOK. I selected these variables manually by looking at the correlated variable pairs from the data exploration section and selecting one variable from each of the most highly correlated pairs. Intuitively, one would expect that drivers with higher incomes and more years of employment would drive more cautiously, resulting in fewer accidents and fewer claims, as these drivers are in a good position financially and don’t want to jeopardize that. On the contrary, drivers with a prior record of claims and/or existing motor vehicle record points are likely to be involved in more accidents, as these drivers have little to lose from driving rashly. We see from model 2 that all of the selected predictor variables are significant in the model except YOJ. The adjusted R-squared of this parsimonious multiple linear regression model, however, is worse than the full model at 2.567%.
For multiple linear regression model 3, we use variables which theoretically should have an impact on the amount claimed if a car is involved in a crash, as identified in the problem - AGE, CLM_FREQ, HOME_VAL, INCOME, JOB, KIDSDRIV, MSTATUS, MVR_PTS, OLDCLAIM, RED_CAR, REVOKED, SEX, TIF, TRAVTIME and YOJ. We see from model 3 that all of the selected predictor variables are significant in the model except AGE, HOME_VAL, JOBClerical, JOBHome Maker, JOBProfessional, JOBStudent, JOBz_Blue Collar, RED_CARyes, OLDCLAIM, SEXz_F, TRAVTIME and YOJ. The adjusted R-squared of this theoretical multiple linear regression model is only 4.224%.
I also ran a multiple linear regression model (model 8) with the log-transformed independent variables, excluding TARGET_FLAG and INDEX, but this model had an adjusted R-squared of only 6.512% - better than models 2 and 3 but slightly worse than the baseline multiple linear regression model.
Regarding the binary logistic regression models, for model 4 we fit a basic logit model that includes all predictor variables in their untransformed state, excluding TARGET_AMT since this variable is nonzero only when TARGET_FLAG is 1 (i.e. the car was in a crash). We also exclude the INDEX variable since it has no explanatory value. From this initial model, we observe that KIDSDRIV, INCOME, PARENT1Yes, HOME_VAL, MSTATUSz_No, EDUCATIONBachelors, JOBClerical, JOBManager, TRAVTIME, CAR_USEPrivate, BLUEBOOK, TIF, CAR_TYPE, OLDCLAIM, CLM_FREQ, REVOKEDYes, MVR_PTS, CAR_AGE and URBANICITY have significant p-values, and these are largely the same variables that were found to be significant in the basic linear model 1. To test the goodness of fit, we take as the null hypothesis that the model is correctly specified. We pass the residual deviance along with the residual degrees of freedom to pchisq and find no evidence to reject the null hypothesis that model 4 fits: deviance = 7297.6 < residual df = 8161 - 38 = 8123.
For model 5, we deploy a logit model using the variables from model 2. Here, however, we observe evidence to reject the null hypothesis that the model fits: deviance = 8728.1 > residual df = 8161 - 6 = 8155, with a chi-square p-value of roughly 5.6e-06.
For model 6, we use variables which theoretically should have an impact on the probability of a car crash and the amount claimed if a car is involved in a crash, as identified in the problem - AGE, CLM_FREQ, HOME_VAL, INCOME, JOB, KIDSDRIV, MSTATUS, MVR_PTS, OLDCLAIM, RED_CAR, REVOKED, SEX, TIF, TRAVTIME and YOJ. Here the deviance slightly exceeds the residual degrees of freedom (8176.4 > 8161 - 23 = 8138), though the chi-square p-value (0.38) does not formally reject the fit. The AIC for models 5 and 6 increases to 8740.1 and 8222.4 respectively, from 7373.6 (model 4’s AIC).
Finally, for model 7, we use the log-transformed variables developed in the data preparation phase above and exclude TARGET_AMT and INDEX, as we did with model 4, our baseline logit model. From this model, which uses transformed values for the positively skewed variables, we observe that KIDSDRIV, INCOME, PARENT1Yes, HOME_VAL, MSTATUSz_No, EDUCATIONBachelors, JOBClerical, JOBManager, JOBz_Blue Collar, TRAVTIME, CAR_USEPrivate, BLUEBOOK, TIF, CAR_TYPE, OLDCLAIM, CLM_FREQ, REVOKEDYes, MVR_PTS and URBANICITY have significant p-values - largely the same variables that were significant in the basic logit model 4 and in the basic linear model 1. Passing the residual deviance along with the residual degrees of freedom to pchisq, we find, as with baseline logit model 4, no evidence to reject the null hypothesis that model 7 fits (deviance = 7287.1 < residual df = 8122). Furthermore, the AIC improves slightly to 7363.1 from model 4’s 7373.6.
We decide to use model 7 as our final logit model and model 1 as our baseline multiple linear regression model for evaluation purposes.
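For reference, here is a minimal sketch of how the evaluation-set predictions will be produced from these two models. It is hedged: eval_imp and eval_log are hypothetical names for copies of eval given the same preparation as training and trainingdata respectively (dollar fields converted to numeric, medians imputed, and log(x + 1) transforms applied for eval_log).
#Score the evaluation set (sketch; eval_imp/eval_log are assumed to be
#preprocessed exactly like training and trainingdata respectively)
eval_prob <- predict(gmod4, newdata = eval_log, type = "response") #crash probabilities
eval_class <- ifelse(eval_prob > 0.5, 1, 0) #classifications at the 0.5 threshold
eval_cost <- pmax(predict(lmod1, newdata = eval_imp), 0) #predicted TARGET_AMT, floored at zero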
require(MASS)
require(car)
require(kableExtra)
require(dplyr)
##Model 1:
lmod1 <- lm(TARGET_AMT ~ .-INDEX - TARGET_FLAG, training)
summary(lmod1)
##
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5891 -1698 -760 344 103785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.065e+03 5.572e+02 1.912 0.05595 .
## KIDSDRIV 3.142e+02 1.132e+02 2.777 0.00551 **
## AGE 5.220e+00 7.066e+00 0.739 0.46005
## HOMEKIDS 7.764e+01 6.535e+01 1.188 0.23487
## YOJ -3.952e+00 1.509e+01 -0.262 0.79336
## INCOME -4.412e-03 1.805e-03 -2.444 0.01453 *
## PARENT1Yes 5.762e+02 2.020e+02 2.852 0.00435 **
## HOME_VAL -5.580e-04 5.908e-04 -0.945 0.34492
## MSTATUSz_No 5.701e+02 1.449e+02 3.935 8.38e-05 ***
## SEXz_F -3.688e+02 1.838e+02 -2.007 0.04481 *
## EDUCATIONBachelors -2.590e+02 2.048e+02 -1.265 0.20598
## EDUCATIONMasters 2.347e+01 2.999e+02 0.078 0.93764
## EDUCATIONPhD 2.863e+02 3.559e+02 0.804 0.42127
## EDUCATIONz_High School -8.888e+01 1.719e+02 -0.517 0.60520
## JOBClerical 5.293e+02 3.414e+02 1.550 0.12110
## JOBDoctor -4.997e+02 4.087e+02 -1.223 0.22151
## JOBHome Maker 3.524e+02 3.646e+02 0.967 0.33380
## JOBLawyer 2.308e+02 2.956e+02 0.781 0.43498
## JOBManager -4.787e+02 2.885e+02 -1.660 0.09704 .
## JOBProfessional 4.565e+02 3.088e+02 1.478 0.13936
## JOBStudent 2.876e+02 3.740e+02 0.769 0.44180
## JOBz_Blue Collar 5.077e+02 3.218e+02 1.577 0.11473
## TRAVTIME 1.195e+01 3.222e+00 3.709 0.00021 ***
## CAR_USEPrivate -7.798e+02 1.644e+02 -4.743 2.14e-06 ***
## BLUEBOOK 1.433e-02 8.623e-03 1.662 0.09647 .
## TIF -4.820e+01 1.218e+01 -3.958 7.63e-05 ***
## CAR_TYPEPanel Truck 2.648e+02 2.782e+02 0.952 0.34112
## CAR_TYPEPickup 3.754e+02 1.707e+02 2.200 0.02786 *
## CAR_TYPESports Car 1.022e+03 2.178e+02 4.693 2.74e-06 ***
## CAR_TYPEVan 5.145e+02 2.132e+02 2.413 0.01584 *
## CAR_TYPEz_SUV 7.518e+02 1.793e+02 4.193 2.78e-05 ***
## RED_CARyes -4.820e+01 1.491e+02 -0.323 0.74641
## OLDCLAIM -1.057e-02 7.436e-03 -1.421 0.15527
## CLM_FREQ 1.417e+02 5.503e+01 2.575 0.01005 *
## REVOKEDYes 5.494e+02 1.735e+02 3.166 0.00155 **
## MVR_PTS 1.753e+02 2.592e+01 6.765 1.43e-11 ***
## CAR_AGE -2.681e+01 1.279e+01 -2.097 0.03606 *
## URBANICITYz_Highly Rural/ Rural -1.665e+03 1.394e+02 -11.943 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4544 on 8123 degrees of freedom
## Multiple R-squared: 0.07104, Adjusted R-squared: 0.06681
## F-statistic: 16.79 on 37 and 8123 DF, p-value: < 2.2e-16
##Model 2:
lmod2 <- lm(TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + BLUEBOOK, training)
summary(lmod2)
##
## Call:
## lm(formula = TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS +
## BLUEBOOK, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4328 -1519 -985 -361 104225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.037e+03 1.735e+02 5.978 2.35e-09 ***
## INCOME -5.763e-03 1.259e-03 -4.576 4.81e-06 ***
## YOJ -3.854e+00 1.343e+01 -0.287 0.7741
## CLM_FREQ 2.939e+02 4.836e+01 6.078 1.28e-09 ***
## MVR_PTS 2.336e+02 2.611e+01 8.947 < 2e-16 ***
## BLUEBOOK 1.471e-02 6.728e-03 2.186 0.0288 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4643 on 8155 degrees of freedom
## Multiple R-squared: 0.02627, Adjusted R-squared: 0.02567
## F-statistic: 44 on 5 and 8155 DF, p-value: < 2.2e-16
##Model 3:
lmod3 <- lm(TARGET_AMT ~ AGE + CLM_FREQ + HOME_VAL + INCOME + JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + REVOKED + SEX + TIF + TRAVTIME + YOJ, training)
summary(lmod3)
##
## Call:
## lm(formula = TARGET_AMT ~ AGE + CLM_FREQ + HOME_VAL + INCOME +
## JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR +
## REVOKED + SEX + TIF + TRAVTIME + YOJ, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4892 -1622 -885 73 104539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.410e+03 4.578e+02 3.080 0.002076 **
## AGE -2.536e+00 6.271e+00 -0.404 0.685992
## CLM_FREQ 2.915e+02 5.456e+01 5.343 9.39e-08 ***
## HOME_VAL -3.655e-04 5.966e-04 -0.613 0.540190
## INCOME -3.512e-03 1.698e-03 -2.068 0.038685 *
## JOBClerical -4.402e+02 2.731e+02 -1.612 0.107013
## JOBDoctor -1.025e+03 3.599e+02 -2.847 0.004428 **
## JOBHome Maker -6.426e+02 3.299e+02 -1.948 0.051497 .
## JOBLawyer -6.608e+02 2.645e+02 -2.498 0.012517 *
## JOBManager -1.024e+03 2.554e+02 -4.010 6.13e-05 ***
## JOBProfessional -4.283e+02 2.536e+02 -1.689 0.091353 .
## JOBStudent -3.898e+02 3.211e+02 -1.214 0.224887
## JOBz_Blue Collar 3.704e+01 2.470e+02 0.150 0.880802
## KIDSDRIV 4.057e+02 1.008e+02 4.026 5.73e-05 ***
## MSTATUSz_No 7.310e+02 1.271e+02 5.752 9.13e-09 ***
## MVR_PTS 2.119e+02 2.612e+01 8.114 5.60e-16 ***
## OLDCLAIM -1.158e-02 7.524e-03 -1.538 0.123984
## RED_CARyes -5.204e+01 1.508e+02 -0.345 0.730041
## REVOKEDYes 7.558e+02 1.751e+02 4.317 1.60e-05 ***
## SEXz_F -1.174e+02 1.412e+02 -0.832 0.405685
## TIF -4.541e+01 1.233e+01 -3.684 0.000231 ***
## TRAVTIME 5.846e+00 3.224e+00 1.813 0.069861 .
## YOJ 3.011e+00 1.492e+01 0.202 0.840115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4604 on 8138 degrees of freedom
## Multiple R-squared: 0.04482, Adjusted R-squared: 0.04224
## F-statistic: 17.36 on 22 and 8138 DF, p-value: < 2.2e-16
## Model 8 (fit as lmod4, using the log-transformed data):
lmod4 <- lm(TARGET_AMT ~ .- INDEX - TARGET_FLAG, data = trainingdata)
summary(lmod4)
##
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = trainingdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5524 -1692 -767 364 104245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -381.997 1134.865 -0.337 0.736426
## KIDSDRIV 594.482 203.775 2.917 0.003540 **
## AGE 4.778 7.205 0.663 0.507297
## HOMEKIDS 169.279 145.208 1.166 0.243742
## YOJ 11.914 18.547 0.642 0.520633
## INCOME -60.141 30.622 -1.964 0.049569 *
## PARENT1Yes 535.818 209.094 2.563 0.010408 *
## HOME_VAL -14.259 12.554 -1.136 0.256058
## MSTATUSz_No 573.026 151.273 3.788 0.000153 ***
## SEXz_F -373.629 179.387 -2.083 0.037299 *
## EDUCATIONBachelors -354.551 203.742 -1.740 0.081862 .
## EDUCATIONMasters -188.315 287.684 -0.655 0.512750
## EDUCATIONPhD -65.879 334.482 -0.197 0.843866
## EDUCATIONz_High School -119.226 171.815 -0.694 0.487752
## JOBClerical 654.016 336.893 1.941 0.052255 .
## JOBDoctor -490.708 409.119 -1.199 0.230397
## JOBHome Maker 426.700 370.213 1.153 0.249116
## JOBLawyer 284.819 295.306 0.964 0.334830
## JOBManager -454.536 288.345 -1.576 0.114981
## JOBProfessional 507.414 308.363 1.646 0.099904 .
## JOBStudent 278.878 389.861 0.715 0.474427
## JOBz_Blue Collar 569.229 320.905 1.774 0.076130 .
## TRAVTIME 338.622 89.181 3.797 0.000148 ***
## CAR_USEPrivate -786.794 164.620 -4.779 1.79e-06 ***
## BLUEBOOK 148.802 102.670 1.449 0.147286
## TIF -288.216 71.635 -4.023 5.79e-05 ***
## CAR_TYPEPanel Truck 278.092 263.019 1.057 0.290402
## CAR_TYPEPickup 385.536 170.576 2.260 0.023835 *
## CAR_TYPESports Car 1030.863 216.038 4.772 1.86e-06 ***
## CAR_TYPEVan 503.431 212.436 2.370 0.017821 *
## CAR_TYPEz_SUV 758.477 174.236 4.353 1.36e-05 ***
## RED_CARyes -47.304 149.203 -0.317 0.751220
## OLDCLAIM 21.153 24.008 0.881 0.378306
## CLM_FREQ 67.172 85.722 0.784 0.433296
## REVOKEDYes 425.330 156.851 2.712 0.006709 **
## MVR_PTS 421.208 76.679 5.493 4.07e-08 ***
## CAR_AGE -143.949 82.247 -1.750 0.080121 .
## URBANICITYz_Highly Rural/ Rural -1640.568 139.941 -11.723 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4549 on 8122 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.06936, Adjusted R-squared: 0.06512
## F-statistic: 16.36 on 37 and 8122 DF, p-value: < 2.2e-16
##Model 4:
gmod1 <- glm(formula = TARGET_FLAG ~ .- INDEX - TARGET_AMT, family = binomial(link = "logit"), data = training)
summary(gmod1)
##
## Call:
## glm(formula = TARGET_FLAG ~ . - INDEX - TARGET_AMT, family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5850 -0.7127 -0.3983 0.6264 3.1526
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.274e-01 3.215e-01 -2.885 0.003916 **
## KIDSDRIV 3.862e-01 6.122e-02 6.308 2.82e-10 ***
## AGE -1.013e-03 4.020e-03 -0.252 0.801103
## HOMEKIDS 4.965e-02 3.713e-02 1.337 0.181171
## YOJ -1.106e-02 8.582e-03 -1.289 0.197521
## INCOME -3.421e-06 1.082e-06 -3.163 0.001562 **
## PARENT1Yes 3.820e-01 1.096e-01 3.485 0.000492 ***
## HOME_VAL -1.307e-06 3.420e-07 -3.821 0.000133 ***
## MSTATUSz_No 4.938e-01 8.358e-02 5.908 3.46e-09 ***
## SEXz_F -8.245e-02 1.120e-01 -0.736 0.461749
## EDUCATIONBachelors -3.794e-01 1.156e-01 -3.281 0.001034 **
## EDUCATIONMasters -2.867e-01 1.787e-01 -1.604 0.108742
## EDUCATIONPhD -1.641e-01 2.139e-01 -0.767 0.442943
## EDUCATIONz_High School 1.799e-02 9.505e-02 0.189 0.849879
## JOBClerical 4.108e-01 1.967e-01 2.089 0.036709 *
## JOBDoctor -4.457e-01 2.671e-01 -1.669 0.095163 .
## JOBHome Maker 2.325e-01 2.102e-01 1.106 0.268573
## JOBLawyer 1.050e-01 1.695e-01 0.619 0.535705
## JOBManager -5.572e-01 1.715e-01 -3.248 0.001163 **
## JOBProfessional 1.619e-01 1.784e-01 0.908 0.364065
## JOBStudent 2.163e-01 2.145e-01 1.008 0.313345
## JOBz_Blue Collar 3.107e-01 1.856e-01 1.674 0.094041 .
## TRAVTIME 1.457e-02 1.883e-03 7.736 1.03e-14 ***
## CAR_USEPrivate -7.564e-01 9.172e-02 -8.247 < 2e-16 ***
## BLUEBOOK -2.084e-05 5.263e-06 -3.960 7.50e-05 ***
## TIF -5.546e-02 7.344e-03 -7.553 4.27e-14 ***
## CAR_TYPEPanel Truck 5.607e-01 1.618e-01 3.466 0.000529 ***
## CAR_TYPEPickup 5.540e-01 1.007e-01 5.500 3.81e-08 ***
## CAR_TYPESports Car 1.025e+00 1.299e-01 7.892 2.97e-15 ***
## CAR_TYPEVan 6.185e-01 1.265e-01 4.890 1.01e-06 ***
## CAR_TYPEz_SUV 7.682e-01 1.113e-01 6.904 5.06e-12 ***
## RED_CARyes -9.692e-03 8.636e-02 -0.112 0.910644
## OLDCLAIM -1.389e-05 3.910e-06 -3.554 0.000380 ***
## CLM_FREQ 1.960e-01 2.855e-02 6.865 6.66e-12 ***
## REVOKEDYes 8.874e-01 9.133e-02 9.716 < 2e-16 ***
## MVR_PTS 1.133e-01 1.361e-02 8.324 < 2e-16 ***
## CAR_AGE -1.080e-03 7.541e-03 -0.143 0.886064
## URBANICITYz_Highly Rural/ Rural -2.390e+00 1.128e-01 -21.181 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7297.6 on 8123 degrees of freedom
## AIC: 7373.6
##
## Number of Fisher Scoring iterations: 5
df <- 8123
deviance <- 7297.6
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 1
## Model 5:
gmod2 <- glm(formula = TARGET_FLAG ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + BLUEBOOK, family = binomial(link = "logit"), data = training)
summary(gmod2)
##
## Call:
## glm(formula = TARGET_FLAG ~ INCOME + YOJ + CLM_FREQ + MVR_PTS +
## BLUEBOOK, family = binomial(link = "logit"), data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7619 -0.7682 -0.6198 1.0140 2.6328
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.852e-01 8.506e-02 -10.407 < 2e-16 ***
## INCOME -5.815e-06 7.193e-07 -8.084 6.25e-16 ***
## YOJ -1.166e-02 6.613e-03 -1.763 0.0779 .
## CLM_FREQ 2.859e-01 2.282e-02 12.525 < 2e-16 ***
## MVR_PTS 1.517e-01 1.229e-02 12.338 < 2e-16 ***
## BLUEBOOK -1.538e-05 3.572e-06 -4.307 1.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 8728.1 on 8155 degrees of freedom
## AIC: 8740.1
##
## Number of Fisher Scoring iterations: 4
df <- 8155
deviance <- 8728.1
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 5.624213e-06
## Model 6:
gmod3 <- glm(formula = TARGET_FLAG ~ AGE + CLM_FREQ + HOME_VAL + INCOME + JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + REVOKED + SEX + TIF + TRAVTIME + YOJ, family = binomial(link = "logit"), data = training)
summary(gmod3)
##
## Call:
## glm(formula = TARGET_FLAG ~ AGE + CLM_FREQ + HOME_VAL + INCOME +
## JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR +
## REVOKED + SEX + TIF + TRAVTIME + YOJ, family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0416 -0.7563 -0.5406 0.8256 2.7859
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.656e-01 2.444e-01 -3.133 0.001732 **
## AGE -8.852e-03 3.306e-03 -2.677 0.007423 **
## CLM_FREQ 3.263e-01 2.688e-02 12.140 < 2e-16 ***
## HOME_VAL -1.252e-06 3.259e-07 -3.841 0.000123 ***
## INCOME -4.414e-06 9.788e-07 -4.510 6.49e-06 ***
## JOBClerical -2.729e-01 1.455e-01 -1.875 0.060747 .
## JOBDoctor -9.512e-01 2.384e-01 -3.990 6.59e-05 ***
## JOBHome Maker -4.717e-01 1.769e-01 -2.666 0.007686 **
## JOBLawyer -5.707e-01 1.482e-01 -3.852 0.000117 ***
## JOBManager -9.045e-01 1.488e-01 -6.078 1.22e-09 ***
## JOBProfessional -4.326e-01 1.374e-01 -3.148 0.001645 **
## JOBStudent -2.297e-01 1.693e-01 -1.356 0.174998
## JOBz_Blue Collar 1.137e-01 1.303e-01 0.872 0.382955
## KIDSDRIV 3.543e-01 4.921e-02 7.198 6.09e-13 ***
## MSTATUSz_No 4.886e-01 6.654e-02 7.343 2.09e-13 ***
## MVR_PTS 1.418e-01 1.286e-02 11.033 < 2e-16 ***
## OLDCLAIM -1.421e-05 3.722e-06 -3.819 0.000134 ***
## RED_CARyes -1.455e-02 8.127e-02 -0.179 0.857874
## REVOKEDYes 1.004e+00 8.557e-02 11.731 < 2e-16 ***
## SEXz_F 6.642e-02 7.593e-02 0.875 0.381743
## TIF -4.768e-02 6.928e-03 -6.882 5.90e-12 ***
## TRAVTIME 6.000e-03 1.704e-03 3.522 0.000429 ***
## YOJ -5.916e-03 7.775e-03 -0.761 0.446749
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 8176.4 on 8138 degrees of freedom
## AIC: 8222.4
##
## Number of Fisher Scoring iterations: 5
df <- 8138
deviance <- 8176.4
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 0.3798994
## Model 7:
gmod4 <- glm(formula = TARGET_FLAG ~ .- INDEX - TARGET_AMT, family = binomial(link = "logit"), data = trainingdata)
summary(gmod4)
##
## Call:
## glm(formula = TARGET_FLAG ~ . - INDEX - TARGET_AMT, family = binomial(link = "logit"),
## data = trainingdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3811 -0.7113 -0.4017 0.6225 3.1486
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.464062 0.649109 2.255 0.024102 *
## KIDSDRIV 0.707013 0.110763 6.383 1.74e-10 ***
## AGE -0.002552 0.004108 -0.621 0.534481
## HOMEKIDS 0.096804 0.083740 1.156 0.247677
## YOJ 0.018835 0.010844 1.737 0.082410 .
## INCOME -0.093005 0.017662 -5.266 1.40e-07 ***
## PARENT1Yes 0.333787 0.114598 2.913 0.003583 **
## HOME_VAL -0.029651 0.006898 -4.298 1.72e-05 ***
## MSTATUSz_No 0.496122 0.087959 5.640 1.70e-08 ***
## SEXz_F -0.121475 0.108365 -1.121 0.262294
## EDUCATIONBachelors -0.405371 0.114887 -3.528 0.000418 ***
## EDUCATIONMasters -0.346076 0.171257 -2.021 0.043301 *
## EDUCATIONPhD -0.376881 0.201891 -1.867 0.061935 .
## EDUCATIONz_High School 0.030946 0.095237 0.325 0.745228
## JOBClerical 0.524226 0.193680 2.707 0.006796 **
## JOBDoctor -0.386625 0.264350 -1.463 0.143590
## JOBHome Maker 0.130236 0.216936 0.600 0.548277
## JOBLawyer 0.174534 0.168488 1.036 0.300255
## JOBManager -0.516765 0.170062 -3.039 0.002376 **
## JOBProfessional 0.217652 0.177501 1.226 0.220125
## JOBStudent -0.009498 0.223017 -0.043 0.966028
## JOBz_Blue Collar 0.391275 0.184633 2.119 0.034073 *
## TRAVTIME 0.429713 0.054342 7.908 2.63e-15 ***
## CAR_USEPrivate -0.759515 0.092103 -8.246 < 2e-16 ***
## BLUEBOOK -0.310406 0.058910 -5.269 1.37e-07 ***
## TIF -0.315921 0.041411 -7.629 2.37e-14 ***
## CAR_TYPEPanel Truck 0.480174 0.150107 3.199 0.001380 **
## CAR_TYPEPickup 0.584159 0.100635 5.805 6.45e-09 ***
## CAR_TYPESports Car 1.041343 0.128113 8.128 4.35e-16 ***
## CAR_TYPEVan 0.619988 0.125206 4.952 7.36e-07 ***
## CAR_TYPEz_SUV 0.816889 0.107738 7.582 3.40e-14 ***
## RED_CARyes -0.013767 0.086307 -0.160 0.873264
## OLDCLAIM 0.026184 0.012384 2.114 0.034484 *
## CLM_FREQ 0.086246 0.043417 1.986 0.046980 *
## REVOKEDYes 0.698831 0.081492 8.575 < 2e-16 ***
## MVR_PTS 0.281083 0.042153 6.668 2.59e-11 ***
## CAR_AGE -0.020588 0.046974 -0.438 0.661188
## URBANICITYz_Highly Rural/ Rural -2.374662 0.113025 -21.010 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9415.3 on 8159 degrees of freedom
## Residual deviance: 7287.1 on 8122 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 7363.1
##
## Number of Fisher Scoring iterations: 5
df <- 8122
deviance <- 7287.1
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 1
Decide on the criteria for selecting the best multiple linear regression model and the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.
For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots.
For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.
We run a couple of diagnostics on the selected multiple linear regression model 1 as well as on the other linear regression models devised in the prior step.
The residual vs fitted plots for the baseline full model as well as for the three other models (reduced models 2 and 3, and transformed model 8) show clear heteroscedasticity.
Next, we test the residuals for normality using Q-Q plots. The residuals in the Q-Q plots for the full as well as the reduced models do not follow the reference line closely, and hence the residuals are not normal; this skew persists across the full and reduced models.
Two observations, with index values 2904 and 5730, are identified as high-leverage points in the full model (model 1).
We identify influential points in the full, reduced and transformed models, and removing the largest influential point in each model doesn’t change the summary stats meaningfully. Model 1’s adjusted R-squared improves slightly to 6.897% from 6.681%, model 2’s improves slightly to 2.578% from 2.567%, model 3’s declines slightly to 4.221% from 4.224%, and model 8’s improves slightly to 6.513% from 6.512%.
Based on the diagnostic plots and the summary analysis, I decide to keep model 1, after removing the influential point, as my predictive model; it has an F-statistic of ~17 and an adjusted R-squared of ~7%, i.e. it explains about 7% of the variance in TARGET_AMT. The coefficients are quite similar in both versions of the model.
This model can be written as:
TARGET_AMT = 1059 + (320.2 * KIDSDRIV) + (4.102 * AGE) + (82.87 * HOMEKIDS) - (3.657 * YOJ) - (0.004479 * INCOME) + (555.5 * PARENT1Yes) - (0.0005192 * HOME_VAL) + ….
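To round out the rubric metrics, here is a sketch computing the training-set confusion-matrix statistics and AUC for the final logit model (model 7) at the 0.5 threshold, along with the RMSE of the selected linear model (lmodf1, fit in the diagnostics code below). The pROC package is assumed to be installed.
library(pROC)
#Fitted crash probabilities and 0.5-threshold classes for model 7
prob <- predict(gmod4, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
obs <- gmod4$y #observed TARGET_FLAG for the rows used in fitting
cm <- table(Predicted = pred, Observed = obs) #confusion matrix
TP <- cm["1", "1"]; TN <- cm["0", "0"]; FP <- cm["1", "0"]; FN <- cm["0", "1"]
accuracy <- (TP + TN) / sum(cm)
error_rate <- 1 - accuracy
precision <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
auc(roc(obs, prob)) #area under the ROC curve
#RMSE of the selected linear model on the training data
sqrt(mean(residuals(lmodf1)^2))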
require(faraway)
## Loading required package: faraway
##
## Attaching package: 'faraway'
## The following objects are masked from 'package:car':
##
## logit, vif
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5 ✓ readr 1.3.1
## ✓ tibble 3.0.3 ✓ purrr 0.3.4
## ✓ tidyr 1.1.2 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x dplyr::lag() masks stats::lag()
## x car::recode() masks dplyr::recode()
## x MASS::select() masks dplyr::select()
## x purrr::some() masks car::some()
# Model 1:
par(mfrow = c(2, 2))
plot(lmod1)
#Plot of successive pairs of residuals to check for serial correlation
n1 <- length(residuals(lmod1))
plot(tail(residuals(lmod1), n1-1) ~ head(residuals(lmod1), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))
par(mfrow = c(2, 2))
#Check for leverage points using half-normal plots
hatv1 <- hatvalues(lmod1)
sum(hatv1)
## [1] 38
index <- row.names(training)
halfnorm(hatv1, labs = index, ylab = "Leverages")
## Identify influential points using Cook's distance
cookf1 <- cooks.distance(lmod1)
halfnorm(cookf1, 3, labs = index, ylab = "Cook's distances")
## Eliminating and re-running the full model
lmodf1 <- lm(TARGET_AMT ~ .-INDEX - TARGET_FLAG, training, subset = (cookf1 < max(cookf1)))
summary(lmodf1)
##
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = training,
## subset = (cookf1 < max(cookf1)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5804 -1680 -759 340 83455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.059e+03 5.389e+02 1.965 0.049466 *
## KIDSDRIV 3.202e+02 1.095e+02 2.926 0.003448 **
## AGE 4.102e+00 6.834e+00 0.600 0.548381
## HOMEKIDS 8.287e+01 6.321e+01 1.311 0.189934
## YOJ -3.657e+00 1.459e+01 -0.251 0.802093
## INCOME -4.479e-03 1.746e-03 -2.565 0.010331 *
## PARENT1Yes 5.555e+02 1.954e+02 2.843 0.004485 **
## HOME_VAL -5.192e-04 5.714e-04 -0.909 0.363614
## MSTATUSz_No 6.085e+02 1.401e+02 4.343 1.42e-05 ***
## SEXz_F -3.902e+02 1.778e+02 -2.195 0.028205 *
## EDUCATIONBachelors -2.768e+02 1.981e+02 -1.398 0.162294
## EDUCATIONMasters 2.381e+01 2.901e+02 0.082 0.934583
## EDUCATIONPhD 2.827e+02 3.443e+02 0.821 0.411560
## EDUCATIONz_High School -6.530e+01 1.663e+02 -0.393 0.694561
## JOBClerical 5.083e+02 3.302e+02 1.539 0.123820
## JOBDoctor -5.464e+02 3.954e+02 -1.382 0.167001
## JOBHome Maker 3.187e+02 3.527e+02 0.903 0.366293
## JOBLawyer 1.760e+02 2.860e+02 0.615 0.538327
## JOBManager -5.056e+02 2.790e+02 -1.812 0.070017 .
## JOBProfessional 3.540e+02 2.987e+02 1.185 0.236024
## JOBStudent 2.939e+02 3.617e+02 0.813 0.416489
## JOBz_Blue Collar 5.327e+02 3.113e+02 1.711 0.087117 .
## TRAVTIME 1.152e+01 3.117e+00 3.697 0.000220 ***
## CAR_USEPrivate -7.126e+02 1.591e+02 -4.480 7.57e-06 ***
## BLUEBOOK 1.524e-02 8.341e-03 1.828 0.067615 .
## TIF -4.526e+01 1.178e+01 -3.842 0.000123 ***
## CAR_TYPEPanel Truck 3.017e+02 2.691e+02 1.121 0.262190
## CAR_TYPEPickup 4.033e+02 1.651e+02 2.443 0.014594 *
## CAR_TYPESports Car 1.029e+03 2.106e+02 4.884 1.06e-06 ***
## CAR_TYPEVan 4.003e+02 2.063e+02 1.940 0.052365 .
## CAR_TYPEz_SUV 7.534e+02 1.734e+02 4.344 1.42e-05 ***
## RED_CARyes -8.938e+01 1.442e+02 -0.620 0.535353
## OLDCLAIM -7.926e-03 7.193e-03 -1.102 0.270546
## CLM_FREQ 1.205e+02 5.323e+01 2.263 0.023656 *
## REVOKEDYes 5.420e+02 1.678e+02 3.229 0.001246 **
## MVR_PTS 1.630e+02 2.508e+01 6.501 8.46e-11 ***
## CAR_AGE -2.215e+01 1.237e+01 -1.791 0.073401 .
## URBANICITYz_Highly Rural/ Rural -1.665e+03 1.348e+02 -12.353 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4395 on 8122 degrees of freedom
## Multiple R-squared: 0.07319, Adjusted R-squared: 0.06897
## F-statistic: 17.33 on 37 and 8122 DF, p-value: < 2.2e-16
summary(lmod1)
##
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5891 -1698 -760 344 103785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.065e+03 5.572e+02 1.912 0.05595 .
## KIDSDRIV 3.142e+02 1.132e+02 2.777 0.00551 **
## AGE 5.220e+00 7.066e+00 0.739 0.46005
## HOMEKIDS 7.764e+01 6.535e+01 1.188 0.23487
## YOJ -3.952e+00 1.509e+01 -0.262 0.79336
## INCOME -4.412e-03 1.805e-03 -2.444 0.01453 *
## PARENT1Yes 5.762e+02 2.020e+02 2.852 0.00435 **
## HOME_VAL -5.580e-04 5.908e-04 -0.945 0.34492
## MSTATUSz_No 5.701e+02 1.449e+02 3.935 8.38e-05 ***
## SEXz_F -3.688e+02 1.838e+02 -2.007 0.04481 *
## EDUCATIONBachelors -2.590e+02 2.048e+02 -1.265 0.20598
## EDUCATIONMasters 2.347e+01 2.999e+02 0.078 0.93764
## EDUCATIONPhD 2.863e+02 3.559e+02 0.804 0.42127
## EDUCATIONz_High School -8.888e+01 1.719e+02 -0.517 0.60520
## JOBClerical 5.293e+02 3.414e+02 1.550 0.12110
## JOBDoctor -4.997e+02 4.087e+02 -1.223 0.22151
## JOBHome Maker 3.524e+02 3.646e+02 0.967 0.33380
## JOBLawyer 2.308e+02 2.956e+02 0.781 0.43498
## JOBManager -4.787e+02 2.885e+02 -1.660 0.09704 .
## JOBProfessional 4.565e+02 3.088e+02 1.478 0.13936
## JOBStudent 2.876e+02 3.740e+02 0.769 0.44180
## JOBz_Blue Collar 5.077e+02 3.218e+02 1.577 0.11473
## TRAVTIME 1.195e+01 3.222e+00 3.709 0.00021 ***
## CAR_USEPrivate -7.798e+02 1.644e+02 -4.743 2.14e-06 ***
## BLUEBOOK 1.433e-02 8.623e-03 1.662 0.09647 .
## TIF -4.820e+01 1.218e+01 -3.958 7.63e-05 ***
## CAR_TYPEPanel Truck 2.648e+02 2.782e+02 0.952 0.34112
## CAR_TYPEPickup 3.754e+02 1.707e+02 2.200 0.02786 *
## CAR_TYPESports Car 1.022e+03 2.178e+02 4.693 2.74e-06 ***
## CAR_TYPEVan 5.145e+02 2.132e+02 2.413 0.01584 *
## CAR_TYPEz_SUV 7.518e+02 1.793e+02 4.193 2.78e-05 ***
## RED_CARyes -4.820e+01 1.491e+02 -0.323 0.74641
## OLDCLAIM -1.057e-02 7.436e-03 -1.421 0.15527
## CLM_FREQ 1.417e+02 5.503e+01 2.575 0.01005 *
## REVOKEDYes 5.494e+02 1.735e+02 3.166 0.00155 **
## MVR_PTS 1.753e+02 2.592e+01 6.765 1.43e-11 ***
## CAR_AGE -2.681e+01 1.279e+01 -2.097 0.03606 *
## URBANICITYz_Highly Rural/ Rural -1.665e+03 1.394e+02 -11.943 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4544 on 8123 degrees of freedom
## Multiple R-squared: 0.07104, Adjusted R-squared: 0.06681
## F-statistic: 16.79 on 37 and 8123 DF, p-value: < 2.2e-16
# Model 2:
par(mfrow = c(2, 2))
plot(lmod2)
#Plot of successive pairs of residuals to check for serial correlation
n2 <- length(residuals(lmod2))
plot(tail(residuals(lmod2), n2-1) ~ head(residuals(lmod2), n2-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))
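As a numerical companion to this plot, the lag-1 correlation of successive residuals can be checked directly; a value near zero is consistent with no serial correlation (a minimal sketch using the objects computed above):
# Lag-1 correlation of the residuals of model 2
cor(head(residuals(lmod2), n2 - 1), tail(residuals(lmod2), n2 - 1))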
par(mfrow = c(2, 2))
#Check for leverage points using half-normal plots
hatv2 <- hatvalues(lmod2)
sum(hatv2)
## [1] 6
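The hat values sum to the number of model parameters (6 here). As a rough screen, not part of the original run, observations with leverage above 2p/n are commonly flagged for closer inspection:
# Rule-of-thumb leverage cutoff: 2p/n, with p parameters and n observations
p2 <- sum(hatv2)
sum(hatv2 > 2 * p2 / nrow(training))   # count of high-leverage observations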
index <- row.names(training)
halfnorm(hatv2, labs = index, ylab = "Leverages")
## Identify influential points using Cook's distance
cookf2 <- cooks.distance(lmod2)
halfnorm(cookf2, 3, labs = index, ylab = "Cook's distances")
## Remove the single most influential point (largest Cook's distance) and refit model 2
lmodf2 <- lm(TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + BLUEBOOK, training, subset = (cookf2 < max(cookf2)))
summary(lmodf2)
##
## Call:
## lm(formula = TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS +
## BLUEBOOK, data = training, subset = (cookf2 < max(cookf2)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4223 -1509 -988 -338 104277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.112e+03 1.707e+02 6.513 7.82e-11 ***
## INCOME -5.765e-03 1.238e-03 -4.656 3.28e-06 ***
## YOJ -6.384e+00 1.320e+01 -0.483 0.6288
## CLM_FREQ 2.673e+02 4.757e+01 5.618 2.00e-08 ***
## MVR_PTS 2.381e+02 2.568e+01 9.273 < 2e-16 ***
## BLUEBOOK 1.196e-02 6.618e-03 1.807 0.0708 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4566 on 8154 degrees of freedom
## Multiple R-squared: 0.02638, Adjusted R-squared: 0.02578
## F-statistic: 44.18 on 5 and 8154 DF, p-value: < 2.2e-16
summary(lmod2)
##
## Call:
## lm(formula = TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS +
## BLUEBOOK, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4328 -1519 -985 -361 104225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.037e+03 1.735e+02 5.978 2.35e-09 ***
## INCOME -5.763e-03 1.259e-03 -4.576 4.81e-06 ***
## YOJ -3.854e+00 1.343e+01 -0.287 0.7741
## CLM_FREQ 2.939e+02 4.836e+01 6.078 1.28e-09 ***
## MVR_PTS 2.336e+02 2.611e+01 8.947 < 2e-16 ***
## BLUEBOOK 1.471e-02 6.728e-03 2.186 0.0288 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4643 on 8155 degrees of freedom
## Multiple R-squared: 0.02627, Adjusted R-squared: 0.02567
## F-statistic: 44 on 5 and 8155 DF, p-value: < 2.2e-16
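Dropping only the single largest Cook's-distance point barely changes the fit (adjusted R-squared 0.0258 vs. 0.0257). A hedged alternative, not run here, is to drop every point above the common 4/n cutoff; lmodf2b is a hypothetical name:
# Hypothetical variant: remove all points with Cook's distance >= 4/n and refit
lmodf2b <- lm(TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + BLUEBOOK,
              data = training, subset = (cookf2 < 4 / nrow(training)))
summary(lmodf2b)$adj.r.squared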
# Model 3:
par(mfrow = c(2, 2))
plot(lmod3)
#Plot of successive pairs of residuals to check for serial correlation
n3 <- length(residuals(lmod3))
plot(tail(residuals(lmod3), n3-1) ~ head(residuals(lmod3), n3-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))
par(mfrow = c(2, 2))
#Check for leverage points using half-normal plots
hatv3 <- hatvalues(lmod3)
sum(hatv3)
## [1] 23
index <- row.names(training)
halfnorm(hatv3, labs = index, ylab = "Leverages")
## Identify influential points using Cook's distance
cookf3 <- cooks.distance(lmod3)
halfnorm(cookf3, 3, labs = index, ylab = "Cook's distances")
## Remove the most influential point and refit model 3
lmodf3 <- lm(TARGET_AMT ~ AGE + CLM_FREQ + HOME_VAL + INCOME + JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + REVOKED + SEX + TIF + TRAVTIME + YOJ, training, subset = (cookf3 < max(cookf3)))
summary(lmodf3)
##
## Call:
## lm(formula = TARGET_AMT ~ AGE + CLM_FREQ + HOME_VAL + INCOME +
## JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR +
## REVOKED + SEX + TIF + TRAVTIME + YOJ, data = training, subset = (cookf3 <
## max(cookf3)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4801 -1611 -889 66 104563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.426e+03 4.502e+02 3.168 0.001539 **
## AGE -3.716e+00 6.167e+00 -0.602 0.546893
## CLM_FREQ 2.624e+02 5.368e+01 4.888 1.04e-06 ***
## HOME_VAL -5.909e-04 5.869e-04 -1.007 0.314020
## INCOME -3.221e-03 1.670e-03 -1.928 0.053840 .
## JOBClerical -3.078e+02 2.687e+02 -1.146 0.251924
## JOBDoctor -8.823e+02 3.540e+02 -2.492 0.012717 *
## JOBHome Maker -5.283e+02 3.245e+02 -1.628 0.103562
## JOBLawyer -5.236e+02 2.603e+02 -2.012 0.044263 *
## JOBManager -8.871e+02 2.513e+02 -3.531 0.000417 ***
## JOBProfessional -2.881e+02 2.496e+02 -1.154 0.248360
## JOBStudent -2.885e+02 3.158e+02 -0.914 0.360994
## JOBz_Blue Collar 1.774e+02 2.430e+02 0.730 0.465472
## KIDSDRIV 3.741e+02 9.912e+01 3.774 0.000162 ***
## MSTATUSz_No 6.761e+02 1.250e+02 5.408 6.54e-08 ***
## MVR_PTS 2.162e+02 2.569e+01 8.416 < 2e-16 ***
## OLDCLAIM -1.030e-02 7.400e-03 -1.391 0.164114
## RED_CARyes -3.790e+00 1.483e+02 -0.026 0.979618
## REVOKEDYes 7.558e+02 1.721e+02 4.391 1.14e-05 ***
## SEXz_F -8.125e+01 1.389e+02 -0.585 0.558591
## TIF -4.561e+01 1.212e+01 -3.763 0.000169 ***
## TRAVTIME 4.829e+00 3.171e+00 1.523 0.127844
## YOJ -7.489e-01 1.468e+01 -0.051 0.959307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4527 on 8137 degrees of freedom
## Multiple R-squared: 0.04479, Adjusted R-squared: 0.04221
## F-statistic: 17.34 on 22 and 8137 DF, p-value: < 2.2e-16
summary(lmod3)
##
## Call:
## lm(formula = TARGET_AMT ~ AGE + CLM_FREQ + HOME_VAL + INCOME +
## JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR +
## REVOKED + SEX + TIF + TRAVTIME + YOJ, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4892 -1622 -885 73 104539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.410e+03 4.578e+02 3.080 0.002076 **
## AGE -2.536e+00 6.271e+00 -0.404 0.685992
## CLM_FREQ 2.915e+02 5.456e+01 5.343 9.39e-08 ***
## HOME_VAL -3.655e-04 5.966e-04 -0.613 0.540190
## INCOME -3.512e-03 1.698e-03 -2.068 0.038685 *
## JOBClerical -4.402e+02 2.731e+02 -1.612 0.107013
## JOBDoctor -1.025e+03 3.599e+02 -2.847 0.004428 **
## JOBHome Maker -6.426e+02 3.299e+02 -1.948 0.051497 .
## JOBLawyer -6.608e+02 2.645e+02 -2.498 0.012517 *
## JOBManager -1.024e+03 2.554e+02 -4.010 6.13e-05 ***
## JOBProfessional -4.283e+02 2.536e+02 -1.689 0.091353 .
## JOBStudent -3.898e+02 3.211e+02 -1.214 0.224887
## JOBz_Blue Collar 3.704e+01 2.470e+02 0.150 0.880802
## KIDSDRIV 4.057e+02 1.008e+02 4.026 5.73e-05 ***
## MSTATUSz_No 7.310e+02 1.271e+02 5.752 9.13e-09 ***
## MVR_PTS 2.119e+02 2.612e+01 8.114 5.60e-16 ***
## OLDCLAIM -1.158e-02 7.524e-03 -1.538 0.123984
## RED_CARyes -5.204e+01 1.508e+02 -0.345 0.730041
## REVOKEDYes 7.558e+02 1.751e+02 4.317 1.60e-05 ***
## SEXz_F -1.174e+02 1.412e+02 -0.832 0.405685
## TIF -4.541e+01 1.233e+01 -3.684 0.000231 ***
## TRAVTIME 5.846e+00 3.224e+00 1.813 0.069861 .
## YOJ 3.011e+00 1.492e+01 0.202 0.840115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4604 on 8138 degrees of freedom
## Multiple R-squared: 0.04482, Adjusted R-squared: 0.04224
## F-statistic: 17.36 on 22 and 8138 DF, p-value: < 2.2e-16
# Model 4:
par(mfrow = c(2, 2))
plot(lmod4)
#Plot of successive pairs of residuals to check for serial correlation
n4 <- length(residuals(lmod4))
plot(tail(residuals(lmod4), n4-1) ~ head(residuals(lmod4), n4-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))
par(mfrow = c(2, 2))
#Check for leverage points using half-normal plots
hatv4 <- hatvalues(lmod4)
sum(hatv4)
## [1] 38
index <- row.names(training)
halfnorm(hatv4, labs = index, ylab = "Leverages")
## Identify influential points using Cook's distance
cookf4 <- cooks.distance(lmod4)
halfnorm(cookf4, 3, labs = index, ylab = "Cook's distances")
## Remove the most influential point and refit the full model (model 4)
lmodf4 <- lm(TARGET_AMT ~ .- INDEX - TARGET_FLAG, data = trainingdata, subset = (cookf4 < max(cookf4)))
summary(lmodf4)
##
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = trainingdata,
## subset = (cookf4 < max(cookf4)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5526 -1693 -766 364 104244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -372.058 1135.130 -0.328 0.743096
## KIDSDRIV 593.933 203.789 2.914 0.003573 **
## AGE 4.749 7.206 0.659 0.509880
## HOMEKIDS 168.655 145.221 1.161 0.245526
## YOJ 11.920 18.547 0.643 0.520448
## INCOME -60.145 30.624 -1.964 0.049563 *
## PARENT1Yes 538.234 209.171 2.573 0.010095 *
## HOME_VAL -14.153 12.557 -1.127 0.259709
## MSTATUSz_No 573.407 151.282 3.790 0.000152 ***
## SEXz_F -373.051 179.400 -2.079 0.037608 *
## EDUCATIONBachelors -354.357 203.752 -1.739 0.082045 .
## EDUCATIONMasters -187.435 287.704 -0.651 0.514752
## EDUCATIONPhD -64.913 334.505 -0.194 0.846136
## EDUCATIONz_High School -118.539 171.830 -0.690 0.490302
## JOBClerical 653.340 336.913 1.939 0.052512 .
## JOBDoctor -490.400 409.139 -1.199 0.230713
## JOBHome Maker 425.780 370.236 1.150 0.250168
## JOBLawyer 285.024 295.321 0.965 0.334507
## JOBManager -454.246 288.359 -1.575 0.115232
## JOBProfessional 507.436 308.378 1.645 0.099906 .
## JOBStudent 278.395 389.881 0.714 0.475216
## JOBz_Blue Collar 569.576 320.921 1.775 0.075966 .
## TRAVTIME 338.607 89.185 3.797 0.000148 ***
## CAR_USEPrivate -787.554 164.637 -4.784 1.75e-06 ***
## BLUEBOOK 147.943 102.692 1.441 0.149722
## TIF -288.710 71.646 -4.030 5.64e-05 ***
## CAR_TYPEPanel Truck 278.177 263.031 1.058 0.290277
## CAR_TYPEPickup 385.039 170.588 2.257 0.024026 *
## CAR_TYPESports Car 1030.181 216.054 4.768 1.89e-06 ***
## CAR_TYPEVan 503.527 212.446 2.370 0.017805 *
## CAR_TYPEz_SUV 758.838 174.246 4.355 1.35e-05 ***
## RED_CARyes -47.326 149.211 -0.317 0.751117
## OLDCLAIM 21.152 24.010 0.881 0.378358
## CLM_FREQ 67.856 85.740 0.791 0.428721
## REVOKEDYes 425.034 156.860 2.710 0.006750 **
## MVR_PTS 420.522 76.698 5.483 4.31e-08 ***
## CAR_AGE -144.500 82.260 -1.757 0.079018 .
## URBANICITYz_Highly Rural/ Rural -1638.965 139.992 -11.708 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4549 on 8121 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.06937, Adjusted R-squared: 0.06513
## F-statistic: 16.36 on 37 and 8121 DF, p-value: < 2.2e-16
summary(lmod4)
##
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = trainingdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5524 -1692 -767 364 104245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -381.997 1134.865 -0.337 0.736426
## KIDSDRIV 594.482 203.775 2.917 0.003540 **
## AGE 4.778 7.205 0.663 0.507297
## HOMEKIDS 169.279 145.208 1.166 0.243742
## YOJ 11.914 18.547 0.642 0.520633
## INCOME -60.141 30.622 -1.964 0.049569 *
## PARENT1Yes 535.818 209.094 2.563 0.010408 *
## HOME_VAL -14.259 12.554 -1.136 0.256058
## MSTATUSz_No 573.026 151.273 3.788 0.000153 ***
## SEXz_F -373.629 179.387 -2.083 0.037299 *
## EDUCATIONBachelors -354.551 203.742 -1.740 0.081862 .
## EDUCATIONMasters -188.315 287.684 -0.655 0.512750
## EDUCATIONPhD -65.879 334.482 -0.197 0.843866
## EDUCATIONz_High School -119.226 171.815 -0.694 0.487752
## JOBClerical 654.016 336.893 1.941 0.052255 .
## JOBDoctor -490.708 409.119 -1.199 0.230397
## JOBHome Maker 426.700 370.213 1.153 0.249116
## JOBLawyer 284.819 295.306 0.964 0.334830
## JOBManager -454.536 288.345 -1.576 0.114981
## JOBProfessional 507.414 308.363 1.646 0.099904 .
## JOBStudent 278.878 389.861 0.715 0.474427
## JOBz_Blue Collar 569.229 320.905 1.774 0.076130 .
## TRAVTIME 338.622 89.181 3.797 0.000148 ***
## CAR_USEPrivate -786.794 164.620 -4.779 1.79e-06 ***
## BLUEBOOK 148.802 102.670 1.449 0.147286
## TIF -288.216 71.635 -4.023 5.79e-05 ***
## CAR_TYPEPanel Truck 278.092 263.019 1.057 0.290402
## CAR_TYPEPickup 385.536 170.576 2.260 0.023835 *
## CAR_TYPESports Car 1030.863 216.038 4.772 1.86e-06 ***
## CAR_TYPEVan 503.431 212.436 2.370 0.017821 *
## CAR_TYPEz_SUV 758.477 174.236 4.353 1.36e-05 ***
## RED_CARyes -47.304 149.203 -0.317 0.751220
## OLDCLAIM 21.153 24.008 0.881 0.378306
## CLM_FREQ 67.172 85.722 0.784 0.433296
## REVOKEDYes 425.330 156.851 2.712 0.006709 **
## MVR_PTS 421.208 76.679 5.493 4.07e-08 ***
## CAR_AGE -143.949 82.247 -1.750 0.080121 .
## URBANICITYz_Highly Rural/ Rural -1640.568 139.941 -11.723 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4549 on 8122 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.06936, Adjusted R-squared: 0.06512
## F-statistic: 16.36 on 37 and 8122 DF, p-value: < 2.2e-16
We select model 7 (gmod4; "Model 4" in the table below) as our final logit model because it has the lowest AIC of the candidates (7363.1). On the classification metrics it is essentially tied with the full model gmod1 ("Model 1"): accuracy 0.778 vs. 0.780, precision 0.783 for both, specificity 0.251 vs. 0.252, and F1 0.865 vs. 0.866, with slightly lower sensitivity (0.967 vs. 0.969). Since the confusion-matrix performance is a near wash, the AIC advantage is the deciding factor.
Please note that I consulted outside references for the confusion-matrix and ROC-curve calculations.
Based on these diagnostics, model 7 is used to generate predictions on the evaluation dataset.
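For reference, the AIC comparison can be reproduced in one place (a sketch assuming gmod1 through gmod4 are the fitted logit models):
# Tabulate AIC across the candidate logit models, lowest first
aic_tab <- data.frame(Model = paste0("gmod", 1:4),
                      AIC = sapply(list(gmod1, gmod2, gmod3, gmod4), AIC))
aic_tab[order(aic_tab$AIC), ]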
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:faraway':
##
## melanoma
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(kableExtra)
library(dplyr)
formula(gmod1)
## TARGET_FLAG ~ (INDEX + TARGET_AMT + KIDSDRIV + AGE + HOMEKIDS +
## YOJ + INCOME + PARENT1 + HOME_VAL + MSTATUS + SEX + EDUCATION +
## JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR +
## OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY) -
## INDEX - TARGET_AMT
formula(gmod2)
## TARGET_FLAG ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + BLUEBOOK
formula(gmod3)
## TARGET_FLAG ~ AGE + CLM_FREQ + HOME_VAL + INCOME + JOB + KIDSDRIV +
## MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + REVOKED + SEX +
## TIF + TRAVTIME + YOJ
formula(gmod4)
## TARGET_FLAG ~ (INDEX + TARGET_AMT + KIDSDRIV + AGE + HOMEKIDS +
## YOJ + INCOME + PARENT1 + HOME_VAL + MSTATUS + SEX + EDUCATION +
## JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR +
## OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY) -
## INDEX - TARGET_AMT
# type = "response" puts predictions on the probability scale, so the 0.5 cutoff applies
preds1 = predict(gmod1, newdata = training, type = "response")
preds2 = predict(gmod2, newdata = training, type = "response")
preds3 = predict(gmod3, newdata = training, type = "response")
preds4 = predict(gmod4, newdata = trainingdata, type = "response")
preds1[preds1 >= 0.5] <- 1
preds1[preds1 < 0.5] <- 0
preds1 = as.factor(preds1)
preds2[preds2 >= 0.5] <- 1
preds2[preds2 < 0.5] <- 0
preds2 = as.factor(preds2)
preds3[preds3 >= 0.5] <- 1
preds3[preds3 < 0.5] <- 0
preds3 = as.factor(preds3)
preds4[preds4 >= 0.5] <- 1
preds4[preds4 < 0.5] <- 0
preds4[is.na(preds4)] <- 0
preds4 = as.factor(preds4)
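The per-model thresholding above is repetitive; a small helper function (a sketch equivalent to the steps above, including the NA handling used for preds4) would keep the 0.5 probability cutoff in one place:
classify <- function(model, data, cutoff = 0.5) {
  p <- predict(model, newdata = data, type = "response")
  p[is.na(p)] <- 0   # rows that cannot be predicted default to "no crash"
  factor(as.integer(p >= cutoff), levels = c(0, 1))
}
# e.g. preds4 <- classify(gmod4, trainingdata)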
m1 <- confusionMatrix(preds1, as.factor(training$TARGET_FLAG), mode = "everything")
m2 <- confusionMatrix(preds2, as.factor(training$TARGET_FLAG), mode = "everything")
m3 <- confusionMatrix(preds3, as.factor(training$TARGET_FLAG), mode = "everything")
m4 <- confusionMatrix(preds4, as.factor(trainingdata$TARGET_FLAG), mode = "everything")
temp <- data.frame(m1$overall,
m2$overall,
m3$overall,
m4$overall) %>%
t() %>%
data.frame() %>%
dplyr::select(Accuracy) %>%
mutate(Classification_Error_Rate = 1-Accuracy)
Summ_Stat <-data.frame(m1$byClass,
m2$byClass,
m3$byClass,
m4$byClass) %>%
t() %>%
data.frame() %>%
cbind(temp) %>%
mutate(Model = c("Model 1", "Model 2", "Model 3", "Model 4")) %>%
dplyr::select(Model, Accuracy, Classification_Error_Rate, Precision, Sensitivity, Specificity, F1) %>%
mutate_if(is.numeric, round, 3) %>%
kable('html', escape = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)
Summ_Stat
| Model | Accuracy | Classification_Error_Rate | Precision | Sensitivity | Specificity | F1 |
|---|---|---|---|---|---|---|
| Model 1 | 0.780 | 0.220 | 0.783 | 0.969 | 0.252 | 0.866 |
| Model 2 | 0.742 | 0.258 | 0.743 | 0.994 | 0.041 | 0.850 |
| Model 3 | 0.755 | 0.245 | 0.757 | 0.983 | 0.120 | 0.855 |
| Model 4 | 0.778 | 0.222 | 0.783 | 0.967 | 0.251 | 0.865 |
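One caveat when reading these figures: roughly 74% of the training records are non-crashes (consistent with the sensitivity/specificity pattern above), so a degenerate classifier that always predicts "no crash" already scores about 0.74 accuracy. Model 2, with specificity 0.041, barely improves on that baseline. The baseline itself is a one-liner:
mean(training$TARGET_FLAG == 0)   # majority-class (all-zero) baseline accuracy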
getROC <- function(model, data) {
  name <- deparse(substitute(model))
  prob <- predict(model, newdata = data, type = "response")
  rocobj <- pROC::roc(data$TARGET_FLAG, prob)
  # legacy.axes = TRUE plots 1 - specificity on the x-axis, so label it accordingly
  plot(rocobj, asp = NA, legacy.axes = TRUE, print.auc = TRUE,
       xlab = "1 - Specificity", main = name)
}
par(mfrow=c(3,3))
getROC(gmod1, training)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
getROC(gmod2, training)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
getROC(gmod3, training)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
getROC(gmod4, trainingdata)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
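The AUC values printed on the plots can also be collected programmatically (a sketch; same models and data as above):
# AUCs for the models fit on `training`
probs <- lapply(list(gmod1 = gmod1, gmod2 = gmod2, gmod3 = gmod3),
                predict, newdata = training, type = "response")
sapply(probs, function(p) pROC::auc(training$TARGET_FLAG, p))
# gmod4 was fit on `trainingdata`; drop the row lost to missingness first
p4 <- predict(gmod4, newdata = trainingdata, type = "response")
pROC::auc(trainingdata$TARGET_FLAG[!is.na(p4)], p4[!is.na(p4)])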
We use logit model 7 (gmod4) to predict TARGET_FLAG, and lmodf1 (the full linear model refit after removing the most influential point) to predict TARGET_AMT.
#Convert variables to numeric class
eval$TARGET_AMT <- as.numeric(gsub('[$,]', '', eval$TARGET_AMT))
eval$INCOME <- as.numeric(gsub('[$,]', '', eval$INCOME))
eval$HOME_VAL <- as.numeric(gsub('[$,]', '', eval$HOME_VAL))
eval$BLUEBOOK <- as.numeric(gsub('[$,]', '', eval$BLUEBOOK))
eval$OLDCLAIM <- as.numeric(gsub('[$,]', '', eval$OLDCLAIM))
# Predict TARGET_FLAG with gmod4; for predicted crashes, predict TARGET_AMT with lmodf1, then write the results to a CSV file
eval_probs <- predict(gmod4, newdata = eval, type='response')
eval$TARGET_FLAG <- ifelse(eval_probs > 0.5, 1, 0)
eval$TARGET_AMT <- 0.0
eval[which(eval$TARGET_FLAG == 1),]$TARGET_AMT <- predict(lmodf1, newdata = eval %>% filter(TARGET_FLAG == 1))
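# Optional guard (an added assumption, not part of the original run): a linear
# model can predict negative costs, so floor TARGET_AMT at zero before writing
eval$TARGET_AMT <- pmax(eval$TARGET_AMT, 0)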
write.csv(eval, file = '/Users/tponnada/Downloads/predictHW4.csv')
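A quick sanity check on the written predictions (same objects as above):
table(eval$TARGET_FLAG)                           # predicted class balance at the 0.5 cutoff
summary(eval$TARGET_AMT[eval$TARGET_FLAG == 1])   # predicted costs for flagged crashes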