Assignment 4

Overview

DATA 621 – Business Analytics and Data Mining

In this homework assignment, you will explore, analyze and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A “0” means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

VARIABLE NAME | DEFINITION | THEORETICAL EFFECT
INDEX | Identification Variable (do not use) | None
TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | None
TARGET_AMT | If car was in a crash, what was the cost | None
AGE | Age of Driver | Very young people tend to be risky. Maybe very old people also.
BLUEBOOK | Value of Vehicle | Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_AGE | Vehicle Age | Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_TYPE | Type of Car | Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_USE | Vehicle Use | Commercial vehicles are driven more, so might increase probability of collision
CLM_FREQ | # Claims (Past 5 Years) | The more claims you filed in the past, the more you are likely to file in the future
EDUCATION | Max Education Level | Unknown effect, but in theory more educated people tend to drive more safely
HOMEKIDS | # Children at Home | Unknown effect
HOME_VAL | Home Value | In theory, home owners tend to drive more responsibly
INCOME | Income | In theory, rich people tend to get into fewer crashes
JOB | Job Category | In theory, white collar jobs tend to be safer
KIDSDRIV | # Driving Children | When teenagers drive your car, you are more likely to get into crashes
MSTATUS | Marital Status | In theory, married people drive more safely
MVR_PTS | Motor Vehicle Record Points | If you get lots of traffic tickets, you tend to get into more crashes
OLDCLAIM | Total Claims (Past 5 Years) | If your total payout over the past five years was high, this suggests future payouts will be high
PARENT1 | Single Parent | Unknown effect
RED_CAR | A Red Car | Urban legend says that red cars (especially red sports cars) are more risky. Is that true?
REVOKED | License Revoked (Past 7 Years) | If your license was revoked in the past 7 years, you probably are a more risky driver.
SEX | Gender | Urban legend says that women have fewer crashes than men. Is that true?
TIF | Time in Force | People who have been customers for a long time are usually more safe.
TRAVTIME | Distance to Work | Long drives to work usually suggest greater risk
URBANICITY | Home/Work Area | Unknown
YOJ | Years on Job | People who stay at a job for a long time are usually more safe

Deliverables:

A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details.

Assigned predictions (probabilities, classifications, cost) for the evaluation data set. Use 0.5 threshold.

Include your R statistical programming code in an Appendix.

Write Up:

SUMMARY:

1. DATA EXPLORATION (25 Points)

Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

  1. Mean / Standard Deviation / Median
  2. Bar Chart or Box Plot of the data
  3. Is the data correlated to the target variable (or to other variables?)
  4. Are any of the variables missing and need to be imputed “fixed”?

DATA EXPLORATION SOLUTION

Per guidance in the problem, I’ve included below both a numerical summary (mean, median, standard deviation) of the numeric variables in the training dataset and histogram plots for those variables. We also observe that some variables, namely INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM, should be numeric but have come through as characters because they are stored as currency strings (for example, $67,349). We convert these variables to numeric class to allow analysis.
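
As a minimal sketch of the two steps just described, the snippet below uses a made-up vector, `income_raw`, as a stand-in for a column such as INCOME (it is not the actual data). Currency strings are stripped of `$` and `,` before conversion, and empty strings become NA. The summary statistics are then computed with `na.rm = TRUE`, because `mean()` and `sd()` return NA for any vector that contains a missing value.

```r
# Hypothetical sample mimicking how INCOME is stored in the raw file
income_raw <- c("$67,349", "$91,449", "$16,039", "", "$114,986")

# Strip "$" and "," then convert; the empty string becomes NA
income_num <- as.numeric(gsub("[$,]", "", income_raw))

# Without na.rm = TRUE each of these would be NA
income_stats <- c(
  mean   = mean(income_num, na.rm = TRUE),
  median = median(income_num, na.rm = TRUE),
  sd     = sd(income_num, na.rm = TRUE)
)
print(income_stats)
```

The same `na.rm = TRUE` argument can be passed through `sapply()`, for example `sapply(df, sd, na.rm = TRUE)`, where `df` is any data frame of numeric columns.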

We notice that a few variables have NA’s: AGE, YOJ, INCOME, HOME_VAL and CAR_AGE. Additionally, several variables (TARGET_FLAG, TARGET_AMT, KIDSDRIV, HOMEKIDS, YOJ, INCOME, HOME_VAL, OLDCLAIM, CLM_FREQ, MVR_PTS) have a minimum value of 0. Upon investigating the records in the file, the zero values in all of these columns appear to be legitimate rather than entered in error. Similarly, the maximum values do not cause any alarm. The one suspicious value is the minimum of CAR_AGE, which is -3 and cannot be a valid vehicle age. Finally, a correlation plot shows the correlation coefficients (-1 to +1) between the quantitative variables in the dataset.
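
The checks described above can be sketched in a few lines. The data frame `toy` below is made up for illustration (only the column names follow the data set); the idea is to count missing, zero and negative entries per column, which would surface values such as the minimum CAR_AGE of -3 reported in the summary output.

```r
# Hypothetical toy data frame; column names follow the data set, values are made up
toy <- data.frame(
  AGE     = c(60, 43, NA, 51),
  YOJ     = c(11, 0, 10, NA),
  CAR_AGE = c(18, 1, -3, NA)
)

# Per-column counts of missing, zero, and negative entries
quality <- data.frame(
  n_missing  = sapply(toy, function(x) sum(is.na(x))),
  n_zero     = sapply(toy, function(x) sum(x == 0, na.rm = TRUE)),
  n_negative = sapply(toy, function(x) sum(x < 0, na.rm = TRUE))
)
print(quality)
```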

Plots of the variables indicate that most have a skewed distribution. Only two variables appear approximately normally distributed: age of driver (AGE) and years on job (YOJ).

From the correlation matrix plot, we identify several pairs of variables that are moderately or weakly positively correlated: HOME_VAL & INCOME (0.58), TARGET_FLAG & TARGET_AMT (0.54), CLM_FREQ & OLDCLAIM (0.49), BLUEBOOK & INCOME (0.43), CAR_AGE & INCOME (0.41), MVR_PTS & CLM_FREQ (0.40), INCOME & YOJ (0.28), HOME_VAL & YOJ (0.27), MVR_PTS & OLDCLAIM (0.27), BLUEBOOK & HOME_VAL (0.26). One pair is moderately negatively correlated: HOMEKIDS & AGE (-0.45).

Knowing which predictors are correlated helps us reduce the size of the predictive model, since one variable from a highly correlated pair can stand in for the other. For example, HOME_VAL & INCOME have a positive correlation of 0.58, indicating that drivers with higher incomes tend to have higher home values, which makes intuitive sense. HOMEKIDS & AGE are inversely correlated, which means that younger drivers tend to have more children at home, perhaps because the driver holds only a training permit and is not yet allowed to drive young children around.

We then look at the target variable TARGET_FLAG, which indicates whether a car was involved in a crash, and observe that it has a moderate positive correlation with TARGET_AMT (0.54) and weak positive correlations with MVR_PTS (0.23), CLM_FREQ (0.22) and OLDCLAIM (0.14). This makes intuitive sense: if the car was involved in a crash, there will be a cost associated with it. Furthermore, a driver whose car was involved in a crash is more likely to have been issued traffic tickets previously, to have filed claims in the past, and to have had a monetary value associated with those past claims.
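
Reading pairs off a correlation plot by eye is error-prone, so a ranked table can help. The following is a minimal sketch on simulated data (the columns A to D are made up and are not the insurance variables): it takes the lower triangle of the correlation matrix so each pair appears once, then sorts by absolute correlation.

```r
# Simulated stand-in data; the real analysis uses the numeric training columns
set.seed(1)
n <- 200
x <- rnorm(n)
sim <- data.frame(
  A = x,
  B = x + rnorm(n, sd = 0.5),   # strongly related to A
  C = rnorm(n),                 # unrelated noise
  D = -x + rnorm(n, sd = 2)     # weakly negatively related to A
)

cor_mat <- cor(sim, use = "pairwise.complete.obs")

# Keep each pair once (lower triangle) and sort by absolute correlation
idx <- which(lower.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx],
  stringsAsFactors = FALSE
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```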

library(rmarkdown)
library(corrplot)
## corrplot 0.92 loaded
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Reading in the training and evaluation data files

training <- read.csv("/Users/tponnada/Downloads/insurance_training_data.csv")
eval <- read.csv("/Users/tponnada/Downloads/insurance-evaluation-data.csv")

#Checking the first 6 rows of the training data set, the dimensions of the data set and the usual univariate summary information.

head(training)
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME PARENT1
## 1     1           0          0        0  60        0  11  $67,349      No
## 2     2           0          0        0  43        0  11  $91,449      No
## 3     4           0          0        0  35        1  10  $16,039      No
## 4     5           0          0        0  51        0  14               No
## 5     6           0          0        0  50        0  NA $114,986      No
## 6     7           1       2946        0  34        1  12 $125,301     Yes
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M           PhD  Professional       14    Private  $14,230
## 2 $257,252    z_No   M z_High School z_Blue Collar       22 Commercial  $14,940
## 3 $124,191     Yes z_F z_High School      Clerical        5    Private   $4,010
## 4 $306,251     Yes   M  <High School z_Blue Collar       32    Private  $15,440
## 5 $243,925     Yes z_F           PhD        Doctor       36    Private  $18,000
## 6       $0    z_No z_F     Bachelors z_Blue Collar       46 Commercial  $17,430
##   TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1  11    Minivan     yes   $4,461        2      No       3      18
## 2   1    Minivan     yes       $0        0      No       0       1
## 3   4      z_SUV      no  $38,690        2      No       3      10
## 4   7    Minivan     yes       $0        0      No       0       6
## 5   1      z_SUV      no  $19,217        2     Yes       3      17
## 6   1 Sports Car      no       $0        0      No       0       7
##            URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
dim(training)
## [1] 8161   26
summary(training)
##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ          INCOME         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##    PARENT1            HOME_VAL           MSTATUS              SEX           
##  Length:8161        Length:8161        Length:8161        Length:8161       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   EDUCATION             JOB               TRAVTIME        CAR_USE         
##  Length:8161        Length:8161        Min.   :  5.00   Length:8161       
##  Class :character   Class :character   1st Qu.: 22.00   Class :character  
##  Mode  :character   Mode  :character   Median : 33.00   Mode  :character  
##                                        Mean   : 33.49                     
##                                        3rd Qu.: 44.00                     
##                                        Max.   :142.00                     
##                                                                           
##    BLUEBOOK              TIF           CAR_TYPE           RED_CAR         
##  Length:8161        Min.   : 1.000   Length:8161        Length:8161       
##  Class :character   1st Qu.: 1.000   Class :character   Class :character  
##  Mode  :character   Median : 4.000   Mode  :character   Mode  :character  
##                     Mean   : 5.351                                        
##                     3rd Qu.: 7.000                                        
##                     Max.   :25.000                                        
##                                                                           
##    OLDCLAIM            CLM_FREQ        REVOKED             MVR_PTS      
##  Length:8161        Min.   :0.0000   Length:8161        Min.   : 0.000  
##  Class :character   1st Qu.:0.0000   Class :character   1st Qu.: 0.000  
##  Mode  :character   Median :0.0000   Mode  :character   Median : 1.000  
##                     Mean   :0.7986                      Mean   : 1.696  
##                     3rd Qu.:2.0000                      3rd Qu.: 3.000  
##                     Max.   :5.0000                      Max.   :13.000  
##                                                                         
##     CAR_AGE        URBANICITY       
##  Min.   :-3.000   Length:8161       
##  1st Qu.: 1.000   Class :character  
##  Median : 8.000   Mode  :character  
##  Mean   : 8.328                     
##  3rd Qu.:12.000                     
##  Max.   :28.000                     
##  NA's   :510
#Convert variables to numeric class

training$INCOME <- as.numeric(gsub('[$,]', '', training$INCOME))
training$HOME_VAL <- as.numeric(gsub('[$,]', '', training$HOME_VAL))
training$BLUEBOOK <- as.numeric(gsub('[$,]', '', training$BLUEBOOK))
training$OLDCLAIM <- as.numeric(gsub('[$,]', '', training$OLDCLAIM))

summary(training)
##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ           INCOME      
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Min.   :     0  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28097  
##  Median :45.00   Median :0.0000   Median :11.0   Median : 54028  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5   Mean   : 61898  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85986  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0   Max.   :367030  
##  NA's   :6                        NA's   :454    NA's   :445     
##    PARENT1             HOME_VAL        MSTATUS              SEX           
##  Length:8161        Min.   :     0   Length:8161        Length:8161       
##  Class :character   1st Qu.:     0   Class :character   Class :character  
##  Mode  :character   Median :161160   Mode  :character   Mode  :character  
##                     Mean   :154867                                        
##                     3rd Qu.:238724                                        
##                     Max.   :885282                                        
##                     NA's   :464                                           
##   EDUCATION             JOB               TRAVTIME        CAR_USE         
##  Length:8161        Length:8161        Min.   :  5.00   Length:8161       
##  Class :character   Class :character   1st Qu.: 22.00   Class :character  
##  Mode  :character   Mode  :character   Median : 33.00   Mode  :character  
##                                        Mean   : 33.49                     
##                                        3rd Qu.: 44.00                     
##                                        Max.   :142.00                     
##                                                                           
##     BLUEBOOK          TIF           CAR_TYPE           RED_CAR         
##  Min.   : 1500   Min.   : 1.000   Length:8161        Length:8161       
##  1st Qu.: 9280   1st Qu.: 1.000   Class :character   Class :character  
##  Median :14440   Median : 4.000   Mode  :character   Mode  :character  
##  Mean   :15710   Mean   : 5.351                                        
##  3rd Qu.:20850   3rd Qu.: 7.000                                        
##  Max.   :69740   Max.   :25.000                                        
##                                                                        
##     OLDCLAIM        CLM_FREQ        REVOKED             MVR_PTS      
##  Min.   :    0   Min.   :0.0000   Length:8161        Min.   : 0.000  
##  1st Qu.:    0   1st Qu.:0.0000   Class :character   1st Qu.: 0.000  
##  Median :    0   Median :0.0000   Mode  :character   Median : 1.000  
##  Mean   : 4037   Mean   :0.7986                      Mean   : 1.696  
##  3rd Qu.: 4636   3rd Qu.:2.0000                      3rd Qu.: 3.000  
##  Max.   :57037   Max.   :5.0000                      Max.   :13.000  
##                                                                      
##     CAR_AGE        URBANICITY       
##  Min.   :-3.000   Length:8161       
##  1st Qu.: 1.000   Class :character  
##  Median : 8.000   Mode  :character  
##  Mean   : 8.328                     
##  3rd Qu.:12.000                     
##  Max.   :28.000                     
##  NA's   :510
#Standard deviations of variables 

sapply(training[,c(3:8, 10, 15, 17:18, 21:22, 24:25)], sd)
##   TARGET_AMT     KIDSDRIV          AGE     HOMEKIDS          YOJ       INCOME 
## 4704.0269298    0.5115341           NA    1.1163233           NA           NA 
##     HOME_VAL     TRAVTIME     BLUEBOOK          TIF     OLDCLAIM     CLM_FREQ 
##           NA   15.9083334 8419.7340755    4.1466353 8777.1391044    1.1584527 
##      MVR_PTS      CAR_AGE 
##    2.1471117           NA
#Univariate plots using histograms, kernel density estimates and sorted data plotted against its index for the 14 numeric variables.

#par(mfrow=c(,10))

#If car was in a crash, what was the cost

hist(training$TARGET_AMT, xlab = "Cost if car was in a crash", main = "")

plot(density(training$TARGET_AMT, na.rm = TRUE), main = "")

plot(sort(training$TARGET_AMT), ylab = "Cost if car was in a crash")

boxplot(TARGET_AMT ~ TARGET_FLAG, training)

#Number of driving children

hist(training$KIDSDRIV, xlab = "Number of driving children", main = "")

plot(density(training$KIDSDRIV, na.rm = TRUE), main = "")

plot(sort(training$KIDSDRIV), ylab = "Number of driving children")

boxplot(KIDSDRIV ~ TARGET_FLAG, training)

#Age of driver

hist(training$AGE, xlab = "Age of driver", main = "")

plot(density(training$AGE, na.rm = TRUE), main = "")

plot(sort(training$AGE), ylab = "Age of driver")

boxplot(AGE ~ TARGET_FLAG, training)

#Number of children at home

hist(training$HOMEKIDS, xlab = "Number of children at home", main = "")

plot(density(training$HOMEKIDS, na.rm = TRUE), main = "")

plot(sort(training$HOMEKIDS), ylab = "Number of children at home")

boxplot(HOMEKIDS ~ TARGET_FLAG, training)

#Years on job

hist(training$YOJ, xlab = "Years on job", main = "")

plot(density(training$YOJ, na.rm = TRUE), main = "")

plot(sort(training$YOJ), ylab = "Years on job")

boxplot(YOJ ~ TARGET_FLAG, training)

#Income

hist(training$INCOME, xlab = "Income", main = "")

plot(density(training$INCOME, na.rm = TRUE), main = "")

plot(sort(training$INCOME), ylab = "Income")

boxplot(INCOME ~ TARGET_FLAG, training)

#Home Value

hist(training$HOME_VAL, xlab = "Home Value", main = "")

plot(density(training$HOME_VAL, na.rm = TRUE), main = "")

plot(sort(training$HOME_VAL), ylab = "Home Value")

boxplot(HOME_VAL ~ TARGET_FLAG, training)

#Travel Time

hist(training$TRAVTIME, xlab = "Travel Time", main = "")

plot(density(training$TRAVTIME, na.rm = TRUE), main = "")

plot(sort(training$TRAVTIME), ylab = "Travel Time")

boxplot(TRAVTIME ~ TARGET_FLAG, training)

#Bluebook value of vehicle

hist(training$BLUEBOOK, xlab = "Bluebook value of vehicle", main = "")

plot(density(training$BLUEBOOK, na.rm = TRUE), main = "")

plot(sort(training$BLUEBOOK), ylab = "Bluebook value of vehicle")

boxplot(BLUEBOOK ~ TARGET_FLAG, training)

#Time in Force

hist(training$TIF, xlab = "Time in Force", main = "")

plot(density(training$TIF, na.rm = TRUE), main = "")

plot(sort(training$TIF), ylab = "Time in Force")

boxplot(TIF ~ TARGET_FLAG, training)

#Old Claims

hist(training$OLDCLAIM, xlab = "Total Claims", main = "")

plot(density(training$OLDCLAIM, na.rm = TRUE), main = "")

plot(sort(training$OLDCLAIM), ylab = "Total Claims")

boxplot(OLDCLAIM ~ TARGET_FLAG, training)

#Number of claims

hist(training$CLM_FREQ, xlab = "Number of claims", main = "")

plot(density(training$CLM_FREQ, na.rm = TRUE), main = "")

plot(sort(training$CLM_FREQ), ylab = "Number of claims")

boxplot(CLM_FREQ ~ TARGET_FLAG, training)

#Motor Vehicle Record Points

hist(training$MVR_PTS, xlab = "Motor Vehicle Record Points", main = "")

plot(density(training$MVR_PTS, na.rm = TRUE), main = "")

plot(sort(training$MVR_PTS), ylab = "Motor Vehicle Record Points")

boxplot(MVR_PTS ~ TARGET_FLAG, training)

#Vehicle age

hist(training$CAR_AGE, xlab = "Vehicle age", main = "")

plot(density(training$CAR_AGE, na.rm = TRUE), main = "")

plot(sort(training$CAR_AGE), ylab = "Vehicle age")

boxplot(CAR_AGE ~ TARGET_FLAG, training)

#Instead of using scatterplots for each of the 15 variables against each other, I used the correlation matrix.

df_new <- (training[,c(2, 3:8, 10, 15, 17:18, 21:22, 24:25)])
p.mat <- cor(df_new, use = "na.or.complete")
corrplot(p.mat, method = 'number', type = 'lower', diag = FALSE, number.cex = 0.6, tl.cex = 0.6, cl.cex = 0.6)
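
The plotting code above repeats the same four calls for every variable. As a sketch of a more compact alternative, the hypothetical helper `plot_univariate()` below wraps those four plots and is called in a loop. The data frame `toy` is simulated so that the example runs on its own.

```r
# Hypothetical helper: draws the same four plots used above for one variable
plot_univariate <- function(data, var, label = var, flag = "TARGET_FLAG") {
  x <- data[[var]]
  hist(x, xlab = label, main = "")
  plot(density(x, na.rm = TRUE), main = "")
  plot(sort(x), ylab = label)
  boxplot(x ~ data[[flag]], xlab = flag, ylab = label)
  invisible(var)
}

# Made-up toy data so the sketch is self-contained
set.seed(1)
toy <- data.frame(
  TARGET_FLAG = rbinom(100, 1, 0.3),
  AGE         = round(rnorm(100, 45, 9)),
  TRAVTIME    = round(rexp(100, 1 / 30)) + 5
)

# Named vector: names are columns, values are axis labels
labels <- c(AGE = "Age of driver", TRAVTIME = "Travel Time")
for (v in names(labels)) plot_univariate(toy, v, labels[[v]])
```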

2. DATA PREPARATION (25 Points)

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.

  a. Fix missing values (maybe with a Mean or Median value)
  b. Create flags to suggest if a variable was missing
  c. Transform data by putting it into buckets
  d. Mathematical transforms such as log or square root (or use Box-Cox)
  e. Combine variables (such as ratios or adding or multiplying) to create new variables

DATA PREPARATION SOLUTION

As observed in the data exploration section, some variables have missing (NA) values: AGE has 6 (0.1%), YOJ has 454 (5.6%), INCOME has 445 (5.5%), HOME_VAL has 464 (5.7%) and CAR_AGE has 510 (6.2%). We fill the missing values for each of these variables with that variable's median.

Additionally, in the data exploration section, we observed that there are some variables such as INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM which should be numeric but have come through as characters and we converted such variables to numeric class to allow analysis.

We also calculate the variance inflation factor (VIF) of each variable, which measures how strongly a given predictor is correlated with the other predictors in a regression model (multicollinearity). A value > 5 indicates potentially severe correlation between a given predictor variable and other predictors in the model. Based on this threshold and the calculated scores below, we decide to eliminate three predictors from the model: EDUCATION (10.491987), JOB (23.948522) and CAR_TYPE (5.568248). Note that all three are multi-level factors, for which vif() reports a generalized VIF (GVIF); their adjusted values GVIF^(1/(2*Df)) are all below 2, so the raw GVIF is not directly comparable with the single-coefficient VIFs of the other predictors. Finally, we consider a log transformation of the positively skewed variables: Number of driving children (KIDSDRIV), Number of children at home (HOMEKIDS), Income (INCOME), Home Value (HOME_VAL), Travel Time (TRAVTIME), Bluebook value of vehicle (BLUEBOOK), Time in Force (TIF), Total Claims (OLDCLAIM), Motor Vehicle Record Points (MVR_PTS) and Vehicle Age (CAR_AGE).
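
To make the quantity concrete: for a single numeric predictor j, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes this by hand in base R on simulated predictors (`x1`, `x2`, `x3` are made up, with `x2` deliberately built from `x1`). It illustrates the formula and is not the calculation performed on the insurance data.

```r
# Simulated predictors: x2 is built to be strongly correlated with x1
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

In this simulation `x1` and `x2` should show inflated values while `x3` stays close to 1, since it is independent of the other two.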

The target variable TARGET_FLAG is binary, so a linear model is not appropriate for it: a linear model assumes approximately normally distributed errors with constant variance, whereas the variance of a binary response, p(1 - p), changes with its mean. Hence, in the next phase, we experiment with various binomial models.
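
The practical consequence can be seen in a small simulation. In the sketch below, `pts` is a made-up count predictor (not the real MVR_PTS) and the coefficients of the data-generating process are arbitrary assumptions. A linear fit can predict values outside the 0 to 1 range, while a logistic fit via `glm(..., family = binomial)` always returns probabilities, which can then be classified at the 0.5 threshold.

```r
# Simulated binary outcome; `pts` is a made-up predictor, not the real MVR_PTS
set.seed(7)
n   <- 1000
pts <- rpois(n, 2)
p   <- plogis(-2 + 0.6 * pts)          # assumed true crash probability
y   <- rbinom(n, 1, p)
sim <- data.frame(y, pts)

lin_fit <- lm(y ~ pts, data = sim)
log_fit <- glm(y ~ pts, data = sim, family = binomial)

# Predict for a low-risk and a very high-risk driver
new      <- data.frame(pts = c(0, 12))
lin_pred <- predict(lin_fit, new)
log_pred <- predict(log_fit, new, type = "response")
print(rbind(linear = lin_pred, logistic = log_pred))

# Classify the logistic probabilities at the 0.5 threshold
class_pred <- as.integer(log_pred >= 0.5)
print(class_pred)
```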

require(MASS)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
require(car)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
require(kableExtra)
## Loading required package: kableExtra
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
require(dplyr)

sapply(training, function(x) sum(is.na(x))) %>% kable() %>% kable_styling()
x
INDEX 0
TARGET_FLAG 0
TARGET_AMT 0
KIDSDRIV 0
AGE 6
HOMEKIDS 0
YOJ 454
INCOME 445
PARENT1 0
HOME_VAL 464
MSTATUS 0
SEX 0
EDUCATION 0
JOB 0
TRAVTIME 0
CAR_USE 0
BLUEBOOK 0
TIF 0
CAR_TYPE 0
RED_CAR 0
OLDCLAIM 0
CLM_FREQ 0
REVOKED 0
MVR_PTS 0
CAR_AGE 510
URBANICITY 0
#Fill missing values with the median of each variable

training <- training %>% mutate_at(vars(c("AGE", "YOJ", "INCOME", "HOME_VAL", "CAR_AGE")), ~ifelse(is.na(.), median(., na.rm = TRUE), .))

lmod <- lm(TARGET_FLAG ~., training)
summary(lmod)
## 
## Call:
## lm(formula = TARGET_FLAG ~ ., data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8474 -0.2217 -0.0888  0.1519  1.0779 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2.452e-01  4.199e-02   5.840 5.42e-09 ***
## INDEX                            3.646e-07  1.263e-06   0.289 0.772887    
## TARGET_AMT                       4.156e-05  8.273e-07  50.238  < 2e-16 ***
## KIDSDRIV                         4.743e-02  8.442e-03   5.619 1.98e-08 ***
## AGE                             -3.442e-04  5.271e-04  -0.653 0.513708    
## HOMEKIDS                         3.418e-03  4.873e-03   0.701 0.483136    
## YOJ                             -1.980e-03  1.125e-03  -1.760 0.078455 .  
## INCOME                          -1.932e-07  1.346e-07  -1.435 0.151445    
## PARENT1Yes                       5.249e-02  1.507e-02   3.483 0.000499 ***
## HOME_VAL                        -1.456e-07  4.405e-08  -3.306 0.000952 ***
## MSTATUSz_No                      4.626e-02  1.081e-02   4.279 1.90e-05 ***
## SEXz_F                           1.074e-03  1.371e-02   0.078 0.937531    
## EDUCATIONBachelors              -4.644e-02  1.527e-02  -3.041 0.002364 ** 
## EDUCATIONMasters                -3.584e-02  2.237e-02  -1.602 0.109112    
## EDUCATIONPhD                    -3.681e-02  2.654e-02  -1.387 0.165556    
## EDUCATIONz_High School           9.713e-03  1.282e-02   0.758 0.448709    
## JOBClerical                      7.181e-02  2.546e-02   2.820 0.004811 ** 
## JOBDoctor                       -1.804e-02  3.048e-02  -0.592 0.553944    
## JOBHome Maker                    5.856e-02  2.719e-02   2.154 0.031300 *  
## JOBLawyer                        1.731e-02  2.205e-02   0.785 0.432413    
## JOBManager                      -4.284e-02  2.151e-02  -1.991 0.046468 *  
## JOBProfessional                  3.017e-02  2.302e-02   1.310 0.190118    
## JOBStudent                       6.065e-02  2.788e-02   2.175 0.029668 *  
## JOBz_Blue Collar                 5.786e-02  2.400e-02   2.411 0.015938 *  
## TRAVTIME                         1.500e-03  2.405e-04   6.236 4.72e-10 ***
## CAR_USEPrivate                  -8.738e-02  1.228e-02  -7.118 1.19e-12 ***
## BLUEBOOK                        -3.225e-06  6.430e-07  -5.016 5.38e-07 ***
## TIF                             -5.928e-03  9.089e-04  -6.522 7.36e-11 ***
## CAR_TYPEPanel Truck              4.463e-02  2.074e-02   2.152 0.031442 *  
## CAR_TYPEPickup                   5.556e-02  1.273e-02   4.364 1.29e-05 ***
## CAR_TYPESports Car               1.007e-01  1.626e-02   6.193 6.18e-10 ***
## CAR_TYPEVan                      5.199e-02  1.591e-02   3.269 0.001084 ** 
## CAR_TYPEz_SUV                    7.245e-02  1.338e-02   5.413 6.36e-08 ***
## RED_CARyes                      -1.443e-03  1.111e-02  -0.130 0.896718    
## OLDCLAIM                        -1.946e-06  5.546e-07  -3.509 0.000451 ***
## CLM_FREQ                         2.654e-02  4.106e-03   6.464 1.08e-10 ***
## REVOKEDYes                       1.309e-01  1.295e-02  10.107  < 2e-16 ***
## MVR_PTS                          1.397e-02  1.938e-03   7.208 6.18e-13 ***
## CAR_AGE                          7.772e-04  9.538e-04   0.815 0.415213    
## URBANICITYz_Highly Rural/ Rural -2.285e-01  1.048e-02 -21.798  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3388 on 8121 degrees of freedom
## Multiple R-squared:  0.4118, Adjusted R-squared:  0.409 
## F-statistic: 145.8 on 39 and 8121 DF,  p-value: < 2.2e-16
vif(lmod)
##                 GVIF Df GVIF^(1/(2*Df))
## INDEX       1.006790  1        1.003389
## TARGET_AMT  1.076470  1        1.037531
## KIDSDRIV    1.325523  1        1.151313
## AGE         1.468762  1        1.211925
## HOMEKIDS    2.103905  1        1.450484
## YOJ         1.423873  1        1.193261
## INCOME      2.761398  1        1.661746
## PARENT1     1.849681  1        1.360030
## HOME_VAL    2.169271  1        1.472845
## MSTATUS     1.994638  1        1.412317
## SEX         3.321252  1        1.822430
## EDUCATION  10.491987  4        1.341551
## JOB        23.948522  8        1.219565
## TRAVTIME    1.040740  1        1.020166
## CAR_USE     2.500501  1        1.581297
## BLUEBOOK    2.083593  1        1.443466
## TIF         1.009773  1        1.004875
## CAR_TYPE    5.568248  5        1.187331
## RED_CAR     1.813024  1        1.346486
## OLDCLAIM    1.684062  1        1.297714
## CLM_FREQ    1.608148  1        1.268128
## REVOKED     1.281171  1        1.131888
## MVR_PTS     1.230716  1        1.109376
## CAR_AGE     1.970736  1        1.403829
## URBANICITY  1.271210  1        1.127480
trainingdata <- training

#Log transformation on number of driving children
trainingdata$KIDSDRIV <- log(trainingdata$KIDSDRIV + 1)

#Log transformation on number of children at home
trainingdata$HOMEKIDS <- log(trainingdata$HOMEKIDS + 1)

#Log transformation on income
trainingdata$INCOME <- log(trainingdata$INCOME + 1)

#Log transformation on home value
trainingdata$HOME_VAL <- log(trainingdata$HOME_VAL + 1)

#Log transformation on travel time
trainingdata$TRAVTIME <- log(trainingdata$TRAVTIME + 1)

#Log transformation on bluebook value of vehicle
trainingdata$BLUEBOOK <- log(trainingdata$BLUEBOOK + 1)

#Log transformation on time in Force
trainingdata$TIF <- log(trainingdata$TIF + 1)

#Log transformation on old claims
trainingdata$OLDCLAIM <- log(trainingdata$OLDCLAIM + 1)

#Log transformation on motor vehicle Record Points
trainingdata$MVR_PTS <- log(trainingdata$MVR_PTS + 1)

#Log transformation on vehicle age
trainingdata$CAR_AGE <- log(trainingdata$CAR_AGE + 1)
## Warning in log(trainingdata$CAR_AGE + 1): NaNs produced
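
The warning above arises because CAR_AGE has a minimum of -3, and log(-3 + 1) is undefined. One possible way to handle this, shown as an assumption rather than as the approach taken in this write-up, is to treat negative ages as data-entry errors, set them to missing, impute with the median of the valid values, and only then transform. The vector `car_age` below is made up for illustration.

```r
# Made-up vehicle ages including one invalid negative value and one NA
car_age <- c(18, 1, 10, -3, 7, NA)

# Assumption: a negative age is a data-entry error, so treat it as missing
car_age_clean <- ifelse(!is.na(car_age) & car_age < 0, NA, car_age)

# Impute with the median of the valid values, then transform
car_age_clean[is.na(car_age_clean)] <- median(car_age_clean, na.rm = TRUE)
car_age_log <- log1p(car_age_clean)   # log1p(x) computes log(x + 1)

print(car_age_log)
```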

3. BUILD MODELS (25 Points)

Using the training data set, build at least two different multiple linear regression models and three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion in the model or exclusion from the model, indicate why this was done.

Discuss the coefficients in the models, do they make sense? For example, if a person has a lot of traffic tickets, you would reasonably expect that person to have more car crashes. If the coefficient is negative (suggesting that the person is a safer driver), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

BUILD MODELS SOLUTION

For multiple linear regression model 1, we use all the predictor variables except INDEX and TARGET_FLAG to predict TARGET_AMT. KIDSDRIV, INCOME, PARENT1Yes, MSTATUSz_No, SEXz_F, TRAVTIME, CAR_USEPrivate, TIF, CAR_TYPE (all levels except Panel Truck), CLM_FREQ, REVOKEDYes, MVR_PTS, CAR_AGE and URBANICITYz_Highly Rural/ Rural have p-values below 0.05. The adjusted R-squared of this basic multiple linear regression model is 6.681%.

For multiple linear regression model 2, we use the following predictor variables - INCOME, YOJ, CLM_FREQ, MVR_PTS and BLUEBOOK. I selected these variables manually by looking at the correlated variable pairs from the data exploration section and keeping one variable from each of the most highly correlated pairs. Intuitively, one would expect drivers with higher incomes and more years on the job to drive more cautiously, resulting in fewer accidents and fewer claims, because these drivers are in a good financial position and do not want to jeopardize it. Conversely, drivers with a prior record of claims and/or existing motor vehicle record points are more likely to be involved in accidents, as these drivers have little to lose from driving rashly. In model 2, all of the selected predictor variables are significant except YOJ. However, the adjusted R-squared of this parsimonious multiple linear regression model is 2.567%, which is worse than the full model.

For multiple linear regression model 3, we use the variables that, according to the problem statement, should theoretically affect the amount claimed if a car is involved in a crash - AGE, CLM_FREQ, HOME_VAL, INCOME, JOB, KIDSDRIV, MSTATUS, MVR_PTS, OLDCLAIM, RED_CAR, REVOKED, SEX, TIF, TRAVTIME and YOJ. In model 3, all of the selected predictor variables are significant at the 0.05 level except AGE, HOME_VAL, JOBClerical, JOBHome Maker, JOBProfessional, JOBStudent, JOBz_Blue Collar, RED_CARyes, OLDCLAIM, SEXz_F, TRAVTIME and YOJ. The adjusted R-squared of this theoretical multiple linear regression model is only 4.224%.

I also ran a multiple linear regression model (model 8, fitted as lmod4 in the code) with the log-transformed independent variables, again excluding TARGET_FLAG and INDEX. This model has an adjusted R-squared of only 6.512%, which is better than models 2 and 3 but slightly worse than the baseline multiple linear regression model 1 (6.681%).
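
The adjusted R-squared comparison across the linear models can also be collected programmatically instead of being read off each summary. The sketch below uses the built-in mtcars data purely for illustration. The model names fit_full and fit_small are hypothetical and stand in for objects such as lmod1 and lmod2.

```r
# Illustration on the built-in mtcars data, not the insurance data.
fit_full  <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_small <- lm(mpg ~ wt, data = mtcars)

# Collect the adjusted R-squared of each candidate model in one named vector.
adj_r2 <- sapply(
  list(full = fit_full, small = fit_small),
  function(m) summary(m)$adj.r.squared
)
round(adj_r2, 4)
```

Keeping the values in one vector makes it easy to sort the candidate models or report them in a single table.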

Regarding the binary logistic regression models, we select a basic logit model for model 4 that includes all predictor variables in their untransformed state, excluding TARGET_AMT since this variable has non-zero values only when TARGET_FLAG is 1 (i.e. the car was in a crash). We also exclude the INDEX variable since it has no explanatory value. From this initial model, we observe that KIDSDRIV, INCOME, PARENT1Yes, HOME_VAL, MSTATUSz_No, EDUCATIONBachelors, JOBClerical, JOBManager, TRAVTIME, CAR_USEPrivate, BLUEBOOK, TIF, CAR_TYPE, OLDCLAIM, CLM_FREQ, REVOKEDYes, MVR_PTS and URBANICITY have significant p-values; many of these are the same variables that were found to be significant in the basic linear model 1. Unlike in model 1, CAR_AGE is not significant here (p = 0.886). To test the goodness of fit, we take as the null hypothesis that the model is correctly specified. We pass the residual deviance along with the residual degrees of freedom to pchisq and find that there is no evidence to reject the null hypothesis that the model fits for model 4 (p-value of 1). Deviance < (n - p): 7297.6 < (8161 - 38) = 8123.
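
The deviance check described above can be sketched without typing the deviance and degrees of freedom by hand, by reading both from the fitted object. The example uses the built-in mtcars data for illustration only. The object name fit is hypothetical and stands in for a model such as gmod1.

```r
# Illustration on the built-in mtcars data, not the insurance data.
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# Residual deviance and its degrees of freedom, taken from the fit itself.
dev <- deviance(fit)
df_resid <- df.residual(fit)

# Upper-tail chi-squared probability; a small value suggests lack of fit.
p_val <- pchisq(dev, df = df_resid, lower.tail = FALSE)
c(deviance = dev, df = df_resid, p_value = p_val)
```

Reading the values from the model avoids transcription errors. Note that for ungrouped 0/1 responses the chi-squared approximation behind this check is generally regarded as unreliable, so the result is best read together with other criteria such as AIC.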

For model 5, we fit a logit model using the variables from model 2. Here, however, there is evidence to reject the null hypothesis that the model fits (p-value of about 5.6e-06). Deviance > (n - p): 8728.1 > (8161 - 6) = 8155.

For model 6, we use the variables that, according to the problem statement, should theoretically affect the probability of a crash and the amount claimed if a car is involved in a crash - AGE, CLM_FREQ, HOME_VAL, INCOME, JOB, KIDSDRIV, MSTATUS, MVR_PTS, OLDCLAIM, RED_CAR, REVOKED, SEX, TIF, TRAVTIME and YOJ. As with model 5, the residual deviance exceeds the residual degrees of freedom, 8176.4 > (8161 - 23) = 8138, but only slightly: the pchisq p-value of about 0.38 is not small enough to reject the null hypothesis that the model fits at conventional significance levels. AIC increases to 8740.1 for model 5 and to 8222.4 for model 6, from 7373.6 for model 4.

Finally, for model 7, we use the log-transformed variables developed in the data preparation phase above and exclude TARGET_AMT and INDEX, as we did with model 4, our baseline logit model. From this model, which uses transformed values for the positively skewed variables, we observe that KIDSDRIV, INCOME, PARENT1Yes, HOME_VAL, MSTATUSz_No, EDUCATIONBachelors, EDUCATIONMasters, JOBClerical, JOBManager, JOBz_Blue Collar, TRAVTIME, CAR_USEPrivate, BLUEBOOK, TIF, CAR_TYPE, OLDCLAIM, CLM_FREQ, REVOKEDYes, MVR_PTS and URBANICITY have significant p-values. Many of these are the same variables that were found to be significant in the basic logit model 4 and in the basic linear model 1. We again pass the residual deviance along with the residual degrees of freedom to pchisq and, as with baseline logit model 4, find no evidence to reject the null hypothesis that the model fits for model 7. Furthermore, the AIC improves slightly to 7363.1 from the baseline logit model 4's value of 7373.6.
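
The AIC comparison between the logit models can be sketched as a single table. The example below uses the built-in mtcars data for illustration only. The names fit_a and fit_b are hypothetical and stand in for objects such as gmod1 and gmod4.

```r
# Illustration on the built-in mtcars data, not the insurance data.
fit_a <- glm(am ~ wt, family = binomial, data = mtcars)
fit_b <- glm(am ~ wt + hp, family = binomial, data = mtcars)

# AIC() accepts several fitted models and returns one row per model.
aic_table <- AIC(fit_a, fit_b)

# Lower AIC is preferred, so sort ascending.
aic_table[order(aic_table$AIC), ]
```

AIC values are strictly comparable only when the models are fitted to the same observations. In this report, model 7 drops one observation due to missingness while model 4 uses all of them, so the small AIC difference between the two should be read with that in mind.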

We decide to use model 7 as our final logit model and model 1 as our baseline multiple linear regression model for evaluation purposes.

require(MASS)
require(car)
require(kableExtra)
require(dplyr)

##Model 1:

lmod1 <- lm(TARGET_AMT ~ .-INDEX - TARGET_FLAG, training)
summary(lmod1)
## 
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5891  -1698   -760    344 103785 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.065e+03  5.572e+02   1.912  0.05595 .  
## KIDSDRIV                         3.142e+02  1.132e+02   2.777  0.00551 ** 
## AGE                              5.220e+00  7.066e+00   0.739  0.46005    
## HOMEKIDS                         7.764e+01  6.535e+01   1.188  0.23487    
## YOJ                             -3.952e+00  1.509e+01  -0.262  0.79336    
## INCOME                          -4.412e-03  1.805e-03  -2.444  0.01453 *  
## PARENT1Yes                       5.762e+02  2.020e+02   2.852  0.00435 ** 
## HOME_VAL                        -5.580e-04  5.908e-04  -0.945  0.34492    
## MSTATUSz_No                      5.701e+02  1.449e+02   3.935 8.38e-05 ***
## SEXz_F                          -3.688e+02  1.838e+02  -2.007  0.04481 *  
## EDUCATIONBachelors              -2.590e+02  2.048e+02  -1.265  0.20598    
## EDUCATIONMasters                 2.347e+01  2.999e+02   0.078  0.93764    
## EDUCATIONPhD                     2.863e+02  3.559e+02   0.804  0.42127    
## EDUCATIONz_High School          -8.888e+01  1.719e+02  -0.517  0.60520    
## JOBClerical                      5.293e+02  3.414e+02   1.550  0.12110    
## JOBDoctor                       -4.997e+02  4.087e+02  -1.223  0.22151    
## JOBHome Maker                    3.524e+02  3.646e+02   0.967  0.33380    
## JOBLawyer                        2.308e+02  2.956e+02   0.781  0.43498    
## JOBManager                      -4.787e+02  2.885e+02  -1.660  0.09704 .  
## JOBProfessional                  4.565e+02  3.088e+02   1.478  0.13936    
## JOBStudent                       2.876e+02  3.740e+02   0.769  0.44180    
## JOBz_Blue Collar                 5.077e+02  3.218e+02   1.577  0.11473    
## TRAVTIME                         1.195e+01  3.222e+00   3.709  0.00021 ***
## CAR_USEPrivate                  -7.798e+02  1.644e+02  -4.743 2.14e-06 ***
## BLUEBOOK                         1.433e-02  8.623e-03   1.662  0.09647 .  
## TIF                             -4.820e+01  1.218e+01  -3.958 7.63e-05 ***
## CAR_TYPEPanel Truck              2.648e+02  2.782e+02   0.952  0.34112    
## CAR_TYPEPickup                   3.754e+02  1.707e+02   2.200  0.02786 *  
## CAR_TYPESports Car               1.022e+03  2.178e+02   4.693 2.74e-06 ***
## CAR_TYPEVan                      5.145e+02  2.132e+02   2.413  0.01584 *  
## CAR_TYPEz_SUV                    7.518e+02  1.793e+02   4.193 2.78e-05 ***
## RED_CARyes                      -4.820e+01  1.491e+02  -0.323  0.74641    
## OLDCLAIM                        -1.057e-02  7.436e-03  -1.421  0.15527    
## CLM_FREQ                         1.417e+02  5.503e+01   2.575  0.01005 *  
## REVOKEDYes                       5.494e+02  1.735e+02   3.166  0.00155 ** 
## MVR_PTS                          1.753e+02  2.592e+01   6.765 1.43e-11 ***
## CAR_AGE                         -2.681e+01  1.279e+01  -2.097  0.03606 *  
## URBANICITYz_Highly Rural/ Rural -1.665e+03  1.394e+02 -11.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4544 on 8123 degrees of freedom
## Multiple R-squared:  0.07104,    Adjusted R-squared:  0.06681 
## F-statistic: 16.79 on 37 and 8123 DF,  p-value: < 2.2e-16
##Model 2:

lmod2 <- lm(TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + BLUEBOOK, training)
summary(lmod2)
## 
## Call:
## lm(formula = TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + 
##     BLUEBOOK, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4328  -1519   -985   -361 104225 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.037e+03  1.735e+02   5.978 2.35e-09 ***
## INCOME      -5.763e-03  1.259e-03  -4.576 4.81e-06 ***
## YOJ         -3.854e+00  1.343e+01  -0.287   0.7741    
## CLM_FREQ     2.939e+02  4.836e+01   6.078 1.28e-09 ***
## MVR_PTS      2.336e+02  2.611e+01   8.947  < 2e-16 ***
## BLUEBOOK     1.471e-02  6.728e-03   2.186   0.0288 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4643 on 8155 degrees of freedom
## Multiple R-squared:  0.02627,    Adjusted R-squared:  0.02567 
## F-statistic:    44 on 5 and 8155 DF,  p-value: < 2.2e-16
##Model 3:

lmod3 <- lm(TARGET_AMT ~ AGE + CLM_FREQ + HOME_VAL + INCOME + JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + REVOKED + SEX + TIF + TRAVTIME + YOJ, training)
summary(lmod3)
## 
## Call:
## lm(formula = TARGET_AMT ~ AGE + CLM_FREQ + HOME_VAL + INCOME + 
##     JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + 
##     REVOKED + SEX + TIF + TRAVTIME + YOJ, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4892  -1622   -885     73 104539 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.410e+03  4.578e+02   3.080 0.002076 ** 
## AGE              -2.536e+00  6.271e+00  -0.404 0.685992    
## CLM_FREQ          2.915e+02  5.456e+01   5.343 9.39e-08 ***
## HOME_VAL         -3.655e-04  5.966e-04  -0.613 0.540190    
## INCOME           -3.512e-03  1.698e-03  -2.068 0.038685 *  
## JOBClerical      -4.402e+02  2.731e+02  -1.612 0.107013    
## JOBDoctor        -1.025e+03  3.599e+02  -2.847 0.004428 ** 
## JOBHome Maker    -6.426e+02  3.299e+02  -1.948 0.051497 .  
## JOBLawyer        -6.608e+02  2.645e+02  -2.498 0.012517 *  
## JOBManager       -1.024e+03  2.554e+02  -4.010 6.13e-05 ***
## JOBProfessional  -4.283e+02  2.536e+02  -1.689 0.091353 .  
## JOBStudent       -3.898e+02  3.211e+02  -1.214 0.224887    
## JOBz_Blue Collar  3.704e+01  2.470e+02   0.150 0.880802    
## KIDSDRIV          4.057e+02  1.008e+02   4.026 5.73e-05 ***
## MSTATUSz_No       7.310e+02  1.271e+02   5.752 9.13e-09 ***
## MVR_PTS           2.119e+02  2.612e+01   8.114 5.60e-16 ***
## OLDCLAIM         -1.158e-02  7.524e-03  -1.538 0.123984    
## RED_CARyes       -5.204e+01  1.508e+02  -0.345 0.730041    
## REVOKEDYes        7.558e+02  1.751e+02   4.317 1.60e-05 ***
## SEXz_F           -1.174e+02  1.412e+02  -0.832 0.405685    
## TIF              -4.541e+01  1.233e+01  -3.684 0.000231 ***
## TRAVTIME          5.846e+00  3.224e+00   1.813 0.069861 .  
## YOJ               3.011e+00  1.492e+01   0.202 0.840115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4604 on 8138 degrees of freedom
## Multiple R-squared:  0.04482,    Adjusted R-squared:  0.04224 
## F-statistic: 17.36 on 22 and 8138 DF,  p-value: < 2.2e-16
## Model 8:

lmod4 <- lm(TARGET_AMT ~ .- INDEX - TARGET_FLAG, data = trainingdata)
summary(lmod4)
## 
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = trainingdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5524  -1692   -767    364 104245 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -381.997   1134.865  -0.337 0.736426    
## KIDSDRIV                          594.482    203.775   2.917 0.003540 ** 
## AGE                                 4.778      7.205   0.663 0.507297    
## HOMEKIDS                          169.279    145.208   1.166 0.243742    
## YOJ                                11.914     18.547   0.642 0.520633    
## INCOME                            -60.141     30.622  -1.964 0.049569 *  
## PARENT1Yes                        535.818    209.094   2.563 0.010408 *  
## HOME_VAL                          -14.259     12.554  -1.136 0.256058    
## MSTATUSz_No                       573.026    151.273   3.788 0.000153 ***
## SEXz_F                           -373.629    179.387  -2.083 0.037299 *  
## EDUCATIONBachelors               -354.551    203.742  -1.740 0.081862 .  
## EDUCATIONMasters                 -188.315    287.684  -0.655 0.512750    
## EDUCATIONPhD                      -65.879    334.482  -0.197 0.843866    
## EDUCATIONz_High School           -119.226    171.815  -0.694 0.487752    
## JOBClerical                       654.016    336.893   1.941 0.052255 .  
## JOBDoctor                        -490.708    409.119  -1.199 0.230397    
## JOBHome Maker                     426.700    370.213   1.153 0.249116    
## JOBLawyer                         284.819    295.306   0.964 0.334830    
## JOBManager                       -454.536    288.345  -1.576 0.114981    
## JOBProfessional                   507.414    308.363   1.646 0.099904 .  
## JOBStudent                        278.878    389.861   0.715 0.474427    
## JOBz_Blue Collar                  569.229    320.905   1.774 0.076130 .  
## TRAVTIME                          338.622     89.181   3.797 0.000148 ***
## CAR_USEPrivate                   -786.794    164.620  -4.779 1.79e-06 ***
## BLUEBOOK                          148.802    102.670   1.449 0.147286    
## TIF                              -288.216     71.635  -4.023 5.79e-05 ***
## CAR_TYPEPanel Truck               278.092    263.019   1.057 0.290402    
## CAR_TYPEPickup                    385.536    170.576   2.260 0.023835 *  
## CAR_TYPESports Car               1030.863    216.038   4.772 1.86e-06 ***
## CAR_TYPEVan                       503.431    212.436   2.370 0.017821 *  
## CAR_TYPEz_SUV                     758.477    174.236   4.353 1.36e-05 ***
## RED_CARyes                        -47.304    149.203  -0.317 0.751220    
## OLDCLAIM                           21.153     24.008   0.881 0.378306    
## CLM_FREQ                           67.172     85.722   0.784 0.433296    
## REVOKEDYes                        425.330    156.851   2.712 0.006709 ** 
## MVR_PTS                           421.208     76.679   5.493 4.07e-08 ***
## CAR_AGE                          -143.949     82.247  -1.750 0.080121 .  
## URBANICITYz_Highly Rural/ Rural -1640.568    139.941 -11.723  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4549 on 8122 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.06936,    Adjusted R-squared:  0.06512 
## F-statistic: 16.36 on 37 and 8122 DF,  p-value: < 2.2e-16
##Model 4:

gmod1 <- glm(formula = TARGET_FLAG ~ .- INDEX - TARGET_AMT, family = binomial(link = "logit"), data = training)
summary(gmod1)
## 
## Call:
## glm(formula = TARGET_FLAG ~ . - INDEX - TARGET_AMT, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5850  -0.7127  -0.3983   0.6264   3.1526  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -9.274e-01  3.215e-01  -2.885 0.003916 ** 
## KIDSDRIV                         3.862e-01  6.122e-02   6.308 2.82e-10 ***
## AGE                             -1.013e-03  4.020e-03  -0.252 0.801103    
## HOMEKIDS                         4.965e-02  3.713e-02   1.337 0.181171    
## YOJ                             -1.106e-02  8.582e-03  -1.289 0.197521    
## INCOME                          -3.421e-06  1.082e-06  -3.163 0.001562 ** 
## PARENT1Yes                       3.820e-01  1.096e-01   3.485 0.000492 ***
## HOME_VAL                        -1.307e-06  3.420e-07  -3.821 0.000133 ***
## MSTATUSz_No                      4.938e-01  8.358e-02   5.908 3.46e-09 ***
## SEXz_F                          -8.245e-02  1.120e-01  -0.736 0.461749    
## EDUCATIONBachelors              -3.794e-01  1.156e-01  -3.281 0.001034 ** 
## EDUCATIONMasters                -2.867e-01  1.787e-01  -1.604 0.108742    
## EDUCATIONPhD                    -1.641e-01  2.139e-01  -0.767 0.442943    
## EDUCATIONz_High School           1.799e-02  9.505e-02   0.189 0.849879    
## JOBClerical                      4.108e-01  1.967e-01   2.089 0.036709 *  
## JOBDoctor                       -4.457e-01  2.671e-01  -1.669 0.095163 .  
## JOBHome Maker                    2.325e-01  2.102e-01   1.106 0.268573    
## JOBLawyer                        1.050e-01  1.695e-01   0.619 0.535705    
## JOBManager                      -5.572e-01  1.715e-01  -3.248 0.001163 ** 
## JOBProfessional                  1.619e-01  1.784e-01   0.908 0.364065    
## JOBStudent                       2.163e-01  2.145e-01   1.008 0.313345    
## JOBz_Blue Collar                 3.107e-01  1.856e-01   1.674 0.094041 .  
## TRAVTIME                         1.457e-02  1.883e-03   7.736 1.03e-14 ***
## CAR_USEPrivate                  -7.564e-01  9.172e-02  -8.247  < 2e-16 ***
## BLUEBOOK                        -2.084e-05  5.263e-06  -3.960 7.50e-05 ***
## TIF                             -5.546e-02  7.344e-03  -7.553 4.27e-14 ***
## CAR_TYPEPanel Truck              5.607e-01  1.618e-01   3.466 0.000529 ***
## CAR_TYPEPickup                   5.540e-01  1.007e-01   5.500 3.81e-08 ***
## CAR_TYPESports Car               1.025e+00  1.299e-01   7.892 2.97e-15 ***
## CAR_TYPEVan                      6.185e-01  1.265e-01   4.890 1.01e-06 ***
## CAR_TYPEz_SUV                    7.682e-01  1.113e-01   6.904 5.06e-12 ***
## RED_CARyes                      -9.692e-03  8.636e-02  -0.112 0.910644    
## OLDCLAIM                        -1.389e-05  3.910e-06  -3.554 0.000380 ***
## CLM_FREQ                         1.960e-01  2.855e-02   6.865 6.66e-12 ***
## REVOKEDYes                       8.874e-01  9.133e-02   9.716  < 2e-16 ***
## MVR_PTS                          1.133e-01  1.361e-02   8.324  < 2e-16 ***
## CAR_AGE                         -1.080e-03  7.541e-03  -0.143 0.886064    
## URBANICITYz_Highly Rural/ Rural -2.390e+00  1.128e-01 -21.181  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7297.6  on 8123  degrees of freedom
## AIC: 7373.6
## 
## Number of Fisher Scoring iterations: 5
df <- 8123
deviance <- 7297.6
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 1
## Model 5:

gmod2 <- glm(formula = TARGET_FLAG ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + BLUEBOOK, family = binomial(link = "logit"), data = training)
summary(gmod2)
## 
## Call:
## glm(formula = TARGET_FLAG ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + 
##     BLUEBOOK, family = binomial(link = "logit"), data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7619  -0.7682  -0.6198   1.0140   2.6328  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.852e-01  8.506e-02 -10.407  < 2e-16 ***
## INCOME      -5.815e-06  7.193e-07  -8.084 6.25e-16 ***
## YOJ         -1.166e-02  6.613e-03  -1.763   0.0779 .  
## CLM_FREQ     2.859e-01  2.282e-02  12.525  < 2e-16 ***
## MVR_PTS      1.517e-01  1.229e-02  12.338  < 2e-16 ***
## BLUEBOOK    -1.538e-05  3.572e-06  -4.307 1.66e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 8728.1  on 8155  degrees of freedom
## AIC: 8740.1
## 
## Number of Fisher Scoring iterations: 4
df <- 8155
deviance <- 8728.1
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 5.624213e-06
## Model 6:

gmod3 <- glm(formula = TARGET_FLAG ~ AGE + CLM_FREQ + HOME_VAL + INCOME + JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + REVOKED + SEX + TIF + TRAVTIME + YOJ, family = binomial(link = "logit"), data = training)
summary(gmod3)
## 
## Call:
## glm(formula = TARGET_FLAG ~ AGE + CLM_FREQ + HOME_VAL + INCOME + 
##     JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + 
##     REVOKED + SEX + TIF + TRAVTIME + YOJ, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0416  -0.7563  -0.5406   0.8256   2.7859  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -7.656e-01  2.444e-01  -3.133 0.001732 ** 
## AGE              -8.852e-03  3.306e-03  -2.677 0.007423 ** 
## CLM_FREQ          3.263e-01  2.688e-02  12.140  < 2e-16 ***
## HOME_VAL         -1.252e-06  3.259e-07  -3.841 0.000123 ***
## INCOME           -4.414e-06  9.788e-07  -4.510 6.49e-06 ***
## JOBClerical      -2.729e-01  1.455e-01  -1.875 0.060747 .  
## JOBDoctor        -9.512e-01  2.384e-01  -3.990 6.59e-05 ***
## JOBHome Maker    -4.717e-01  1.769e-01  -2.666 0.007686 ** 
## JOBLawyer        -5.707e-01  1.482e-01  -3.852 0.000117 ***
## JOBManager       -9.045e-01  1.488e-01  -6.078 1.22e-09 ***
## JOBProfessional  -4.326e-01  1.374e-01  -3.148 0.001645 ** 
## JOBStudent       -2.297e-01  1.693e-01  -1.356 0.174998    
## JOBz_Blue Collar  1.137e-01  1.303e-01   0.872 0.382955    
## KIDSDRIV          3.543e-01  4.921e-02   7.198 6.09e-13 ***
## MSTATUSz_No       4.886e-01  6.654e-02   7.343 2.09e-13 ***
## MVR_PTS           1.418e-01  1.286e-02  11.033  < 2e-16 ***
## OLDCLAIM         -1.421e-05  3.722e-06  -3.819 0.000134 ***
## RED_CARyes       -1.455e-02  8.127e-02  -0.179 0.857874    
## REVOKEDYes        1.004e+00  8.557e-02  11.731  < 2e-16 ***
## SEXz_F            6.642e-02  7.593e-02   0.875 0.381743    
## TIF              -4.768e-02  6.928e-03  -6.882 5.90e-12 ***
## TRAVTIME          6.000e-03  1.704e-03   3.522 0.000429 ***
## YOJ              -5.916e-03  7.775e-03  -0.761 0.446749    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 8176.4  on 8138  degrees of freedom
## AIC: 8222.4
## 
## Number of Fisher Scoring iterations: 5
df <- 8138
deviance <- 8176.4
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 0.3798994
## Model 7:

gmod4 <- glm(formula = TARGET_FLAG ~ .- INDEX - TARGET_AMT, family = binomial(link = "logit"), data = trainingdata)
summary(gmod4)
## 
## Call:
## glm(formula = TARGET_FLAG ~ . - INDEX - TARGET_AMT, family = binomial(link = "logit"), 
##     data = trainingdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3811  -0.7113  -0.4017   0.6225   3.1486  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      1.464062   0.649109   2.255 0.024102 *  
## KIDSDRIV                         0.707013   0.110763   6.383 1.74e-10 ***
## AGE                             -0.002552   0.004108  -0.621 0.534481    
## HOMEKIDS                         0.096804   0.083740   1.156 0.247677    
## YOJ                              0.018835   0.010844   1.737 0.082410 .  
## INCOME                          -0.093005   0.017662  -5.266 1.40e-07 ***
## PARENT1Yes                       0.333787   0.114598   2.913 0.003583 ** 
## HOME_VAL                        -0.029651   0.006898  -4.298 1.72e-05 ***
## MSTATUSz_No                      0.496122   0.087959   5.640 1.70e-08 ***
## SEXz_F                          -0.121475   0.108365  -1.121 0.262294    
## EDUCATIONBachelors              -0.405371   0.114887  -3.528 0.000418 ***
## EDUCATIONMasters                -0.346076   0.171257  -2.021 0.043301 *  
## EDUCATIONPhD                    -0.376881   0.201891  -1.867 0.061935 .  
## EDUCATIONz_High School           0.030946   0.095237   0.325 0.745228    
## JOBClerical                      0.524226   0.193680   2.707 0.006796 ** 
## JOBDoctor                       -0.386625   0.264350  -1.463 0.143590    
## JOBHome Maker                    0.130236   0.216936   0.600 0.548277    
## JOBLawyer                        0.174534   0.168488   1.036 0.300255    
## JOBManager                      -0.516765   0.170062  -3.039 0.002376 ** 
## JOBProfessional                  0.217652   0.177501   1.226 0.220125    
## JOBStudent                      -0.009498   0.223017  -0.043 0.966028    
## JOBz_Blue Collar                 0.391275   0.184633   2.119 0.034073 *  
## TRAVTIME                         0.429713   0.054342   7.908 2.63e-15 ***
## CAR_USEPrivate                  -0.759515   0.092103  -8.246  < 2e-16 ***
## BLUEBOOK                        -0.310406   0.058910  -5.269 1.37e-07 ***
## TIF                             -0.315921   0.041411  -7.629 2.37e-14 ***
## CAR_TYPEPanel Truck              0.480174   0.150107   3.199 0.001380 ** 
## CAR_TYPEPickup                   0.584159   0.100635   5.805 6.45e-09 ***
## CAR_TYPESports Car               1.041343   0.128113   8.128 4.35e-16 ***
## CAR_TYPEVan                      0.619988   0.125206   4.952 7.36e-07 ***
## CAR_TYPEz_SUV                    0.816889   0.107738   7.582 3.40e-14 ***
## RED_CARyes                      -0.013767   0.086307  -0.160 0.873264    
## OLDCLAIM                         0.026184   0.012384   2.114 0.034484 *  
## CLM_FREQ                         0.086246   0.043417   1.986 0.046980 *  
## REVOKEDYes                       0.698831   0.081492   8.575  < 2e-16 ***
## MVR_PTS                          0.281083   0.042153   6.668 2.59e-11 ***
## CAR_AGE                         -0.020588   0.046974  -0.438 0.661188    
## URBANICITYz_Highly Rural/ Rural -2.374662   0.113025 -21.010  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9415.3  on 8159  degrees of freedom
## Residual deviance: 7287.1  on 8122  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 7363.1
## 
## Number of Fisher Scoring iterations: 5
df <- 8122
deviance <- 7287.1
p_val <- pchisq(deviance, df = df, lower.tail = FALSE); p_val
## [1] 1

4. SELECT MODELS (25 Points)

Decide on the criteria for selecting the best multiple linear regression model and the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.

For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots.

For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.

SELECT MODELS SOLUTION:

We run a couple of diagnostics on the selected multiple linear regression model 1 as well as on the other linear regression models devised in the prior step.

Residual vs Fitted Plots

The residual vs fitted plots for the baseline full model, as well as for the three other models (reduced models 2 and 3, and transformed model 8, fitted as lmod4), show that the residuals are not homoscedastic.

Normality using Q-Q plots

Next, we check the residuals for normality using Q-Q plots. In the full as well as the reduced models, the residuals do not follow the reference line even approximately, so they are not normally distributed. The skew in the Q-Q plots persists across the full and reduced models.

Identify leverage points using half-normal plots

Two index values 2904 and 5730 are identified as leverage points in the full model (model 1).
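
As a numerical complement to the half-normal plot, leverages can be inspected directly. The sketch below uses the built-in mtcars data for illustration only. The 2p/n cut-off is a common rule of thumb and an assumption here, not a threshold used in this report.

```r
# Illustration on the built-in mtcars data, not the insurance data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)      # leverage of each observation
p <- length(coef(fit))   # number of estimated coefficients
n <- nrow(mtcars)

# The leverages always sum to the number of coefficients.
all.equal(sum(h), p)

# Flag observations whose leverage exceeds the 2p/n rule of thumb.
high <- names(h)[h > 2 * p / n]
high
```

The same identity explains why sum(hatv1) returns 38 for model 1, which has 38 estimated coefficients including the intercept.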

Identify influential points using Cook’s distance

We identify influential points in the full, reduced and transformed models. Removing the most influential point in each model does not change the summary statistics meaningfully. Model 1 adjusted R-squared improves slightly to 6.897% from 6.681%. Model 2 adjusted R-squared improves slightly to 2.578% from 2.567%. Model 3 adjusted R-squared declines slightly to 4.221% from 4.224%, and model 8 (lmod4) adjusted R-squared improves slightly to 6.513% from 6.512%.
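
The drop-and-refit step described above can be sketched as follows, using the built-in mtcars data for illustration only. The object names are hypothetical.

```r
# Illustration on the built-in mtcars data, not the insurance data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation, and the name of the largest one.
cd <- cooks.distance(fit)
worst <- names(which.max(cd))

# Refit without the single most influential observation.
fit_drop <- lm(mpg ~ wt + hp, data = mtcars, subset = cd < max(cd))

# Compare adjusted R-squared before and after the removal.
c(before = summary(fit)$adj.r.squared,
  after  = summary(fit_drop)$adj.r.squared)
```

The subset vector lines up with the rows of the data only when no rows were dropped for missing values in the original fit, so this pattern is safest on complete data.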

Based on the diagnostic plots and the summary analysis, I decide to still use model 1, refitted after removing the most influential point, as my predictive model. It has an F-statistic of ~17 and an adjusted R-squared of ~7%, i.e. it explains about 7% of the variance in TARGET_AMT. The coefficients are quite similar before and after the removal.

This model can be written as:

TARGET_AMT = 1059 + (320.2 * KIDSDRIV) + (4.102 * AGE) + (82.87 * HOMEKIDS) + (-3.657 * YOJ) + (-0.004479 * INCOME) + (555.5 * PARENT1Yes) + (-0.0005192 * HOME_VAL) + ….
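
When writing a fitted model as an equation, the numbers must come from the Estimate column of the summary, not from the Pr(>|t|) column. A minimal sketch of building the equation text directly from the coefficients, using the built-in mtcars data for illustration only:

```r
# Illustration on the built-in mtcars data, not the insurance data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# coef() returns the estimates (the Estimate column), not the p-values.
est <- coef(fit)

# One "(estimate * name)" piece per predictor, rounded to 4 significant digits.
pieces <- paste0("(", signif(est[-1], 4), " * ", names(est)[-1], ")")

# Intercept first, then the predictor terms joined by " + ".
equation <- paste("mpg =", signif(est[1], 4), "+",
                  paste(pieces, collapse = " + "))
equation
```

Generating the text from coef() keeps the written equation in sync with the model if it is refitted.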

require(faraway)
## Loading required package: faraway
## 
## Attaching package: 'faraway'
## The following objects are masked from 'package:car':
## 
##     logit, vif
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5     ✓ readr   1.3.1
## ✓ tibble  3.0.3     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()          masks stats::filter()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x dplyr::lag()             masks stats::lag()
## x car::recode()            masks dplyr::recode()
## x MASS::select()           masks dplyr::select()
## x purrr::some()            masks car::some()
# Model 1:
par(mfrow = c(2, 2))
plot(lmod1)

#Plot of successive pairs of residuals to check for serial correlation
n1 <- length(residuals(lmod1))
plot(tail(residuals(lmod1), n1-1) ~ head(residuals(lmod1), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv1 <- hatvalues(lmod1)
sum(hatv1)
## [1] 38
index <- row.names(training)
halfnorm(hatv1, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf1 <- cooks.distance(lmod1)
halfnorm(cookf1, 3, labs = index, ylab = "Cook's distances")

## Eliminating and re-running the full model
lmodf1 <- lm(TARGET_AMT ~ .-INDEX - TARGET_FLAG, training, subset = (cookf1 < max(cookf1)))
summary(lmodf1)
## 
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = training, 
##     subset = (cookf1 < max(cookf1)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5804  -1680   -759    340  83455 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.059e+03  5.389e+02   1.965 0.049466 *  
## KIDSDRIV                         3.202e+02  1.095e+02   2.926 0.003448 ** 
## AGE                              4.102e+00  6.834e+00   0.600 0.548381    
## HOMEKIDS                         8.287e+01  6.321e+01   1.311 0.189934    
## YOJ                             -3.657e+00  1.459e+01  -0.251 0.802093    
## INCOME                          -4.479e-03  1.746e-03  -2.565 0.010331 *  
## PARENT1Yes                       5.555e+02  1.954e+02   2.843 0.004485 ** 
## HOME_VAL                        -5.192e-04  5.714e-04  -0.909 0.363614    
## MSTATUSz_No                      6.085e+02  1.401e+02   4.343 1.42e-05 ***
## SEXz_F                          -3.902e+02  1.778e+02  -2.195 0.028205 *  
## EDUCATIONBachelors              -2.768e+02  1.981e+02  -1.398 0.162294    
## EDUCATIONMasters                 2.381e+01  2.901e+02   0.082 0.934583    
## EDUCATIONPhD                     2.827e+02  3.443e+02   0.821 0.411560    
## EDUCATIONz_High School          -6.530e+01  1.663e+02  -0.393 0.694561    
## JOBClerical                      5.083e+02  3.302e+02   1.539 0.123820    
## JOBDoctor                       -5.464e+02  3.954e+02  -1.382 0.167001    
## JOBHome Maker                    3.187e+02  3.527e+02   0.903 0.366293    
## JOBLawyer                        1.760e+02  2.860e+02   0.615 0.538327    
## JOBManager                      -5.056e+02  2.790e+02  -1.812 0.070017 .  
## JOBProfessional                  3.540e+02  2.987e+02   1.185 0.236024    
## JOBStudent                       2.939e+02  3.617e+02   0.813 0.416489    
## JOBz_Blue Collar                 5.327e+02  3.113e+02   1.711 0.087117 .  
## TRAVTIME                         1.152e+01  3.117e+00   3.697 0.000220 ***
## CAR_USEPrivate                  -7.126e+02  1.591e+02  -4.480 7.57e-06 ***
## BLUEBOOK                         1.524e-02  8.341e-03   1.828 0.067615 .  
## TIF                             -4.526e+01  1.178e+01  -3.842 0.000123 ***
## CAR_TYPEPanel Truck              3.017e+02  2.691e+02   1.121 0.262190    
## CAR_TYPEPickup                   4.033e+02  1.651e+02   2.443 0.014594 *  
## CAR_TYPESports Car               1.029e+03  2.106e+02   4.884 1.06e-06 ***
## CAR_TYPEVan                      4.003e+02  2.063e+02   1.940 0.052365 .  
## CAR_TYPEz_SUV                    7.534e+02  1.734e+02   4.344 1.42e-05 ***
## RED_CARyes                      -8.938e+01  1.442e+02  -0.620 0.535353    
## OLDCLAIM                        -7.926e-03  7.193e-03  -1.102 0.270546    
## CLM_FREQ                         1.205e+02  5.323e+01   2.263 0.023656 *  
## REVOKEDYes                       5.420e+02  1.678e+02   3.229 0.001246 ** 
## MVR_PTS                          1.630e+02  2.508e+01   6.501 8.46e-11 ***
## CAR_AGE                         -2.215e+01  1.237e+01  -1.791 0.073401 .  
## URBANICITYz_Highly Rural/ Rural -1.665e+03  1.348e+02 -12.353  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4395 on 8122 degrees of freedom
## Multiple R-squared:  0.07319,    Adjusted R-squared:  0.06897 
## F-statistic: 17.33 on 37 and 8122 DF,  p-value: < 2.2e-16
summary(lmod1)
## 
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5891  -1698   -760    344 103785 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.065e+03  5.572e+02   1.912  0.05595 .  
## KIDSDRIV                         3.142e+02  1.132e+02   2.777  0.00551 ** 
## AGE                              5.220e+00  7.066e+00   0.739  0.46005    
## HOMEKIDS                         7.764e+01  6.535e+01   1.188  0.23487    
## YOJ                             -3.952e+00  1.509e+01  -0.262  0.79336    
## INCOME                          -4.412e-03  1.805e-03  -2.444  0.01453 *  
## PARENT1Yes                       5.762e+02  2.020e+02   2.852  0.00435 ** 
## HOME_VAL                        -5.580e-04  5.908e-04  -0.945  0.34492    
## MSTATUSz_No                      5.701e+02  1.449e+02   3.935 8.38e-05 ***
## SEXz_F                          -3.688e+02  1.838e+02  -2.007  0.04481 *  
## EDUCATIONBachelors              -2.590e+02  2.048e+02  -1.265  0.20598    
## EDUCATIONMasters                 2.347e+01  2.999e+02   0.078  0.93764    
## EDUCATIONPhD                     2.863e+02  3.559e+02   0.804  0.42127    
## EDUCATIONz_High School          -8.888e+01  1.719e+02  -0.517  0.60520    
## JOBClerical                      5.293e+02  3.414e+02   1.550  0.12110    
## JOBDoctor                       -4.997e+02  4.087e+02  -1.223  0.22151    
## JOBHome Maker                    3.524e+02  3.646e+02   0.967  0.33380    
## JOBLawyer                        2.308e+02  2.956e+02   0.781  0.43498    
## JOBManager                      -4.787e+02  2.885e+02  -1.660  0.09704 .  
## JOBProfessional                  4.565e+02  3.088e+02   1.478  0.13936    
## JOBStudent                       2.876e+02  3.740e+02   0.769  0.44180    
## JOBz_Blue Collar                 5.077e+02  3.218e+02   1.577  0.11473    
## TRAVTIME                         1.195e+01  3.222e+00   3.709  0.00021 ***
## CAR_USEPrivate                  -7.798e+02  1.644e+02  -4.743 2.14e-06 ***
## BLUEBOOK                         1.433e-02  8.623e-03   1.662  0.09647 .  
## TIF                             -4.820e+01  1.218e+01  -3.958 7.63e-05 ***
## CAR_TYPEPanel Truck              2.648e+02  2.782e+02   0.952  0.34112    
## CAR_TYPEPickup                   3.754e+02  1.707e+02   2.200  0.02786 *  
## CAR_TYPESports Car               1.022e+03  2.178e+02   4.693 2.74e-06 ***
## CAR_TYPEVan                      5.145e+02  2.132e+02   2.413  0.01584 *  
## CAR_TYPEz_SUV                    7.518e+02  1.793e+02   4.193 2.78e-05 ***
## RED_CARyes                      -4.820e+01  1.491e+02  -0.323  0.74641    
## OLDCLAIM                        -1.057e-02  7.436e-03  -1.421  0.15527    
## CLM_FREQ                         1.417e+02  5.503e+01   2.575  0.01005 *  
## REVOKEDYes                       5.494e+02  1.735e+02   3.166  0.00155 ** 
## MVR_PTS                          1.753e+02  2.592e+01   6.765 1.43e-11 ***
## CAR_AGE                         -2.681e+01  1.279e+01  -2.097  0.03606 *  
## URBANICITYz_Highly Rural/ Rural -1.665e+03  1.394e+02 -11.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4544 on 8123 degrees of freedom
## Multiple R-squared:  0.07104,    Adjusted R-squared:  0.06681 
## F-statistic: 16.79 on 37 and 8123 DF,  p-value: < 2.2e-16
# Model 2:
par(mfrow = c(2, 2))

plot(lmod2)

#Plot of successive pairs of residuals to check for serial correlation
n2 <- length(residuals(lmod2))
plot(tail(residuals(lmod2), n2-1) ~ head(residuals(lmod2), n2-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv2 <- hatvalues(lmod2)
sum(hatv2)
## [1] 6
index <- row.names(training)
halfnorm(hatv2, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf2 <- cooks.distance(lmod2)
halfnorm(cookf2, 3, labs = index, ylab = "Cook's distances")

## Eliminating the most influential point and re-running the reduced model
lmodf2 <- lm(TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + BLUEBOOK, training, subset = (cookf2 < max(cookf2)))
summary(lmodf2)
## 
## Call:
## lm(formula = TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + 
##     BLUEBOOK, data = training, subset = (cookf2 < max(cookf2)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4223  -1509   -988   -338 104277 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.112e+03  1.707e+02   6.513 7.82e-11 ***
## INCOME      -5.765e-03  1.238e-03  -4.656 3.28e-06 ***
## YOJ         -6.384e+00  1.320e+01  -0.483   0.6288    
## CLM_FREQ     2.673e+02  4.757e+01   5.618 2.00e-08 ***
## MVR_PTS      2.381e+02  2.568e+01   9.273  < 2e-16 ***
## BLUEBOOK     1.196e-02  6.618e-03   1.807   0.0708 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4566 on 8154 degrees of freedom
## Multiple R-squared:  0.02638,    Adjusted R-squared:  0.02578 
## F-statistic: 44.18 on 5 and 8154 DF,  p-value: < 2.2e-16
summary(lmod2)
## 
## Call:
## lm(formula = TARGET_AMT ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + 
##     BLUEBOOK, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4328  -1519   -985   -361 104225 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.037e+03  1.735e+02   5.978 2.35e-09 ***
## INCOME      -5.763e-03  1.259e-03  -4.576 4.81e-06 ***
## YOJ         -3.854e+00  1.343e+01  -0.287   0.7741    
## CLM_FREQ     2.939e+02  4.836e+01   6.078 1.28e-09 ***
## MVR_PTS      2.336e+02  2.611e+01   8.947  < 2e-16 ***
## BLUEBOOK     1.471e-02  6.728e-03   2.186   0.0288 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4643 on 8155 degrees of freedom
## Multiple R-squared:  0.02627,    Adjusted R-squared:  0.02567 
## F-statistic:    44 on 5 and 8155 DF,  p-value: < 2.2e-16
# Model 3:
par(mfrow = c(2, 2))

plot(lmod3)

#Plot of successive pairs of residuals to check for serial correlation
n3 <- length(residuals(lmod3))
plot(tail(residuals(lmod3), n3-1) ~ head(residuals(lmod3), n3-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv3 <- hatvalues(lmod3)
sum(hatv3)
## [1] 23
index <- row.names(training)
halfnorm(hatv3, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf3 <- cooks.distance(lmod3)
halfnorm(cookf3, 3, labs = index, ylab = "Cook's distances")

## Eliminating the most influential point and re-running model 3
lmodf3 <- lm(TARGET_AMT ~ AGE + CLM_FREQ + HOME_VAL + INCOME + JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + REVOKED + SEX + TIF + TRAVTIME + YOJ, training, subset = (cookf3 < max(cookf3)))
summary(lmodf3)
## 
## Call:
## lm(formula = TARGET_AMT ~ AGE + CLM_FREQ + HOME_VAL + INCOME + 
##     JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + 
##     REVOKED + SEX + TIF + TRAVTIME + YOJ, data = training, subset = (cookf3 < 
##     max(cookf3)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4801  -1611   -889     66 104563 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.426e+03  4.502e+02   3.168 0.001539 ** 
## AGE              -3.716e+00  6.167e+00  -0.602 0.546893    
## CLM_FREQ          2.624e+02  5.368e+01   4.888 1.04e-06 ***
## HOME_VAL         -5.909e-04  5.869e-04  -1.007 0.314020    
## INCOME           -3.221e-03  1.670e-03  -1.928 0.053840 .  
## JOBClerical      -3.078e+02  2.687e+02  -1.146 0.251924    
## JOBDoctor        -8.823e+02  3.540e+02  -2.492 0.012717 *  
## JOBHome Maker    -5.283e+02  3.245e+02  -1.628 0.103562    
## JOBLawyer        -5.236e+02  2.603e+02  -2.012 0.044263 *  
## JOBManager       -8.871e+02  2.513e+02  -3.531 0.000417 ***
## JOBProfessional  -2.881e+02  2.496e+02  -1.154 0.248360    
## JOBStudent       -2.885e+02  3.158e+02  -0.914 0.360994    
## JOBz_Blue Collar  1.774e+02  2.430e+02   0.730 0.465472    
## KIDSDRIV          3.741e+02  9.912e+01   3.774 0.000162 ***
## MSTATUSz_No       6.761e+02  1.250e+02   5.408 6.54e-08 ***
## MVR_PTS           2.162e+02  2.569e+01   8.416  < 2e-16 ***
## OLDCLAIM         -1.030e-02  7.400e-03  -1.391 0.164114    
## RED_CARyes       -3.790e+00  1.483e+02  -0.026 0.979618    
## REVOKEDYes        7.558e+02  1.721e+02   4.391 1.14e-05 ***
## SEXz_F           -8.125e+01  1.389e+02  -0.585 0.558591    
## TIF              -4.561e+01  1.212e+01  -3.763 0.000169 ***
## TRAVTIME          4.829e+00  3.171e+00   1.523 0.127844    
## YOJ              -7.489e-01  1.468e+01  -0.051 0.959307    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4527 on 8137 degrees of freedom
## Multiple R-squared:  0.04479,    Adjusted R-squared:  0.04221 
## F-statistic: 17.34 on 22 and 8137 DF,  p-value: < 2.2e-16
summary(lmod3)
## 
## Call:
## lm(formula = TARGET_AMT ~ AGE + CLM_FREQ + HOME_VAL + INCOME + 
##     JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + 
##     REVOKED + SEX + TIF + TRAVTIME + YOJ, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4892  -1622   -885     73 104539 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.410e+03  4.578e+02   3.080 0.002076 ** 
## AGE              -2.536e+00  6.271e+00  -0.404 0.685992    
## CLM_FREQ          2.915e+02  5.456e+01   5.343 9.39e-08 ***
## HOME_VAL         -3.655e-04  5.966e-04  -0.613 0.540190    
## INCOME           -3.512e-03  1.698e-03  -2.068 0.038685 *  
## JOBClerical      -4.402e+02  2.731e+02  -1.612 0.107013    
## JOBDoctor        -1.025e+03  3.599e+02  -2.847 0.004428 ** 
## JOBHome Maker    -6.426e+02  3.299e+02  -1.948 0.051497 .  
## JOBLawyer        -6.608e+02  2.645e+02  -2.498 0.012517 *  
## JOBManager       -1.024e+03  2.554e+02  -4.010 6.13e-05 ***
## JOBProfessional  -4.283e+02  2.536e+02  -1.689 0.091353 .  
## JOBStudent       -3.898e+02  3.211e+02  -1.214 0.224887    
## JOBz_Blue Collar  3.704e+01  2.470e+02   0.150 0.880802    
## KIDSDRIV          4.057e+02  1.008e+02   4.026 5.73e-05 ***
## MSTATUSz_No       7.310e+02  1.271e+02   5.752 9.13e-09 ***
## MVR_PTS           2.119e+02  2.612e+01   8.114 5.60e-16 ***
## OLDCLAIM         -1.158e-02  7.524e-03  -1.538 0.123984    
## RED_CARyes       -5.204e+01  1.508e+02  -0.345 0.730041    
## REVOKEDYes        7.558e+02  1.751e+02   4.317 1.60e-05 ***
## SEXz_F           -1.174e+02  1.412e+02  -0.832 0.405685    
## TIF              -4.541e+01  1.233e+01  -3.684 0.000231 ***
## TRAVTIME          5.846e+00  3.224e+00   1.813 0.069861 .  
## YOJ               3.011e+00  1.492e+01   0.202 0.840115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4604 on 8138 degrees of freedom
## Multiple R-squared:  0.04482,    Adjusted R-squared:  0.04224 
## F-statistic: 17.36 on 22 and 8138 DF,  p-value: < 2.2e-16
# Model 4:
par(mfrow = c(2, 2))

plot(lmod4)

#Plot of successive pairs of residuals to check for serial correlation
n4 <- length(residuals(lmod4))
plot(tail(residuals(lmod4), n4-1) ~ head(residuals(lmod4), n4-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv4 <- hatvalues(lmod4)
sum(hatv4)
## [1] 38
index <- row.names(training)
halfnorm(hatv4, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf4 <- cooks.distance(lmod4)
halfnorm(cookf4, 3, labs = index, ylab = "Cook's distances")

## Eliminating and re-running the full model
lmodf4 <- lm(TARGET_AMT ~ .- INDEX - TARGET_FLAG, data = trainingdata, subset = (cookf4 < max(cookf4)))
summary(lmodf4)
## 
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = trainingdata, 
##     subset = (cookf4 < max(cookf4)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5526  -1693   -766    364 104244 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -372.058   1135.130  -0.328 0.743096    
## KIDSDRIV                          593.933    203.789   2.914 0.003573 ** 
## AGE                                 4.749      7.206   0.659 0.509880    
## HOMEKIDS                          168.655    145.221   1.161 0.245526    
## YOJ                                11.920     18.547   0.643 0.520448    
## INCOME                            -60.145     30.624  -1.964 0.049563 *  
## PARENT1Yes                        538.234    209.171   2.573 0.010095 *  
## HOME_VAL                          -14.153     12.557  -1.127 0.259709    
## MSTATUSz_No                       573.407    151.282   3.790 0.000152 ***
## SEXz_F                           -373.051    179.400  -2.079 0.037608 *  
## EDUCATIONBachelors               -354.357    203.752  -1.739 0.082045 .  
## EDUCATIONMasters                 -187.435    287.704  -0.651 0.514752    
## EDUCATIONPhD                      -64.913    334.505  -0.194 0.846136    
## EDUCATIONz_High School           -118.539    171.830  -0.690 0.490302    
## JOBClerical                       653.340    336.913   1.939 0.052512 .  
## JOBDoctor                        -490.400    409.139  -1.199 0.230713    
## JOBHome Maker                     425.780    370.236   1.150 0.250168    
## JOBLawyer                         285.024    295.321   0.965 0.334507    
## JOBManager                       -454.246    288.359  -1.575 0.115232    
## JOBProfessional                   507.436    308.378   1.645 0.099906 .  
## JOBStudent                        278.395    389.881   0.714 0.475216    
## JOBz_Blue Collar                  569.576    320.921   1.775 0.075966 .  
## TRAVTIME                          338.607     89.185   3.797 0.000148 ***
## CAR_USEPrivate                   -787.554    164.637  -4.784 1.75e-06 ***
## BLUEBOOK                          147.943    102.692   1.441 0.149722    
## TIF                              -288.710     71.646  -4.030 5.64e-05 ***
## CAR_TYPEPanel Truck               278.177    263.031   1.058 0.290277    
## CAR_TYPEPickup                    385.039    170.588   2.257 0.024026 *  
## CAR_TYPESports Car               1030.181    216.054   4.768 1.89e-06 ***
## CAR_TYPEVan                       503.527    212.446   2.370 0.017805 *  
## CAR_TYPEz_SUV                     758.838    174.246   4.355 1.35e-05 ***
## RED_CARyes                        -47.326    149.211  -0.317 0.751117    
## OLDCLAIM                           21.152     24.010   0.881 0.378358    
## CLM_FREQ                           67.856     85.740   0.791 0.428721    
## REVOKEDYes                        425.034    156.860   2.710 0.006750 ** 
## MVR_PTS                           420.522     76.698   5.483 4.31e-08 ***
## CAR_AGE                          -144.500     82.260  -1.757 0.079018 .  
## URBANICITYz_Highly Rural/ Rural -1638.965    139.992 -11.708  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4549 on 8121 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.06937,    Adjusted R-squared:  0.06513 
## F-statistic: 16.36 on 37 and 8121 DF,  p-value: < 2.2e-16
summary(lmod4)
## 
## Call:
## lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = trainingdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5524  -1692   -767    364 104245 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -381.997   1134.865  -0.337 0.736426    
## KIDSDRIV                          594.482    203.775   2.917 0.003540 ** 
## AGE                                 4.778      7.205   0.663 0.507297    
## HOMEKIDS                          169.279    145.208   1.166 0.243742    
## YOJ                                11.914     18.547   0.642 0.520633    
## INCOME                            -60.141     30.622  -1.964 0.049569 *  
## PARENT1Yes                        535.818    209.094   2.563 0.010408 *  
## HOME_VAL                          -14.259     12.554  -1.136 0.256058    
## MSTATUSz_No                       573.026    151.273   3.788 0.000153 ***
## SEXz_F                           -373.629    179.387  -2.083 0.037299 *  
## EDUCATIONBachelors               -354.551    203.742  -1.740 0.081862 .  
## EDUCATIONMasters                 -188.315    287.684  -0.655 0.512750    
## EDUCATIONPhD                      -65.879    334.482  -0.197 0.843866    
## EDUCATIONz_High School           -119.226    171.815  -0.694 0.487752    
## JOBClerical                       654.016    336.893   1.941 0.052255 .  
## JOBDoctor                        -490.708    409.119  -1.199 0.230397    
## JOBHome Maker                     426.700    370.213   1.153 0.249116    
## JOBLawyer                         284.819    295.306   0.964 0.334830    
## JOBManager                       -454.536    288.345  -1.576 0.114981    
## JOBProfessional                   507.414    308.363   1.646 0.099904 .  
## JOBStudent                        278.878    389.861   0.715 0.474427    
## JOBz_Blue Collar                  569.229    320.905   1.774 0.076130 .  
## TRAVTIME                          338.622     89.181   3.797 0.000148 ***
## CAR_USEPrivate                   -786.794    164.620  -4.779 1.79e-06 ***
## BLUEBOOK                          148.802    102.670   1.449 0.147286    
## TIF                              -288.216     71.635  -4.023 5.79e-05 ***
## CAR_TYPEPanel Truck               278.092    263.019   1.057 0.290402    
## CAR_TYPEPickup                    385.536    170.576   2.260 0.023835 *  
## CAR_TYPESports Car               1030.863    216.038   4.772 1.86e-06 ***
## CAR_TYPEVan                       503.431    212.436   2.370 0.017821 *  
## CAR_TYPEz_SUV                     758.477    174.236   4.353 1.36e-05 ***
## RED_CARyes                        -47.304    149.203  -0.317 0.751220    
## OLDCLAIM                           21.153     24.008   0.881 0.378306    
## CLM_FREQ                           67.172     85.722   0.784 0.433296    
## REVOKEDYes                        425.330    156.851   2.712 0.006709 ** 
## MVR_PTS                           421.208     76.679   5.493 4.07e-08 ***
## CAR_AGE                          -143.949     82.247  -1.750 0.080121 .  
## URBANICITYz_Highly Rural/ Rural -1640.568    139.941 -11.723  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4549 on 8122 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.06936,    Adjusted R-squared:  0.06512 
## F-statistic: 16.36 on 37 and 8122 DF,  p-value: < 2.2e-16

Evaluating the logit models

We decide to use model 7 as our final logit model because it has the lowest AIC value (7363.1). Furthermore, model 7 also has the highest accuracy (0.78) and the lowest classification error rate (0.22), is tied on precision with Model 4 (0.783), and has the highest specificity (0.252) and F1 (0.866) values. Its sensitivity (0.969) ranks third of the four models. These figures are the ones reported in the Model 1 row of the summary table below.

Please note that I had to look for outside help on calculating the confusion matrix and the ROC curves.

Based on these diagnostics, I decided to use model 7 to predict results using the evaluation dataset.
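The metrics compared above all come from the four cells of a confusion matrix. The following sketch computes them by hand in base R on made-up labels (the vectors `actual` and `predicted` are hypothetical). It assumes class 0 is the positive class, which is caret's default for `confusionMatrix()` when the factor levels are "0" and "1".

```r
# Hypothetical labels, made up purely for illustration (1 = crash, 0 = no crash)
actual    <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
predicted <- c(0, 0, 0, 0, 0, 1, 1, 1, 0, 0)

# Assumption: class 0 is "positive", matching caret's default (first factor level)
positive <- 0
tp <- sum(predicted == positive & actual == positive)
fp <- sum(predicted == positive & actual != positive)
fn <- sum(predicted != positive & actual == positive)
tn <- sum(predicted != positive & actual != positive)

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
error_rate  <- 1 - accuracy
precision   <- tp / (tp + fp)
sensitivity <- tp / (tp + fn)   # recall for the positive class
specificity <- tn / (tn + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

print(round(c(Accuracy = accuracy, Error = error_rate, Precision = precision,
              Sensitivity = sensitivity, Specificity = specificity, F1 = f1), 3))
```

Setting `positive <- 1` swaps the roles of sensitivity and specificity, which matters when the class of interest is the crash itself.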

Confusion matrix:

library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:faraway':
## 
##     melanoma
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(kableExtra)
library(dplyr)

formula(gmod1)
## TARGET_FLAG ~ (INDEX + TARGET_AMT + KIDSDRIV + AGE + HOMEKIDS + 
##     YOJ + INCOME + PARENT1 + HOME_VAL + MSTATUS + SEX + EDUCATION + 
##     JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + 
##     OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY) - 
##     INDEX - TARGET_AMT
formula(gmod2)
## TARGET_FLAG ~ INCOME + YOJ + CLM_FREQ + MVR_PTS + BLUEBOOK
formula(gmod3)
## TARGET_FLAG ~ AGE + CLM_FREQ + HOME_VAL + INCOME + JOB + KIDSDRIV + 
##     MSTATUS + MVR_PTS + OLDCLAIM + RED_CAR + REVOKED + SEX + 
##     TIF + TRAVTIME + YOJ
formula(gmod4)
## TARGET_FLAG ~ (INDEX + TARGET_AMT + KIDSDRIV + AGE + HOMEKIDS + 
##     YOJ + INCOME + PARENT1 + HOME_VAL + MSTATUS + SEX + EDUCATION + 
##     JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + 
##     OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY) - 
##     INDEX - TARGET_AMT
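# Note: predict() on a glm defaults to type = "link", so these values are log-odds, not probabilities;
# the 0.5 cutoff applied below is therefore a cutoff on the log-odds scale.
preds1 = predict(gmod1, newdata = training)
preds2 = predict(gmod2, newdata = training)
preds3 = predict(gmod3, newdata = training)
preds4 = predict(gmod4, newdata = trainingdata)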

preds1[preds1 >= 0.5] <- 1
preds1[preds1 < 0.5] <- 0
preds1 = as.factor(preds1)

preds2[preds2 >= 0.5] <- 1
preds2[preds2 < 0.5] <- 0
preds2 = as.factor(preds2)

preds3[preds3 >= 0.5] <- 1
preds3[preds3 < 0.5] <- 0
preds3 = as.factor(preds3)

preds4[preds4 >= 0.5] <- 1
preds4[preds4 < 0.5] <- 0
preds4[is.na(preds4)] <- 0
preds4 = as.factor(preds4)

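# Note: with factor levels "0" and "1", confusionMatrix() treats the first level ("0", no crash) as the positive class by default,
# so Sensitivity below is the share of non-crash records classified correctly.
m1 <- confusionMatrix(preds1, as.factor(training$TARGET_FLAG), mode = "everything")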
m2 <- confusionMatrix(preds2, as.factor(training$TARGET_FLAG), mode = "everything")
m3 <- confusionMatrix(preds3, as.factor(training$TARGET_FLAG), mode = "everything")
m4 <- confusionMatrix(preds4, as.factor(trainingdata$TARGET_FLAG), mode = "everything")

temp <- data.frame(m1$overall, 
                   m2$overall, 
                   m3$overall,
                   m4$overall) %>%
  t() %>%
  data.frame() %>%
  dplyr::select(Accuracy) %>%
  mutate(Classification_Error_Rate = 1-Accuracy)
Summ_Stat <-data.frame(m1$byClass, 
                   m2$byClass, 
                   m3$byClass,
                   m4$byClass) %>%
  t() %>%
  data.frame() %>%
  cbind(temp) %>%
  mutate(Model = c("Model 1", "Model 2", "Model 3", "Model 4")) %>%
  dplyr::select(Model, Accuracy, Classification_Error_Rate, Precision, Sensitivity, Specificity, F1) %>%
  mutate_if(is.numeric, round,3) %>%
  kable('html', escape = F) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)

Summ_Stat
Model    Accuracy  Classification_Error_Rate  Precision  Sensitivity  Specificity  F1
Model 1  0.780     0.220                      0.783      0.969        0.252        0.866
Model 2  0.742     0.258                      0.743      0.994        0.041        0.850
Model 3  0.755     0.245                      0.757      0.983        0.120        0.855
Model 4  0.778     0.222                      0.783      0.967        0.251        0.865
getROC <- function(model) {
    name <- deparse(substitute(model))
    pred.prob1 <- predict(model, newdata = training)
    p1 <- data.frame(pred = training$TARGET_FLAG, prob = pred.prob1)
    p1 <- p1[order(p1$prob),]
    rocobj <- pROC::roc(p1$pred, p1$prob)
    plot(rocobj, asp=NA, legacy.axes = TRUE, print.auc=TRUE,
         xlab="1 - Specificity", main = name)  # legacy.axes = TRUE plots 1 - specificity on the x axis
}

getROC1 <- function(model) {
    name <- deparse(substitute(model))
    pred.prob1 <- predict(model, newdata = trainingdata)
    p1 <- data.frame(pred = trainingdata$TARGET_FLAG, prob = pred.prob1)
    p1 <- p1[order(p1$prob),]
    rocobj <- pROC::roc(p1$pred, p1$prob)
    plot(rocobj, asp=NA, legacy.axes = TRUE, print.auc=TRUE,
         xlab="1 - Specificity", main = name)  # legacy.axes = TRUE plots 1 - specificity on the x axis
}

par(mfrow=c(3,3))

getROC(gmod1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
getROC(gmod2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
getROC(gmod3)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
getROC1(gmod4)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
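The `predict()` calls on the logit models in this section omit `type`, and for a `glm` the default is `type = "link"`, i.e. log-odds. The sketch below, on simulated data with hypothetical names, illustrates why that leaves a ROC curve unchanged but shifts a 0.5 classification cutoff.

```r
set.seed(1)  # simulated data; all names here are hypothetical
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.3 + x))
fit <- glm(y ~ x, family = binomial)

log_odds <- predict(fit)                     # default type = "link"
prob     <- predict(fit, type = "response")  # probabilities in (0, 1)

# The two scales are linked by the logistic function
stopifnot(isTRUE(all.equal(plogis(log_odds), prob)))

# A 0.5 cutoff on the log-odds scale is a probability cutoff of plogis(0.5), about 0.62
class_link <- as.integer(log_odds >= 0.5)
class_prob <- as.integer(prob >= 0.5)
print(table(link_scale = class_link, prob_scale = class_prob))

# The ranking of observations is the same on both scales, so ROC/AUC is unaffected
stopifnot(all(diff(prob[order(log_odds)]) >= 0))
```

Because the higher effective cutoff labels fewer records as class 1, metrics computed from a thresholded link-scale prediction are not directly comparable with those from a 0.5 probability cutoff.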

Predicting using the eval dataset

We use logit model 7 (gmod4) to predict TARGET_FLAG and then use lmodf1 (the basic linear model after removing the most influential point) to predict TARGET_AMT.
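The two-stage idea (classify first, then predict a cost only for predicted crashes) can be sketched on simulated data. Everything in this block is hypothetical: the variables `pts`, `income`, `flag`, and `amt`, the simulated relationships, and the object names are stand-ins, not the assignment's data or models.

```r
set.seed(621)  # simulated data; every name below is hypothetical

n <- 500
train <- data.frame(pts = rpois(n, 2), income = rnorm(n, 60, 15))
train$flag <- rbinom(n, 1, plogis(-1.5 + 0.4 * train$pts - 0.01 * train$income))
train$amt  <- ifelse(train$flag == 1,
                     pmax(rnorm(n, 4000 + 300 * train$pts, 1500), 100), 0)

# Stage 1: logistic model for the crash indicator
flag_fit <- glm(flag ~ pts + income, data = train, family = binomial)
# Stage 2: linear model for the cost, fit on all rows as in the approach described above
amt_fit <- lm(amt ~ pts + income, data = train)

newdat <- data.frame(pts = rpois(50, 2), income = rnorm(50, 60, 15))

# type = "response" returns probabilities, so 0.5 is a probability cutoff
p_crash <- predict(flag_fit, newdata = newdat, type = "response")
newdat$flag <- as.integer(p_crash > 0.5)

# Cost defaults to zero and is filled in only for predicted crashes
newdat$amt <- 0
crash <- newdat$flag == 1
if (any(crash)) {
  newdat$amt[crash] <- predict(amt_fit, newdata = newdat[crash, , drop = FALSE])
}
print(table(predicted_flag = newdat$flag))
```

Indexing with a logical vector keeps the row order of the new data intact, so the predicted amounts line up with the rows they belong to.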

#Convert variables to numeric class
eval$TARGET_AMT <- as.numeric(gsub('[$,]', '', eval$TARGET_AMT))
eval$INCOME <- as.numeric(gsub('[$,]', '', eval$INCOME))
eval$HOME_VAL <- as.numeric(gsub('[$,]', '', eval$HOME_VAL))
eval$BLUEBOOK <- as.numeric(gsub('[$,]', '', eval$BLUEBOOK))
eval$OLDCLAIM <- as.numeric(gsub('[$,]', '', eval$OLDCLAIM))

# Predict TARGET_FLAG with gmod4, then TARGET_AMT with lmodf1 for the predicted crashes, and write the result to an output file.

eval_probs <- predict(gmod4, newdata = eval, type='response')
eval$TARGET_FLAG <- ifelse(eval_probs > 0.5, 1, 0)
eval$TARGET_AMT <- 0.0
eval[which(eval$TARGET_FLAG == 1),]$TARGET_AMT <- predict(lmodf1, newdata = eval %>% filter(TARGET_FLAG == 1))

write.csv(eval, file = '/Users/tponnada/Downloads/predictHW4.csv')