Overview

In this homework assignment, you will explore, analyze, and model a data set containing approximately 8000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first, TARGET_FLAG, is a 1 or a 0: a “1” means the person was in a car crash, and a “0” means they were not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did, it is a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

VARIABLE DESCRIPTION THEORETICAL EFFECT
INDEX Identification Variable (do not use) None
TARGET_FLAG Was Car in a crash? 1=YES 0=NO None
TARGET_AMT If car was in a crash, what was the cost None
AGE Age of Driver Very young people tend to be risky; maybe very old people also
BLUEBOOK Value of Vehicle Unknown effect, but probably affects the payout if there is a crash
CAR_AGE Vehicle Age Unknown effect, but probably affects the payout if there is a crash
CAR_TYPE Type of Car Unknown effect, but probably affects the payout if there is a crash
CAR_USE Vehicle Use Commercial vehicles are driven more, so might increase probability of collision
CLM_FREQ # Claims (Past 5 Years) The more claims filed in the past, the more likely to file in the future
EDUCATION Max Education Level Unknown effect, but in theory more educated people tend to drive more safely
HOMEKIDS # Children at Home Unknown effect
HOME_VAL Home Value In theory, home owners tend to drive more responsibly
INCOME Income In theory, rich people tend to get into fewer crashes
JOB Job Category In theory, white collar jobs tend to be safer
KIDSDRIV # Driving Children When teenagers drive a car, they are more likely to get into crashes
MSTATUS Marital Status In theory, married people drive more safely
MVR_PTS Motor Vehicle Record Points Drivers with lots of traffic tickets tend to get into more crashes
OLDCLAIM Total Claims Past 5 Years A high total payout over past five years suggests high future payouts
PARENT1 Single Parent Unknown effect
RED_CAR A Red Car Urban legend says that red cars (especially red sports cars) are more risky
REVOKED License Revoked Past 7 Years If license revoked in past 7 years, driver is probably more risky
SEX Gender Urban legend says that women have fewer crashes than men
TIF Time in Force People who have been customers for a long time are usually safer
TRAVTIME Distance to Work Long drives to work usually suggest greater risk
URBANICITY Home/Work Area Unknown
YOJ Years on Job People who stay at a job for a long time are usually safer

A write-up submitted in PDF format. Your write-up should have four sections, each described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details. Assign predictions (probabilities, classifications, cost) to the evaluation data set, using a 0.5 classification threshold.
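
As a preview of that final scoring step, below is a minimal sketch of how a 0.5 threshold converts predicted probabilities into classifications (the probabilities are toy values, not model output):

p_crash <- c(0.12, 0.48, 0.50, 0.87)        # hypothetical predicted probabilities
crash_flag <- ifelse(p_crash >= 0.5, 1, 0)  # apply the 0.5 threshold
crash_flag
## [1] 0 0 1 1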

training <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
               "SPS/master/DATA%20621/insurance_training_data.csv"), na.strings = "")

evaluation <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
               "SPS/master/DATA%20621/insurance-evaluation-data.csv"), na.strings = "")

M <- rbind(training, evaluation) # Merged
n <- nrow(training)   # training is M[1:n, ]; predictor columns are 4:26
m <- nrow(evaluation) # evaluation is M[(1+n):(m+n), ]
# X tracks, for each column of M, whether it remains a candidate predictor
# for each response; INDEX and the opposite response are excluded up front.
X <- data.frame("TARGET_FLAG" = rep(T, ncol(M)), 
                "TARGET_AMT" = rep(T, ncol(M)))
X[match(c("INDEX", "TARGET_AMT"), names(M)), "TARGET_FLAG"] <- F
X[match(c("INDEX", "TARGET_FLAG"), names(M)), "TARGET_AMT"] <- F

Normalize Data

quantitative <- c(4:8, 10, 15, 17, 18, 21, 22, 24, 25)
names(M[quantitative])
##  [1] "KIDSDRIV" "AGE"      "HOMEKIDS" "YOJ"      "INCOME"   "HOME_VAL"
##  [7] "TRAVTIME" "BLUEBOOK" "TIF"      "OLDCLAIM" "CLM_FREQ" "MVR_PTS" 
## [13] "CAR_AGE"
categorical <- c(13, 14, 19)
names(M[categorical])
## [1] "EDUCATION" "JOB"       "CAR_TYPE"
binary <- c(9, 11, 12, 16, 20, 23, 26)
names(M[binary])
## [1] "PARENT1"    "MSTATUS"    "SEX"        "CAR_USE"    "RED_CAR"   
## [6] "REVOKED"    "URBANICITY"
# Strip "$" and "," from currency-formatted fields and convert to numeric
Currency_Convert <- function(Field){
  as.numeric(gsub("\\$|,", "", Field))
}

# Recode a two-level categorical field as a 0/1 numeric indicator
Binary_Convert <- function(Field, Neg, Pos) {
  Field <- as.character(Field)
  Field[which(Field == Neg)] <- 0
  Field[which(Field == Pos)] <- 1
  as.numeric(Field)
}

M$INCOME <- Currency_Convert(M$INCOME)
M$PARENT1 <- Binary_Convert(M$PARENT1, "No", "Yes")
M$HOME_VAL <- Currency_Convert(M$HOME_VAL)
M$MSTATUS <- Binary_Convert(M$MSTATUS, "z_No", "Yes")
M$SEX <- Binary_Convert(M$SEX, "M", "z_F")
M$CAR_USE <- Binary_Convert(M$CAR_USE, "Commercial", "Private")
M$BLUEBOOK <- Currency_Convert(M$BLUEBOOK)
M$RED_CAR <- Binary_Convert(M$RED_CAR, "no", "yes")
M$OLDCLAIM <- Currency_Convert(M$OLDCLAIM)
M$REVOKED <- Binary_Convert(M$REVOKED, "No", "Yes")
M$URBANICITY <- Binary_Convert(M$URBANICITY, "z_Highly Rural/ Rural", "Highly Urban/ Urban")

M$CAR_AGE[which(M$CAR_AGE < 0)] <- NA    # a negative vehicle age is a data error
M$HOME_VAL[which(M$HOME_VAL == 0)] <- NA # treat a home value of $0 as missing

Several variables use currency formatting or are two-level categorical fields. The former were stripped of currency symbols and converted to numeric; the latter were recoded as 0/1 indicators. In addition, negative CAR_AGE values and $0 HOME_VAL values were recoded as missing.

Data Exploration

Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest, while too little detail will make the manager think you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

  1. Mean / Standard Deviation / Median
  2. Bar Chart or Box Plot of the data
  3. Is the data correlated to the target variable (or to other variables?)
  4. Are any of the variables missing and need to be imputed/“fixed”?
library("DT")
display <- function(data) {
  datatable(data, options = list(
    searching = TRUE,
    pageLength = 5,
    lengthMenu = c(5, nrow(data))
    ), rownames = FALSE)
}

Numerical Summaries

all(complete.cases(M[1:n, -1]))
## [1] FALSE
summary(M[1:n, -1])
##   TARGET_FLAG       TARGET_AMT        KIDSDRIV           AGE       
##  Min.   :0.0000   Min.   :     0   Min.   :0.0000   Min.   :16.00  
##  1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
##  Median :0.0000   Median :     0   Median :0.0000   Median :45.00  
##  Mean   :0.2638   Mean   :  1504   Mean   :0.1711   Mean   :44.79  
##  3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
##  Max.   :1.0000   Max.   :107586   Max.   :4.0000   Max.   :81.00  
##                                                     NA's   :6      
##     HOMEKIDS           YOJ           INCOME          PARENT1     
##  Min.   :0.0000   Min.   : 0.0   Min.   :     0   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28097   1st Qu.:0.000  
##  Median :0.0000   Median :11.0   Median : 54028   Median :0.000  
##  Mean   :0.7212   Mean   :10.5   Mean   : 61898   Mean   :0.132  
##  3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85986   3rd Qu.:0.000  
##  Max.   :5.0000   Max.   :23.0   Max.   :367030   Max.   :1.000  
##                   NA's   :454    NA's   :445                     
##     HOME_VAL         MSTATUS            SEX                 EDUCATION   
##  Min.   : 50223   Min.   :0.0000   Min.   :0.0000   <High School :1203  
##  1st Qu.:153074   1st Qu.:0.0000   1st Qu.:0.0000   Bachelors    :2242  
##  Median :206692   Median :1.0000   Median :1.0000   Masters      :1658  
##  Mean   :220621   Mean   :0.5997   Mean   :0.5361   PhD          : 728  
##  3rd Qu.:270023   3rd Qu.:1.0000   3rd Qu.:1.0000   z_High School:2330  
##  Max.   :885282   Max.   :1.0000   Max.   :1.0000                       
##  NA's   :2758                                                           
##             JOB          TRAVTIME         CAR_USE          BLUEBOOK    
##  z_Blue Collar:1825   Min.   :  5.00   Min.   :0.0000   Min.   : 1500  
##  Clerical     :1271   1st Qu.: 22.00   1st Qu.:0.0000   1st Qu.: 9280  
##  Professional :1117   Median : 33.00   Median :1.0000   Median :14440  
##  Manager      : 988   Mean   : 33.49   Mean   :0.6288   Mean   :15710  
##  Lawyer       : 835   3rd Qu.: 44.00   3rd Qu.:1.0000   3rd Qu.:20850  
##  (Other)      :1599   Max.   :142.00   Max.   :1.0000   Max.   :69740  
##  NA's         : 526                                                    
##       TIF                CAR_TYPE       RED_CAR          OLDCLAIM    
##  Min.   : 1.000   Minivan    :2145   Min.   :0.0000   Min.   :    0  
##  1st Qu.: 1.000   Panel Truck: 676   1st Qu.:0.0000   1st Qu.:    0  
##  Median : 4.000   Pickup     :1389   Median :0.0000   Median :    0  
##  Mean   : 5.351   Sports Car : 907   Mean   :0.2914   Mean   : 4037  
##  3rd Qu.: 7.000   Van        : 750   3rd Qu.:1.0000   3rd Qu.: 4636  
##  Max.   :25.000   z_SUV      :2294   Max.   :1.0000   Max.   :57037  
##                                                                      
##     CLM_FREQ         REVOKED          MVR_PTS          CAR_AGE     
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.: 1.00  
##  Median :0.0000   Median :0.0000   Median : 1.000   Median : 8.00  
##  Mean   :0.7986   Mean   :0.1225   Mean   : 1.696   Mean   : 8.33  
##  3rd Qu.:2.0000   3rd Qu.:0.0000   3rd Qu.: 3.000   3rd Qu.:12.00  
##  Max.   :5.0000   Max.   :1.0000   Max.   :13.000   Max.   :28.00  
##                                                     NA's   :511    
##    URBANICITY    
##  Min.   :0.0000  
##  1st Qu.:1.0000  
##  Median :1.0000  
##  Mean   :0.7955  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
## 

Looking at the data summaries for the training dataset, we can see that several variables have notable numbers of NAs, with HOME_VAL by far the highest, followed by JOB, CAR_AGE, YOJ, and INCOME. We can also see multiple skewed variables whose maximums are large relative to their means and medians.
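
A quick, distribution-free way to quantify that skew is to compare each variable’s mean to its median in standard-deviation units, where clearly positive values suggest right skew; a small sketch:

# (mean - median) / sd as a crude right-skew indicator for the quantitative fields
skew_check <- sapply(M[1:n, quantitative], function(x)
  (mean(x, na.rm = TRUE) - median(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
round(sort(skew_check, decreasing = TRUE), 2)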

Variable Scatterplots

par(mfrow = c(4,4), cex=.4)
for (i in c(quantitative, categorical)) { 
  plot(M[1:n, i], main = names(M[i])) 
}

Not many of the nonbinary variables appear to approach normality.

Scatterplot Matrix

plot(M[1:n, c(quantitative, categorical)])

Although it is somewhat hard to see here, relationships exist between many of the quantitative variables.

Histograms, Density Plots

par(mfrow = c(2,4))
for (i in c(quantitative, categorical)) {
  if(is.numeric(M[1:n, i])) {
    hist(M[1:n, i], xlab = names(M[i]), main = names(M[i]))
    d <- density(M[1:n, i], na.rm=T)
  }
  else {
    plot(M[1:n, i], xlab = names(M[i]), main = names(M[i]))
    d <- density(as.numeric(M[1:n, i]), na.rm=T)
  }
  plot(d, main = names(M[i]))
  polygon(d, col="red")
}

The histograms and density plots give a better understanding of how the nonbinary data are distributed. The distribution of the variable AGE could pass for some variation of a Gaussian or Student-\(t\) distribution. The variables INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, and MVR_PTS show heavy skewing. The variables KIDSDRIV, HOMEKIDS, YOJ, OLDCLAIM, CLM_FREQ, and CAR_AGE have multimodal distributions.

Correlation Heatmap

library(ggplot2)
library(reshape2)
ggplot(data = melt(abs(cor(sapply(na.omit(training), as.numeric)))), aes(x=Var1, y=Var2, fill=value)) +
  scale_fill_gradient(low = 'black', high = 'red', name = "Absolute Value") +
  geom_tile() + labs(title = "Correlation Heatmap") +
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.5))

There is some correlation visible between the predictor variables and the response variables, but not very much. The notable correlations are those between the predictor variables themselves: MVR_PTS is correlated with CLM_FREQ and OLDCLAIM; RED_CAR with SEX and CAR_TYPE; CAR_USE with JOB; MSTATUS with PARENT1 and HOME_VAL; and HOMEKIDS with AGE and PARENT1.

Principal Component Analysis

PCA <- function(X) {
  Xpca <- prcomp(na.omit(X), center = T, scale. = T) 
  # Note: the scores below are formed from the raw (uncentered, unscaled) data
  # matrix, so high-variance dollar-denominated columns dominate every component.
  M <- as.matrix(na.omit(X)); R <- as.matrix(Xpca$rotation); score <- M %*% R
  print(list("Importance of Components" = summary(Xpca)$importance[ ,1:5], 
             "Rotation (Variable Loadings)" = Xpca$rotation[ ,1:5],
             "Correlation between X and PC" = cor(na.omit(X), score)[ ,1:5]))
  par(mfrow=c(2,3))
  barplot(Xpca$sdev^2, ylab = "Component Variance")
  barplot(cor(cbind(X)), ylab = "Correlations")
  barplot(Xpca$rotation, ylab = "Loadings")  
  biplot(Xpca); barplot(M); barplot(score)
}
PCA(M[1:n, quantitative])
## $`Importance of Components`
##                             PC1      PC2      PC3      PC4       PC5
## Standard deviation     1.645244 1.334995 1.270056 1.012738 0.9938183
## Proportion of Variance 0.208220 0.137090 0.124080 0.078900 0.0759700
## Cumulative Proportion  0.208220 0.345310 0.469390 0.548290 0.6242600
## 
## $`Rotation (Variable Loadings)`
##                  PC1         PC2         PC3         PC4          PC5
## KIDSDRIV  0.10774567 -0.17584383  0.50530195 -0.07331372  0.083069165
## AGE      -0.26771011  0.02800193 -0.33747972  0.09204957  0.398329669
## HOMEKIDS  0.22780686 -0.15800486  0.61071654 -0.03742552 -0.070548997
## YOJ      -0.14018772 -0.15622254  0.30537124  0.16303338  0.482290631
## INCOME   -0.53361654 -0.15978248  0.16457623 -0.05878797 -0.111702800
## HOME_VAL -0.54258303 -0.15297470  0.13345360 -0.05556924 -0.070094419
## TRAVTIME  0.03412860  0.01807301 -0.03461120 -0.66695073  0.612187440
## BLUEBOOK -0.34153090 -0.09792299  0.09498366 -0.05047749  0.005257043
## TIF       0.01113475  0.06077082  0.08346442  0.70417523  0.407479904
## OLDCLAIM  0.10846886 -0.53950381 -0.18816292  0.08137372  0.027275343
## CLM_FREQ  0.12435306 -0.57054584 -0.21337557  0.02350619  0.041684584
## MVR_PTS   0.12635248 -0.48204768 -0.13797155 -0.02119042 -0.055156727
## CAR_AGE  -0.32388952 -0.08617776 -0.04999174  0.02775244 -0.181923195
## 
## $`Correlation between X and PC`
##                  PC1          PC2         PC3         PC4         PC5
## KIDSDRIV  0.03279334  0.026553825 -0.03415518  0.03430623  0.03304671
## AGE      -0.21530718 -0.207800075  0.21108534 -0.21467167 -0.20309336
## HOMEKIDS  0.14930487  0.138457016 -0.14945042  0.15094329  0.14520501
## YOJ      -0.20724065 -0.209543450  0.20479944 -0.20430965 -0.20661069
## INCOME   -0.98252529 -0.961368639  0.98208694 -0.97955171 -0.98781092
## HOME_VAL -0.99539680 -0.973818412  0.99092055 -0.99071401 -0.99286141
## TRAVTIME  0.03450769  0.037350720 -0.03374552  0.03207236  0.03588214
## BLUEBOOK -0.43933895 -0.428786579  0.44044267 -0.45038023 -0.40712199
## TIF       0.02086626  0.026174257 -0.01892478  0.01915871  0.02069475
## OLDCLAIM  0.06486343 -0.155449228 -0.13116239  0.13800364  0.07289718
## CLM_FREQ  0.07562867 -0.034894080 -0.10831923  0.11178715  0.07912375
## MVR_PTS   0.07488695  0.009564432 -0.09370134  0.09613452  0.07548462
## CAR_AGE  -0.37196326 -0.364430487  0.37128491 -0.37062402 -0.37290793

The correlation matrix between the original variables and the principal components indicates that the majority of the variance in the data comes, in order of significance, from HOME_VAL and INCOME. To a much lesser extent, BLUEBOOK and CAR_AGE also contribute to the variance.
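
Because the scores above were built from the raw data matrix, the large-dollar variables dominate every component; as a cross-check, prcomp already returns properly centered and scaled scores in its $x element. A minimal sketch of the comparison:

# Cross-check: correlations computed against prcomp's own centered/scaled scores
Xpca <- prcomp(na.omit(M[1:n, quantitative]), center = TRUE, scale. = TRUE)
round(cor(na.omit(M[1:n, quantitative]), Xpca$x[, 1:5]), 2)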

Data Preparation

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.

  1. Fix missing values (maybe with a Mean or Median value)
  2. Create flags to suggest if a variable was missing
  3. Transform data by putting it into buckets
  4. Mathematical transforms such as log or square root (or, use Box-Cox)
  5. Combine variables (such as ratios or adding or multiplying) to create new variables

Missing Value Imputation

library(VIM)
all(complete.cases(training))
## [1] FALSE
aggr(M[1:n, 4:26], bars=F, sortVars=T)

## 
##  Variables sorted by number of missings: 
##    Variable       Count
##    HOME_VAL 0.337948781
##         JOB 0.064452886
##     CAR_AGE 0.062614876
##         YOJ 0.055630437
##      INCOME 0.054527631
##         AGE 0.000735204
##    KIDSDRIV 0.000000000
##    HOMEKIDS 0.000000000
##     PARENT1 0.000000000
##     MSTATUS 0.000000000
##         SEX 0.000000000
##   EDUCATION 0.000000000
##    TRAVTIME 0.000000000
##     CAR_USE 0.000000000
##    BLUEBOOK 0.000000000
##         TIF 0.000000000
##    CAR_TYPE 0.000000000
##     RED_CAR 0.000000000
##    OLDCLAIM 0.000000000
##    CLM_FREQ 0.000000000
##     REVOKED 0.000000000
##     MVR_PTS 0.000000000
##  URBANICITY 0.000000000

The HOME_VAL, JOB, CAR_AGE, YOJ, and INCOME variables have notable proportions of missing values (the Count column above reports proportions), and AGE has a small number of missing values as well.

Impute Jobs from Education

Likely_Value <- function(Field_1, Field_2, Value) {
  # Mode for Field_1 for given Value of Field_2
  frequencies <- table(Field_1[which(Field_2 == Value)])
  most_frequent <- names(sort(frequencies, decreasing = TRUE)[1])
  return(most_frequent)
}

M$JOB[(is.na(M$JOB) & M$EDUCATION == "PhD")] <- Likely_Value(M$JOB, M$EDUCATION, "PhD")
M$JOB[(is.na(M$JOB) & M$EDUCATION == "Masters")] <- Likely_Value(M$JOB, M$EDUCATION, "Masters")
M$JOB[(is.na(M$JOB) & M$EDUCATION == "Bachelors")] <- Likely_Value(M$JOB, M$EDUCATION, "Bachelors")
M$JOB[(is.na(M$JOB) & M$EDUCATION == "z_High School")] <- Likely_Value(M$JOB, M$EDUCATION, "z_High School")
M$JOB[(is.na(M$JOB) & M$EDUCATION == "<High School")] <- Likely_Value(M$JOB, M$EDUCATION, "<High School")

Assuming that education level is a reasonable proxy for a person’s job, missing JOB values are imputed with the most frequent (modal) JOB value observed at that record’s EDUCATION level. For example, if most people with education level E hold job J, then a record with education level E and a missing JOB is assigned job J.
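
For instance, the modal job among PhD holders can be inspected directly (the returned value depends on the data):

# Most frequent JOB among records with EDUCATION == "PhD"
Likely_Value(M$JOB, M$EDUCATION, "PhD")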

Impute Using the mice Package

library(mice)
MICE <- mice(M[1:n, 4:26], predictorMatrix = quickpred(M[1:n, 4:26]), method = "mean", printFlag = F)
M[1:n, 4:26] <- complete(MICE, action = 1)
MICE <- mice(M[(1+n):(m+n), ], predictorMatrix = quickpred(M[(1+n):(m+n), ]), method = "mean", printFlag = F)
M[(1+n):(m+n), ] <- complete(MICE, action = 1)
M$CAR_AGE <- as.integer(M$CAR_AGE)
aggr(M[1:n, 4:26], bars=F, sortVars=T)

## 
##  Variables sorted by number of missings: 
##    Variable Count
##    KIDSDRIV     0
##         AGE     0
##    HOMEKIDS     0
##         YOJ     0
##      INCOME     0
##     PARENT1     0
##    HOME_VAL     0
##     MSTATUS     0
##         SEX     0
##   EDUCATION     0
##         JOB     0
##    TRAVTIME     0
##     CAR_USE     0
##    BLUEBOOK     0
##         TIF     0
##    CAR_TYPE     0
##     RED_CAR     0
##    OLDCLAIM     0
##    CLM_FREQ     0
##     REVOKED     0
##     MVR_PTS     0
##     CAR_AGE     0
##  URBANICITY     0

Remaining missing values were replaced with column means via the mice package (method = "mean"), i.e., unconditional mean imputation carried out within the Multivariate Imputation by Chained Equations (MICE) framework.
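
For a single numeric column, method = "mean" is equivalent in spirit to the base-R one-liner below (a toy illustration, not the call that was run):

x <- c(2, NA, 4, 6)
x[is.na(x)] <- mean(x, na.rm = TRUE)  # the NA becomes the mean of 2, 4, and 6
x
## [1] 2 4 4 6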

Recode Categorical Variables

M$PHD <- ifelse(M$EDUCATION == "PhD", 1, 0)
M$MASTERS <- ifelse(M$EDUCATION == "Masters", 1, 0)
M$BACHELORS <- ifelse(M$EDUCATION == "Bachelors", 1, 0)
M$HS <- ifelse(M$EDUCATION == "z_High School", 1, 0)
M$NOHS <- ifelse(M$EDUCATION == "<High School", 1, 0)

M$CLERICAL <- ifelse(M$JOB == "Clerical", 1, 0)
M$DOCTOR <- ifelse(M$JOB == "Doctor", 1, 0)
M$HOME_MAKER <- ifelse(M$JOB == "Home Maker", 1, 0)
M$LAWYER <- ifelse(M$JOB == "Lawyer", 1, 0)
M$MANAGER <- ifelse(M$JOB == "Manager", 1, 0)
M$PROF <- ifelse(M$JOB == "Professional", 1, 0)
M$STUDENT <- ifelse(M$JOB == "Student", 1, 0)
M$BLUE_COLLAR <- ifelse(M$JOB == "z_Blue Collar", 1, 0)

M$MINIVAN <- ifelse(M$CAR_TYPE == "Minivan", 1, 0)
M$TRUCK <- ifelse(M$CAR_TYPE == "Panel Truck", 1, 0)
M$PICKUP <- ifelse(M$CAR_TYPE == "Pickup", 1, 0)
M$SPORTS <- ifelse(M$CAR_TYPE == "Sports Car", 1, 0)
M$VAN <- ifelse(M$CAR_TYPE == "Van", 1, 0)
M$SUV <- ifelse(M$CAR_TYPE == "z_SUV", 1, 0)

remove <- c("EDUCATION", "JOB", "CAR_TYPE")
X <- rbind(X, data.frame("TARGET_FLAG" = rep(T, ncol(M)-nrow(X)), 
                         "TARGET_AMT" = rep(T, ncol(M)-nrow(X))))
X[match(remove, names(M)), ] <- F

Categorical variables with more than two levels were expanded into binary dummy variables, and the original columns were marked for removal from the candidate predictor set.
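
The explicit ifelse recoding keeps full control over column names; a more compact base-R alternative would be model.matrix, which expands a factor into one 0/1 column per level (a sketch of the equivalent operation, not what was run here):

# Sketch: dummy expansion via model.matrix (one column per EDUCATION level)
edu_dummies <- model.matrix(~ EDUCATION - 1, data = M)
head(colnames(edu_dummies))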

Correlations

library(reshape2)
Corr_XY <- function(X, Y) {
  corr <- data.frame(array(NA, dim = c(ncol(X), 5)))
  colnames(corr) <- c("Y", "X", "r","p","<0.05")
  for (i in 1:ncol(X)) {
    r <- cor.test(Y[, 1], X[, i])
    corr[i, 1] <- names(Y)
    corr[i, 2] <- names(X[i])
    corr[i, 3] <- r$estimate
    corr[i, 4] <- r$p.value
    corr[i, 5] <- corr[i, 4] < 0.05
  }
  return(corr)
}
Corr_XX <- function(X, threshold) {
  corr <- data.frame(array(NA, dim = c(choose(ncol(X), 2), 5)))
  colnames(corr) <- c("X1", "X2", "r","p","<0.05"); k = 1
  for (i in 1:(ncol(X) - 1)) {
    for (j in (i+1):ncol(X)) {
      r <- cor.test(X[,i], X[,j])
      corr[k, 1] <- names(X[i])
      corr[k, 2] <- names(X[j])
      corr[k, 3] <- r$estimate
      corr[k, 4] <- r$p.value
      corr[k, 5] <- corr[k, 4] < 0.05
      k = k + 1
    }
  }
  least <- corr[corr[,"<0.05"] == F, ]
  most <- corr[abs(corr[,"r"]) >= threshold, ]
  result <- list("Correlations" = corr, "Least_Correlated"= least, "Most_Correlated" = most)
  return(result)
}

Between \(\textrm{TARGET}_\textrm{AMT}\) and \(X\) Variables

The specification M[1:n, -c(1:3, categorical)] creates a data frame excluding the INDEX, TARGET_FLAG, TARGET_AMT, and categorical variables. The specification M[1:n, 3, drop = FALSE] creates a single-column data frame with the \(Y\) of interest (TARGET_AMT) and retains the column name.

correlations <- Corr_XY(M[1:n, -c(1:3, categorical)], M[1:n, 3, drop = FALSE])
display(correlations)

The predictor variables SEX, BLUEBOOK, RED_CAR, BACHELORS, CLERICAL, HOME_MAKER, PROF, LAWYER, YOJ, and SUV do not have statistically significant correlations with the response variable and are therefore not considered for this model. The variable YOJ sits right at the threshold of significance and is dropped along with the others.

remove <- c("SEX", "BLUEBOOK", "RED_CAR", "BACHELORS", "CLERICAL", "HOME_MAKER", "PROF", "LAWYER", "YOJ", "SUV")
X[match(remove, names(M)), "TARGET_AMT"] <- F

Between \(\textrm{TARGET}_\textrm{FLAG}\) and \(X\) Variables

The specification M[1:n, -c(1:3, categorical)] creates a data frame excluding the INDEX, TARGET_FLAG, TARGET_AMT, and categorical variables. The specification M[1:n, 2, drop = FALSE] creates a single-column data frame with the \(Y\) of interest (TARGET_FLAG) and retains the column name.

correlations <- Corr_XY(M[1:n, -c(1:3, categorical)], M[1:n, 2, drop = FALSE])
display(correlations)

The predictor variables SEX, RED_CAR, HOME_MAKER, TRUCK, and VAN do not have statistically significant correlations with the response variable and are therefore not being considered for the model.

remove <- c("SEX", "RED_CAR", "HOME_MAKER", "TRUCK", "VAN")
X[match(remove, names(M)), "TARGET_FLAG"] <- F

Between All \(X\) Variables

correlations <- Corr_XX(M[1:n,  (X[,"TARGET_AMT"] & X[,"TARGET_FLAG"])], 0.50)
display(correlations$Least_Correlated)
display(correlations$Most_Correlated)

The specification M[1:n, (X[,"TARGET_AMT"] & X[,"TARGET_FLAG"])] creates a data frame excluding INDEX, TARGET_FLAG, TARGET_AMT, and the variables previously marked for removal for lacking statistically significant correlations with the response variables. There are strong statistically significant correlations between HOME_VAL & INCOME, PHD & DOCTOR, and MASTERS & LAWYER. From each correlated pair we find that HOME_VAL, DOCTOR, and LAWYER are the members least correlated with both TARGET_FLAG and TARGET_AMT, so these three variables will not be considered for the model. It is worth noting that the high correlations between PHD & DOCTOR and between MASTERS & LAWYER are likely artifacts of the earlier JOB imputation from EDUCATION.

remove <- c("HOME_VAL", "DOCTOR", "LAWYER")
X[match(remove, names(M)), ] <- F

Logarithmic Transformation

library(MASS)
columns <- c("INCOME", "HOME_VAL", "TRAVTIME", "BLUEBOOK", "TIF", "MVR_PTS")
# Fit an exponential distribution to each candidate field and overlay the
# fitted density on its histogram; returns the fitted rate (lambda) per field.
fit_exp <- function(X, fields) {
  potential <- match(fields, names(X))
  lambda <- numeric(ncol(X))
  par(mfrow=c(2,3))
  for (i in potential) {
    shifted <- X[, i] - min(X[, i]) + 1e-32  # shift the minimum to just above zero
    fit <- fitdistr(shifted, "Exponential")
    lambda[i] <- fit$estimate
    exp_draws <- rexp(1000, lambda[i])       # reference sample from the fitted rate
    hist(X[, i], prob=TRUE, col="grey", main = names(X[i]), 
         xlab=paste("Lambda =", fractions(lambda[i])))
    lines(density(exp_draws), col="blue", lwd=2)
  }
  data.frame("VARIABLE" = fields, "LAMBDA" = lambda[potential])
}
lambda <- fit_exp(M[1:n, ], columns)

lambda
##   VARIABLE       LAMBDA
## 1   INCOME 1.615559e-05
## 2 HOME_VAL 5.868625e-06
## 3 TRAVTIME 3.510530e-02
## 4 BLUEBOOK 7.037347e-05
## 5      TIF 2.298161e-01
## 6  MVR_PTS 5.897955e-01

Five of the six potential variables lend themselves to modeling with an exponential distribution. The variables were shifted to just above zero by subtracting the minimum value and then adding \(10^{-32}\). This also moves data with a negative minimum in the appropriate direction, since subtracting a negative minimum is equivalent to adding its absolute value.
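
A one-line check of the shift arithmetic on a toy vector with a negative minimum:

x <- c(-3, 0, 5)
shifted <- x - min(x) + 1e-32  # subtracting the minimum of -3 adds 3 to every value
min(shifted)                   # the minimum now sits just above zero
## [1] 1e-32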

M[, "log_INCOME"] <- log(M[, "INCOME"] - min(M[, "INCOME"]) + 1e-32, lambda[1,2])
M[, "log_TRAVTIME"] <- log(M[, "TRAVTIME"] - min(M[, "TRAVTIME"]) + 1e-32, lambda[3,2])
M[, "log_BLUEBOOK"] <- log(M[, "BLUEBOOK"] - min(M[, "BLUEBOOK"]) + 1e-32, lambda[4,2])
M[, "log_TIF"] <- log(M[, "TIF"] - min(M[, "TIF"]) + 1e-32, lambda[5,2])
M[, "log_MVR_PTS"] <- log(M[, "MVR_PTS"] - min(M[, "MVR_PTS"]) + 1e-32, lambda[6,2])
remove <- c("INCOME", "TRAVTIME", "BLUEBOOK", "TIF", "MVR_PTS")
X <- rbind(X, data.frame("TARGET_FLAG" = rep(T, ncol(M)-nrow(X)), 
                         "TARGET_AMT" = rep(T, ncol(M)-nrow(X))))
X[match(remove, names(M)), ] <- F

Categorization of Multimodal Data

par(mfrow=c(2,3))
smoothScatter(M[1:n, "KIDSDRIV"], ylab = "KIDSDRIV")
smoothScatter(M[1:n, "HOMEKIDS"], ylab = "HOMEKIDS")
smoothScatter(M[1:n, "YOJ"], ylab = "YOJ")
smoothScatter(M[1:n, "OLDCLAIM"], ylab = "OLDCLAIM")
smoothScatter(M[1:n, "CLM_FREQ"], ylab = "CLM_FREQ")
smoothScatter(M[1:n, "CAR_AGE"], ylab = "CAR_AGE")

The variables KIDSDRIV, HOMEKIDS, YOJ, OLDCLAIM, CLM_FREQ, and CAR_AGE have bimodal distributions with a clear line of demarcation at zero that we can use to bifurcate each variable into two categories: zero if the value equals zero, and one otherwise.

M[,"cat_KIDSDRIV"] <- ifelse(M$KIDSDRIV == 0, 0, 1)
M[,"cat_HOMEKIDS"] <- ifelse(M$HOMEKIDS == 0, 0, 1)
M[,"cat_YOJ"] <- ifelse(M$YOJ == 0, 0, 1)
M[,"cat_OLDCLAIM"] <- ifelse(M$OLDCLAIM == 0, 0, 1)
M[,"cat_CLM_FREQ"] <- ifelse(M$CLM_FREQ == 0, 0, 1)
M[,"cat_CAR_AGE"] <- ifelse(M$CAR_AGE == 0, 0, 1)
remove <- c("KIDSDRIV", "HOMEKIDS", "YOJ", "OLDCLAIM", "CLM_FREQ", "CAR_AGE")
X <- rbind(X, data.frame("TARGET_FLAG" = rep(T, ncol(M)-nrow(X)), 
                         "TARGET_AMT" = rep(T, ncol(M)-nrow(X))))
X[match(remove, names(M)), ] <- F

Box-Cox Transformation

library(car)
potential <- match(c("AGE"), names(M))
box.cox.powers <- powerTransform(M[1:n, potential], family="bcPower")
summary(box.cox.powers)
## bcPower Transformation to Normality 
##                   Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
## M[1:n, potential]    1.0391   0.0454             0.95           1.1282
## 
## Likelihood ratio tests about transformation parameters
##                               LRT df     pval
## LR test, lambda = (0) 559.6780078  1 0.000000
## LR test, lambda = (1)   0.7421689  1 0.388967

The only unexamined variable to which the Box-Cox transformation could potentially be applied is the quantitative variable AGE. Upon examination, however, AGE returns an estimated power close to one, which indicates that no transformation is necessary. This is further supported by the Wald bounds, which include one.
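
As a sanity check, the Box-Cox family is \((x^\lambda - 1)/\lambda\), so \(\lambda = 1\) reduces to the linear shift \(x - 1\), which leaves the shape of the distribution unchanged:

# bcPower at lambda = 1 is just x - 1, i.e. no change in distributional shape
all.equal(bcPower(M[1:n, "AGE"], lambda = 1), M[1:n, "AGE"] - 1)
## [1] TRUE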

Build Models

Using the training data set, build at least two different multiple linear regression (MLR) models and three different binary logistic regression (BLR) models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise selection, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion in or exclusion from a model, indicate why this was done.

Discuss the coefficients in the models; do they make sense? For example, if a person has a lot of traffic tickets, you would reasonably expect that person to have more car crashes. If the coefficient is negative (suggesting that the person is a safer driver), then that needs to be discussed. Are you keeping the model even though it is counterintuitive? Why? The boss needs to know.

training_AMT <- M[1:n, X[,"TARGET_AMT"]]
training_FLAG <- M[1:n, X[,"TARGET_FLAG"]]

Forward Selection

Forward stepwise subset selection based on \(AIC\). Using \(k = 2\) degrees of freedom for the penalty gives the genuine \(AIC\); using \(k = \log(n)\) is sometimes referred to as BIC or SBC.
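
For reference, the BIC/SBC variant differs only in the penalty argument, as in this sketch (using the null and full models defined below; it was not run for this report):

# Sketch: BIC-penalized forward selection; only k changes relative to AIC
# bic_steps <- step(null, scope = list(lower = null, upper = full),
#                   direction = "forward", k = log(n), trace = FALSE)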

Multiple Linear Regression

null <- lm(TARGET_AMT ~ 0, training_AMT)
full <- lm(TARGET_AMT ~ ., training_AMT)
aic_steps <- step(null, scope=list(lower=null, upper=full), direction="forward", k = 2, trace=F)
aic_steps$call
## lm(formula = TARGET_AMT ~ URBANICITY + cat_OLDCLAIM + PARENT1 + 
##     MANAGER + CAR_USE + cat_CAR_AGE + MINIVAN + log_MVR_PTS + 
##     MSTATUS + cat_KIDSDRIV + log_TIF + NOHS + HS + REVOKED + 
##     SPORTS + log_INCOME + log_TRAVTIME - 1, data = training_AMT)

The above model has the lowest AIC.

forward_AMT <- lm(aic_steps$call, training_AMT)
round(coef(summary(forward_AMT)), 6)
##                 Estimate Std. Error   t value Pr(>|t|)
## URBANICITY   1505.085315 136.313068 11.041387 0.000000
## cat_OLDCLAIM  552.793025 114.474396  4.828967 0.000001
## PARENT1       640.172763 177.545247  3.605688 0.000313
## MANAGER      -851.066664 162.431235 -5.239551 0.000000
## CAR_USE      -785.064181 111.841871 -7.019412 0.000000
## cat_CAR_AGE   761.221081 182.601816  4.168749 0.000031
## MINIVAN      -470.879470 121.461053 -3.876794 0.000107
## log_MVR_PTS    -2.769052   0.757178 -3.657068 0.000257
## MSTATUS      -585.805201 119.630701 -4.896780 0.000001
## cat_KIDSDRIV  692.326953 162.330292  4.264928 0.000020
## log_TIF         6.991248   2.130387  3.281680 0.001036
## NOHS          657.591808 153.642482  4.280013 0.000019
## HS            470.297182 120.908400  3.889698 0.000101
## REVOKED       471.513352 155.133315  3.039407 0.002378
## SPORTS        321.992205 168.356009  1.912567 0.055839
## log_INCOME     45.990011  25.543071  1.800489 0.071820
## log_TRAVTIME  -18.650708  11.105457 -1.679418 0.093109

At a significance level of \(\alpha=0.05\), the 17 forward-selected variables yield a TARGET_AMT MLR model with three insignificant variables: SPORTS, log_INCOME, and log_TRAVTIME. Removing those three insignificant variables yields a model with all significant variables.

forward_AMT <- lm(TARGET_AMT ~ URBANICITY + cat_OLDCLAIM + PARENT1 + 
    MANAGER + CAR_USE + MINIVAN + log_MVR_PTS + MSTATUS + cat_CAR_AGE +
    cat_KIDSDRIV + log_TIF + NOHS + HS + REVOKED - 1, training_AMT)
summary(forward_AMT)
## 
## Call:
## lm(formula = TARGET_AMT ~ URBANICITY + cat_OLDCLAIM + PARENT1 + 
##     MANAGER + CAR_USE + MINIVAN + log_MVR_PTS + MSTATUS + cat_CAR_AGE + 
##     cat_KIDSDRIV + log_TIF + NOHS + HS + REVOKED - 1, data = training_AMT)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4997  -1694   -818    354 104984 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## URBANICITY   1469.0432   135.8182  10.816  < 2e-16 ***
## cat_OLDCLAIM  572.1325   114.3497   5.003 5.75e-07 ***
## PARENT1       655.2860   177.4451   3.693 0.000223 ***
## MANAGER      -884.0253   162.0108  -5.457 5.00e-08 ***
## CAR_USE      -729.9259   109.4039  -6.672 2.69e-11 ***
## MINIVAN      -540.1926   117.7244  -4.589 4.53e-06 ***
## log_MVR_PTS    -2.8047     0.7574  -3.703 0.000214 ***
## MSTATUS      -581.0082   119.6464  -4.856 1.22e-06 ***
## cat_CAR_AGE   772.8248   181.7666   4.252 2.14e-05 ***
## cat_KIDSDRIV  683.0515   162.3725   4.207 2.62e-05 ***
## log_TIF         7.1675     2.1301   3.365 0.000769 ***
## NOHS          676.0476   153.2975   4.410 1.05e-05 ***
## HS            496.4631   120.3608   4.125 3.75e-05 ***
## REVOKED       474.5687   155.1805   3.058 0.002234 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4560 on 8147 degrees of freedom
## Multiple R-squared:  0.1488, Adjusted R-squared:  0.1473 
## F-statistic: 101.7 on 14 and 8147 DF,  p-value: < 2.2e-16

Binary Logistic Regression

null <- glm(TARGET_FLAG ~ 0, family = binomial(link = "logit"), training_FLAG)
full <- glm(TARGET_FLAG ~ ., family = binomial(link = "logit"), training_FLAG)
aic_steps <- step(null, scope=list(lower=null, upper=full), direction="forward", k = 2, trace=F)
aic_steps$aic
## [1] 7500.651
aic_steps$formula
## TARGET_FLAG ~ AGE + URBANICITY + cat_OLDCLAIM + CAR_USE + cat_YOJ + 
##     MSTATUS + MANAGER + MINIVAN + REVOKED + cat_KIDSDRIV + HS + 
##     NOHS + cat_CAR_AGE + log_TIF + log_MVR_PTS + cat_HOMEKIDS + 
##     SPORTS + log_TRAVTIME + SUV + PHD + log_INCOME + CLERICAL + 
##     PICKUP + log_BLUEBOOK + BLUE_COLLAR + PARENT1 - 1

The above model has the lowest AIC.

forward_FLAG <- glm(aic_steps$formula, family = binomial(link = "logit"), training_FLAG)
round(coef(summary(forward_FLAG)), 6)
##               Estimate Std. Error   z value Pr(>|z|)
## AGE          -0.003424   0.003925 -0.872196 0.383101
## URBANICITY    2.224240   0.110518 20.125582 0.000000
## cat_OLDCLAIM  0.553906   0.060688  9.127162 0.000000
## CAR_USE      -0.729315   0.081633 -8.934078 0.000000
## cat_YOJ      -0.000101   0.283764 -0.000355 0.999717
## MSTATUS      -0.692176   0.074923 -9.238513 0.000000
## MANAGER      -0.712284   0.107968 -6.597173 0.000000
## MINIVAN      -0.418051   0.105548 -3.960787 0.000075
## REVOKED       0.729757   0.079316  9.200623 0.000000
## cat_KIDSDRIV  0.556026   0.096696  5.750221 0.000000
## HS            0.513970   0.075100  6.843797 0.000000
## NOHS          0.546576   0.097037  5.632669 0.000000
## cat_CAR_AGE  -2.817303   0.329517 -8.549807 0.000000
## log_TIF       0.007683   0.001190  6.459280 0.000000
## log_MVR_PTS  -0.002206   0.000437 -5.052844 0.000000
## cat_HOMEKIDS  0.224665   0.096231  2.334635 0.019563
## SPORTS        0.636551   0.119359  5.333084 0.000000
## log_TRAVTIME -0.028562   0.007067 -4.041575 0.000053
## SUV           0.384536   0.100159  3.839250 0.000123
## PHD          -0.314215   0.115890 -2.711326 0.006701
## log_INCOME    0.106774   0.037340  2.859542 0.004243
## CLERICAL      0.325682   0.097022  3.356792 0.000789
## PICKUP        0.223960   0.095602  2.342627 0.019149
## log_BLUEBOOK  0.061436   0.023980  2.561970 0.010408
## BLUE_COLLAR   0.188304   0.090050  2.091115 0.036518
## PARENT1       0.215201   0.118922  1.809608 0.070357

At a significance level of \(\alpha=0.05\), the 26 forward-selected variables yield a TARGET_FLAG BLR model with three insignificant variables: AGE, cat_YOJ, and PARENT1. Removing those three insignificant variables yields a model with all significant variables.

forward_FLAG <- glm(TARGET_FLAG ~ URBANICITY + cat_OLDCLAIM + CAR_USE +
    MSTATUS + MANAGER + MINIVAN + REVOKED + cat_KIDSDRIV + HS + 
    NOHS + cat_CAR_AGE + log_TIF + log_MVR_PTS + cat_HOMEKIDS + 
    SPORTS + log_TRAVTIME + SUV + PHD + log_INCOME + CLERICAL + 
    PICKUP + log_BLUEBOOK + BLUE_COLLAR - 1, family = binomial(link = "logit"), training_FLAG)
summary(forward_FLAG)
## 
## Call:
## glm(formula = TARGET_FLAG ~ URBANICITY + cat_OLDCLAIM + CAR_USE + 
##     MSTATUS + MANAGER + MINIVAN + REVOKED + cat_KIDSDRIV + HS + 
##     NOHS + cat_CAR_AGE + log_TIF + log_MVR_PTS + cat_HOMEKIDS + 
##     SPORTS + log_TRAVTIME + SUV + PHD + log_INCOME + CLERICAL + 
##     PICKUP + log_BLUEBOOK + BLUE_COLLAR - 1, family = binomial(link = "logit"), 
##     data = training_FLAG)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1946  -0.7273  -0.4224   0.6889   3.1402  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## URBANICITY    2.2185007  0.1101897  20.133  < 2e-16 ***
## cat_OLDCLAIM  0.5511070  0.0606483   9.087  < 2e-16 ***
## CAR_USE      -0.7312157  0.0815656  -8.965  < 2e-16 ***
## MSTATUS      -0.7827185  0.0585965 -13.358  < 2e-16 ***
## MANAGER      -0.7090159  0.1077946  -6.577 4.79e-11 ***
## MINIVAN      -0.4156174  0.1055195  -3.939 8.19e-05 ***
## REVOKED       0.7302799  0.0793086   9.208  < 2e-16 ***
## cat_KIDSDRIV  0.5302165  0.0941368   5.632 1.78e-08 ***
## HS            0.5146327  0.0749761   6.864 6.70e-12 ***
## NOHS          0.5475820  0.0968129   5.656 1.55e-08 ***
## cat_CAR_AGE  -2.9259596  0.1395232 -20.971  < 2e-16 ***
## log_TIF       0.0076797  0.0011890   6.459 1.06e-10 ***
## log_MVR_PTS  -0.0022271  0.0004362  -5.106 3.29e-07 ***
## cat_HOMEKIDS  0.3568313  0.0683715   5.219 1.80e-07 ***
## SPORTS        0.6357416  0.1191917   5.334 9.62e-08 ***
## log_TRAVTIME -0.0283289  0.0070487  -4.019 5.84e-05 ***
## SUV           0.3844424  0.1001565   3.838 0.000124 ***
## PHD          -0.3247954  0.1155157  -2.812 0.004928 ** 
## log_INCOME    0.1076736  0.0150112   7.173 7.34e-13 ***
## CLERICAL      0.3327295  0.0966097   3.444 0.000573 ***
## PICKUP        0.2252018  0.0956307   2.355 0.018527 *  
## log_BLUEBOOK  0.0618683  0.0239369   2.585 0.009748 ** 
## BLUE_COLLAR   0.1911011  0.0899125   2.125 0.033552 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11313.5  on 8161  degrees of freedom
## Residual deviance:  7452.6  on 8138  degrees of freedom
## AIC: 7498.6
## 
## Number of Fisher Scoring iterations: 5

Backward Elimination

Backward stepwise subset elimination based on \(AIC\). Using \(k = 2\) degrees of freedom for the penalty gives the genuine \(AIC\); using \(k = \log(n)\) is sometimes referred to as BIC or SBC.

Multiple Linear Regression

null <- lm(TARGET_AMT ~ 0, training_AMT)
full <- lm(TARGET_AMT ~ ., training_AMT)
aic_steps <- step(full, scope=list(lower=null, upper=full), direction="backward", k = 2, trace=F)
aic_steps$call
## lm(formula = TARGET_AMT ~ PARENT1 + MSTATUS + CAR_USE + REVOKED + 
##     URBANICITY + HS + NOHS + MANAGER + MINIVAN + SPORTS + log_INCOME + 
##     log_TRAVTIME + log_TIF + log_MVR_PTS + cat_KIDSDRIV + cat_OLDCLAIM, 
##     data = training_AMT)

The above model has the lowest AIC.

backward_AMT <- lm(aic_steps$call, training_AMT)
round(coef(summary(backward_AMT)), 6)
##                 Estimate Std. Error   t value Pr(>|t|)
## (Intercept)   762.784417 183.151125  4.164782 0.000031
## PARENT1       640.018132 177.569345  3.604328 0.000315
## MSTATUS      -586.551921 119.723526 -4.899220 0.000001
## CAR_USE      -785.707613 111.900046 -7.021513 0.000000
## REVOKED       470.841555 155.143313  3.034881 0.002414
## URBANICITY   1505.218992 136.333458 11.040716 0.000000
## HS            470.034608 120.940482  3.886495 0.000103
## NOHS          657.777887 153.639789  4.281299 0.000019
## MANAGER      -851.651499 162.435546 -5.243012 0.000000
## MINIVAN      -471.449326 121.477290 -3.880967 0.000105
## SPORTS        322.212013 168.353540  1.913901 0.055668
## log_INCOME     46.035671  25.544003  1.802210 0.071549
## log_TRAVTIME  -18.636693  11.105620 -1.678132 0.093360
## log_TIF         6.989195   2.130538  3.280483 0.001041
## log_MVR_PTS    -2.773052   0.757597 -3.660325 0.000253
## cat_KIDSDRIV  691.794967 162.331388  4.261622 0.000021
## cat_OLDCLAIM  552.008131 114.513722  4.820454 0.000001

At a significance level of \(\alpha=0.05\), the 17 backward-selected terms yield a TARGET_AMT MLR model with three insignificant variables: SPORTS, log_INCOME, and log_TRAVTIME. Removing those three insignificant variables yields a model with all significant variables.

backward_AMT <- lm(TARGET_AMT ~ PARENT1 + MSTATUS + CAR_USE + REVOKED + 
    URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF + 
    log_MVR_PTS + cat_KIDSDRIV + cat_OLDCLAIM, training_AMT)
summary(backward_AMT)
## 
## Call:
## lm(formula = TARGET_AMT ~ PARENT1 + MSTATUS + CAR_USE + REVOKED + 
##     URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF + log_MVR_PTS + 
##     cat_KIDSDRIV + cat_OLDCLAIM, data = training_AMT)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4997  -1694   -819    354 104984 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   774.2430   182.3082   4.247 2.19e-05 ***
## PARENT1       655.1922   177.4678   3.692 0.000224 ***
## MSTATUS      -581.7175   119.7388  -4.858 1.21e-06 ***
## CAR_USE      -730.5026   109.4617  -6.674 2.66e-11 ***
## REVOKED       473.8986   155.1903   3.054 0.002268 ** 
## URBANICITY   1469.2228   135.8413  10.816  < 2e-16 ***
## HS            496.2580   120.3904   4.122 3.79e-05 ***
## NOHS          676.2812   153.2939   4.412 1.04e-05 ***
## MANAGER      -884.6343   162.0158  -5.460 4.90e-08 ***
## MINIVAN      -540.8132   117.7404  -4.593 4.43e-06 ***
## log_TIF         7.1657     2.1302   3.364 0.000772 ***
## log_MVR_PTS    -2.8086     0.7578  -3.706 0.000212 ***
## cat_KIDSDRIV  682.5059   162.3736   4.203 2.66e-05 ***
## cat_OLDCLAIM  571.3661   114.3887   4.995 6.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4560 on 8147 degrees of freedom
## Multiple R-squared:  0.0617, Adjusted R-squared:  0.0602 
## F-statistic: 41.21 on 13 and 8147 DF,  p-value: < 2.2e-16

Binary Logistic Regression

null <- glm(TARGET_FLAG ~ 0, family = binomial(link = "logit"), training_FLAG)
full <- glm(TARGET_FLAG ~ ., family = binomial(link = "logit"), training_FLAG)
aic_steps <- step(full, scope=list(lower=null, upper=full), direction="backward", k = 2, trace=F)
aic_steps$formula
## TARGET_FLAG ~ PARENT1 + MSTATUS + CAR_USE + REVOKED + URBANICITY + 
##     PHD + MASTERS + BACHELORS + CLERICAL + MANAGER + STUDENT + 
##     BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME + 
##     log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV + 
##     cat_HOMEKIDS + cat_OLDCLAIM

The above model has the lowest AIC.

backward_FLAG <- glm(aic_steps$formula, family = binomial(link = "logit"), training_FLAG)
round(coef(summary(backward_FLAG)), 6)
##               Estimate Std. Error    z value Pr(>|z|)
## (Intercept)  -2.554923   0.160423 -15.926206 0.000000
## PARENT1       0.217757   0.119008   1.829758 0.067286
## MSTATUS      -0.694028   0.074400  -9.328352 0.000000
## CAR_USE      -0.688893   0.080778  -8.528279 0.000000
## REVOKED       0.733365   0.079366   9.240333 0.000000
## URBANICITY    2.249695   0.111388  20.196878 0.000000
## PHD          -0.791338   0.133765  -5.915870 0.000000
## MASTERS      -0.508002   0.105529  -4.813852 0.000001
## BACHELORS    -0.485927   0.077640  -6.258734 0.000000
## CLERICAL      0.382111   0.106668   3.582245 0.000341
## MANAGER      -0.690384   0.109689  -6.293993 0.000000
## STUDENT       0.184715   0.126539   1.459743 0.144361
## BLUE_COLLAR   0.250666   0.103458   2.422876 0.015398
## MINIVAN      -0.435240   0.105575  -4.122578 0.000037
## PICKUP        0.219309   0.096226   2.279105 0.022661
## SPORTS        0.616909   0.118944   5.186530 0.000000
## SUV           0.364174   0.099838   3.647647 0.000265
## log_INCOME    0.100999   0.015677   6.442715 0.000000
## log_TRAVTIME -0.028743   0.007072  -4.064134 0.000048
## log_BLUEBOOK  0.059508   0.024005   2.479025 0.013174
## log_TIF       0.007729   0.001190   6.493015 0.000000
## log_MVR_PTS  -0.002189   0.000437  -5.008679 0.000001
## cat_KIDSDRIV  0.542256   0.094714   5.725173 0.000000
## cat_HOMEKIDS  0.252943   0.087625   2.886646 0.003894
## cat_OLDCLAIM  0.557205   0.060751   9.171972 0.000000

At a significance level of \(\alpha=0.05\), the 25 backward-selected terms yield a TARGET_FLAG BLR model with two insignificant variables: PARENT1 and STUDENT. Removing those two insignificant variables yields a model with all significant variables.

backward_FLAG <- glm(TARGET_FLAG ~ MSTATUS + CAR_USE + REVOKED + URBANICITY + 
    PHD + MASTERS + BACHELORS + CLERICAL + MANAGER + 
    BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME + 
    log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV + 
    cat_HOMEKIDS + cat_OLDCLAIM, family = binomial(link = "logit"), training_FLAG)
summary(backward_FLAG)
## 
## Call:
## glm(formula = TARGET_FLAG ~ MSTATUS + CAR_USE + REVOKED + URBANICITY + 
##     PHD + MASTERS + BACHELORS + CLERICAL + MANAGER + BLUE_COLLAR + 
##     MINIVAN + PICKUP + SPORTS + SUV + log_INCOME + log_TRAVTIME + 
##     log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV + cat_HOMEKIDS + 
##     cat_OLDCLAIM, family = binomial(link = "logit"), data = training_FLAG)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2016  -0.7231  -0.4212   0.6891   3.1351  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.4312711  0.1486779 -16.353  < 2e-16 ***
## MSTATUS      -0.7777777  0.0586439 -13.263  < 2e-16 ***
## CAR_USE      -0.7155003  0.0787814  -9.082  < 2e-16 ***
## REVOKED       0.7343615  0.0793348   9.256  < 2e-16 ***
## URBANICITY    2.2363335  0.1109238  20.161  < 2e-16 ***
## PHD          -0.8497653  0.1275975  -6.660 2.74e-11 ***
## MASTERS      -0.5639725  0.0979155  -5.760 8.42e-09 ***
## BACHELORS    -0.5101645  0.0752438  -6.780 1.20e-11 ***
## CLERICAL      0.3256558  0.0990172   3.289  0.00101 ** 
## MANAGER      -0.7098209  0.1082712  -6.556 5.53e-11 ***
## BLUE_COLLAR   0.1852830  0.0928333   1.996  0.04595 *  
## MINIVAN      -0.4147417  0.1048748  -3.955 7.67e-05 ***
## PICKUP        0.2317109  0.0958995   2.416  0.01568 *  
## SPORTS        0.6331507  0.1186074   5.338 9.39e-08 ***
## SUV           0.3802718  0.0993833   3.826  0.00013 ***
## log_INCOME    0.1067877  0.0151806   7.034 2.00e-12 ***
## log_TRAVTIME -0.0285046  0.0070522  -4.042 5.30e-05 ***
## log_BLUEBOOK  0.0604612  0.0239746   2.522  0.01167 *  
## log_TIF       0.0077168  0.0011898   6.486 8.82e-11 ***
## log_MVR_PTS  -0.0021850  0.0004367  -5.004 5.63e-07 ***
## cat_KIDSDRIV  0.5297128  0.0942272   5.622 1.89e-08 ***
## cat_HOMEKIDS  0.3609836  0.0685862   5.263 1.42e-07 ***
## cat_OLDCLAIM  0.5555339  0.0607168   9.150  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7446.2  on 8138  degrees of freedom
## AIC: 7492.2
## 
## Number of Fisher Scoring iterations: 5

Adjusted \(R^2\)

library(leaps)
model_sum_AMT <- summary(regsubsets(TARGET_AMT ~ ., training_AMT, nvmax=ncol(training_AMT)))
## Reordering variables and trying again:
model_sum_FLAG <- summary(regsubsets(TARGET_FLAG ~ ., training_FLAG, nvmax=ncol(training_FLAG)))
## Reordering variables and trying again:
par(mfrow=c(1,2))
plot(model_sum_AMT$adjr2, xlab = "Number of Variables", ylab = "Adj R-squared", main="TARGET_AMT")
plot(model_sum_FLAG$adjr2, xlab = "Number of Variables", ylab = "Adj R-squared", main="TARGET_FLAG")

cbind(max(model_sum_AMT$adjr2), which.max(model_sum_AMT$adjr2))
##            [,1] [,2]
## [1,] 0.06123634   19
cbind(max(model_sum_FLAG$adjr2), which.max(model_sum_FLAG$adjr2))
##          [,1] [,2]
## [1,] 0.211793   24

Multiple Linear Regression

The maximum Adjusted \(R^2\) of 0.0612363 for the model predicting TARGET_AMT is reached when the model contains 19 variables.

model_sum_AMT$which[which.max(model_sum_AMT$adjr2), ]
##  (Intercept)          AGE      PARENT1      MSTATUS      CAR_USE 
##         TRUE        FALSE         TRUE         TRUE         TRUE 
##      REVOKED   URBANICITY          PHD      MASTERS           HS 
##         TRUE         TRUE         TRUE        FALSE         TRUE 
##         NOHS      MANAGER      STUDENT  BLUE_COLLAR      MINIVAN 
##         TRUE         TRUE        FALSE        FALSE         TRUE 
##        TRUCK       PICKUP       SPORTS          VAN   log_INCOME 
##        FALSE         TRUE         TRUE        FALSE         TRUE 
## log_TRAVTIME log_BLUEBOOK      log_TIF  log_MVR_PTS cat_KIDSDRIV 
##         TRUE        FALSE         TRUE         TRUE         TRUE 
## cat_HOMEKIDS      cat_YOJ cat_OLDCLAIM cat_CLM_FREQ  cat_CAR_AGE 
##         TRUE        FALSE        FALSE         TRUE        FALSE
adjustedr2_AMT <- lm(TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE + 
    REVOKED + URBANICITY + PHD + HS + NOHS + MANAGER + MINIVAN + PICKUP + 
    SPORTS + log_INCOME + log_TRAVTIME + log_TIF + log_MVR_PTS + 
    cat_KIDSDRIV + cat_HOMEKIDS + cat_CLM_FREQ, training_AMT)
round(coef(summary(adjustedr2_AMT)), 6)
##                 Estimate Std. Error   t value Pr(>|t|)
## (Intercept)   855.350195 190.724913  4.484732 0.000007
## PARENT1       477.258229 217.200288  2.197319 0.028026
## MSTATUS      -647.893956 127.782128 -5.070302 0.000000
## CAR_USE      -803.418566 113.162687 -7.099677 0.000000
## REVOKED       464.971703 155.173002  2.996473 0.002739
## URBANICITY   1524.033915 136.731529 11.146178 0.000000
## PHD          -252.479571 185.347297 -1.362197 0.173173
## HS            425.653740 124.349425  3.423046 0.000622
## NOHS          616.658877 157.387319  3.918098 0.000090
## MANAGER      -843.520474 162.559687 -5.188989 0.000000
## MINIVAN      -514.629399 126.582870 -4.065553 0.000048
## PICKUP       -165.005969 145.517170 -1.133928 0.256858
## SPORTS        275.376716 171.928269  1.601695 0.109262
## log_INCOME     45.399087  25.548734  1.776960 0.075612
## log_TRAVTIME  -18.536784  11.106373 -1.669022 0.095151
## log_TIF         7.028143   2.130404  3.298971 0.000975
## log_MVR_PTS    -2.790571   0.757820 -3.682366 0.000233
## cat_KIDSDRIV  601.792137 177.410599  3.392087 0.000697
## cat_HOMEKIDS  179.571417 150.040877  1.196817 0.231413
## cat_CLM_FREQ  540.063343 114.635246  4.711146 0.000003

At a significance level of \(\alpha=0.05\), the \(R_{adj}^2\)-selected variables yield a TARGET_AMT MLR model with six insignificant variables: PHD, PICKUP, SPORTS, log_INCOME, log_TRAVTIME, and cat_HOMEKIDS. Removing those six insignificant variables yields a model with all significant variables.

adjustedr2_AMT <- lm(TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE + 
    REVOKED + URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF + 
    log_MVR_PTS + cat_KIDSDRIV + cat_CLM_FREQ, training_AMT)
summary(adjustedr2_AMT)
## 
## Call:
## lm(formula = TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE + REVOKED + 
##     URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF + log_MVR_PTS + 
##     cat_KIDSDRIV + cat_CLM_FREQ, data = training_AMT)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4997  -1694   -819    354 104984 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   774.2430   182.3082   4.247 2.19e-05 ***
## PARENT1       655.1922   177.4678   3.692 0.000224 ***
## MSTATUS      -581.7175   119.7388  -4.858 1.21e-06 ***
## CAR_USE      -730.5026   109.4617  -6.674 2.66e-11 ***
## REVOKED       473.8986   155.1903   3.054 0.002268 ** 
## URBANICITY   1469.2228   135.8413  10.816  < 2e-16 ***
## HS            496.2580   120.3904   4.122 3.79e-05 ***
## NOHS          676.2812   153.2939   4.412 1.04e-05 ***
## MANAGER      -884.6343   162.0158  -5.460 4.90e-08 ***
## MINIVAN      -540.8132   117.7404  -4.593 4.43e-06 ***
## log_TIF         7.1657     2.1302   3.364 0.000772 ***
## log_MVR_PTS    -2.8086     0.7578  -3.706 0.000212 ***
## cat_KIDSDRIV  682.5059   162.3736   4.203 2.66e-05 ***
## cat_CLM_FREQ  571.3661   114.3887   4.995 6.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4560 on 8147 degrees of freedom
## Multiple R-squared:  0.0617, Adjusted R-squared:  0.0602 
## F-statistic: 41.21 on 13 and 8147 DF,  p-value: < 2.2e-16

This model matches the model derived using Backward Elimination coefficient for coefficient; the only nominal difference is that cat_CLM_FREQ stands in for cat_OLDCLAIM, and the identical estimates indicate that the two indicators coincide in this data (a past payout of zero dollars accompanies zero past claims).
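
A quick check of that equivalence (TRUE means the two indicators, and hence the two fits, are interchangeable):

# Do the zero-payout and zero-claims indicators coincide on the training rows?
all(M[1:n, "cat_OLDCLAIM"] == M[1:n, "cat_CLM_FREQ"])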

Binary Logistic Regression

The maximum Adjusted \(R^2\) of 0.211793 for the model predicting TARGET_FLAG is reached when the model contains 24 variables.

model_sum_FLAG$which[which.max(model_sum_FLAG$adjr2), ]
##  (Intercept)          AGE      PARENT1      MSTATUS      CAR_USE 
##         TRUE        FALSE         TRUE         TRUE         TRUE 
##      REVOKED   URBANICITY          PHD      MASTERS    BACHELORS 
##         TRUE         TRUE         TRUE         TRUE         TRUE 
##           HS         NOHS     CLERICAL      MANAGER         PROF 
##        FALSE        FALSE         TRUE         TRUE        FALSE 
##      STUDENT  BLUE_COLLAR      MINIVAN       PICKUP       SPORTS 
##         TRUE         TRUE         TRUE         TRUE         TRUE 
##          SUV   log_INCOME log_TRAVTIME log_BLUEBOOK      log_TIF 
##         TRUE         TRUE         TRUE         TRUE         TRUE 
##  log_MVR_PTS cat_KIDSDRIV cat_HOMEKIDS      cat_YOJ cat_OLDCLAIM 
##         TRUE         TRUE         TRUE        FALSE         TRUE 
## cat_CLM_FREQ  cat_CAR_AGE 
##        FALSE        FALSE
adjustedr2_FLAG <- glm(TARGET_FLAG ~ 1 + PARENT1 + MSTATUS + CAR_USE + 
    REVOKED + URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL + MANAGER + 
    STUDENT + BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME + 
    log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV + 
    cat_HOMEKIDS + cat_CLM_FREQ, family = binomial(link = "logit"), training_FLAG)
round(coef(summary(adjustedr2_FLAG)), 6)
##               Estimate Std. Error    z value Pr(>|z|)
## (Intercept)  -2.554923   0.160423 -15.926206 0.000000
## PARENT1       0.217757   0.119008   1.829758 0.067286
## MSTATUS      -0.694028   0.074400  -9.328352 0.000000
## CAR_USE      -0.688893   0.080778  -8.528279 0.000000
## REVOKED       0.733365   0.079366   9.240333 0.000000
## URBANICITY    2.249695   0.111388  20.196878 0.000000
## PHD          -0.791338   0.133765  -5.915870 0.000000
## MASTERS      -0.508002   0.105529  -4.813852 0.000001
## BACHELORS    -0.485927   0.077640  -6.258734 0.000000
## CLERICAL      0.382111   0.106668   3.582245 0.000341
## MANAGER      -0.690384   0.109689  -6.293993 0.000000
## STUDENT       0.184715   0.126539   1.459743 0.144361
## BLUE_COLLAR   0.250666   0.103458   2.422876 0.015398
## MINIVAN      -0.435240   0.105575  -4.122578 0.000037
## PICKUP        0.219309   0.096226   2.279105 0.022661
## SPORTS        0.616909   0.118944   5.186530 0.000000
## SUV           0.364174   0.099838   3.647647 0.000265
## log_INCOME    0.100999   0.015677   6.442715 0.000000
## log_TRAVTIME -0.028743   0.007072  -4.064134 0.000048
## log_BLUEBOOK  0.059508   0.024005   2.479025 0.013174
## log_TIF       0.007729   0.001190   6.493015 0.000000
## log_MVR_PTS  -0.002189   0.000437  -5.008679 0.000001
## cat_KIDSDRIV  0.542256   0.094714   5.725173 0.000000
## cat_HOMEKIDS  0.252943   0.087625   2.886646 0.003894
## cat_CLM_FREQ  0.557205   0.060751   9.171972 0.000000

At a significance level of \(\alpha=0.05\), the \(R_{adj}^2\)-selected variables yield a TARGET_FLAG BLR model with two insignificant variables: PARENT1 and STUDENT. Removing those two insignificant variables yields a model with all significant variables.

adjustedr2_FLAG <- glm(TARGET_FLAG ~ 1 + MSTATUS + CAR_USE + REVOKED + 
    URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL + MANAGER + 
    BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME + 
    log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV + 
    cat_HOMEKIDS + cat_CLM_FREQ, family = binomial(link = "logit"), training_FLAG)
summary(adjustedr2_FLAG)
## 
## Call:
## glm(formula = TARGET_FLAG ~ 1 + MSTATUS + CAR_USE + REVOKED + 
##     URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL + MANAGER + 
##     BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME + 
##     log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV + 
##     cat_HOMEKIDS + cat_CLM_FREQ, family = binomial(link = "logit"), 
##     data = training_FLAG)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2016  -0.7231  -0.4212   0.6891   3.1351  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.4312711  0.1486779 -16.353  < 2e-16 ***
## MSTATUS      -0.7777777  0.0586439 -13.263  < 2e-16 ***
## CAR_USE      -0.7155003  0.0787814  -9.082  < 2e-16 ***
## REVOKED       0.7343615  0.0793348   9.256  < 2e-16 ***
## URBANICITY    2.2363335  0.1109238  20.161  < 2e-16 ***
## PHD          -0.8497653  0.1275975  -6.660 2.74e-11 ***
## MASTERS      -0.5639725  0.0979155  -5.760 8.42e-09 ***
## BACHELORS    -0.5101645  0.0752438  -6.780 1.20e-11 ***
## CLERICAL      0.3256558  0.0990172   3.289  0.00101 ** 
## MANAGER      -0.7098209  0.1082712  -6.556 5.53e-11 ***
## BLUE_COLLAR   0.1852830  0.0928333   1.996  0.04595 *  
## MINIVAN      -0.4147417  0.1048748  -3.955 7.67e-05 ***
## PICKUP        0.2317109  0.0958995   2.416  0.01568 *  
## SPORTS        0.6331507  0.1186074   5.338 9.39e-08 ***
## SUV           0.3802718  0.0993833   3.826  0.00013 ***
## log_INCOME    0.1067877  0.0151806   7.034 2.00e-12 ***
## log_TRAVTIME -0.0285046  0.0070522  -4.042 5.30e-05 ***
## log_BLUEBOOK  0.0604612  0.0239746   2.522  0.01167 *  
## log_TIF       0.0077168  0.0011898   6.486 8.82e-11 ***
## log_MVR_PTS  -0.0021850  0.0004367  -5.004 5.63e-07 ***
## cat_KIDSDRIV  0.5297128  0.0942272   5.622 1.89e-08 ***
## cat_HOMEKIDS  0.3609836  0.0685862   5.263 1.42e-07 ***
## cat_CLM_FREQ  0.5555339  0.0607168   9.150  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7446.2  on 8138  degrees of freedom
## AIC: 7492.2
## 
## Number of Fisher Scoring iterations: 5

This model is also identical to the model derived using Backward Elimination.
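
Because logit coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret. Below is a minimal sketch using the refit model above; the Wald intervals from confint.default are an assumption made here to avoid the slower profile-likelihood computation.

# Odds ratios with 95% Wald confidence intervals for adjustedr2_FLAG
round(exp(cbind(OR = coef(adjustedr2_FLAG), confint.default(adjustedr2_FLAG))), 3)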

Mallows \(C_p\)
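
Mallows \(C_p\) estimates the out-of-sample prediction error of a candidate subset: for a subset model with \(p\) parameters fit to \(n\) observations, \(C_p = \frac{SSE_p}{\hat{\sigma}^2} - n + 2p\), where \(\hat{\sigma}^2\) comes from the full model. Subsets with small \(C_p\) (close to \(p\)) are preferred, so the minimum-\(C_p\) subset is selected for each response below.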

par(mfrow=c(1,2))
plot(model_sum_AMT$cp, xlab = "Number of Variables", ylab = "Mallows Cp", main="TARGET_AMT")
plot(model_sum_FLAG$cp, xlab = "Number of Variables", ylab = "Mallows Cp", main="TARGET_FLAG")

cbind(min(model_sum_AMT$cp), which.min(model_sum_AMT$cp))
##          [,1] [,2]
## [1,] 12.81883   16
cbind(min(model_sum_FLAG$cp), which.min(model_sum_FLAG$cp))
##          [,1] [,2]
## [1,] 19.25552   24

Multiple Linear Regression

The minimum Mallows \(C_p\) of 12.8188345 for the model predicting TARGET_AMT is reached when the model contains 16 variables.

model_sum_AMT$which[which.min(model_sum_AMT$cp), ]
##  (Intercept)          AGE      PARENT1      MSTATUS      CAR_USE 
##         TRUE        FALSE         TRUE         TRUE         TRUE 
##      REVOKED   URBANICITY          PHD      MASTERS           HS 
##         TRUE         TRUE        FALSE        FALSE         TRUE 
##         NOHS      MANAGER      STUDENT  BLUE_COLLAR      MINIVAN 
##         TRUE         TRUE        FALSE        FALSE         TRUE 
##        TRUCK       PICKUP       SPORTS          VAN   log_INCOME 
##        FALSE        FALSE         TRUE        FALSE         TRUE 
## log_TRAVTIME log_BLUEBOOK      log_TIF  log_MVR_PTS cat_KIDSDRIV 
##         TRUE        FALSE         TRUE         TRUE         TRUE 
## cat_HOMEKIDS      cat_YOJ cat_OLDCLAIM cat_CLM_FREQ  cat_CAR_AGE 
##        FALSE        FALSE         TRUE        FALSE        FALSE
mallowscp_AMT <- lm(TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE + 
    REVOKED + URBANICITY + HS + NOHS + MANAGER + MINIVAN + SPORTS + 
    log_INCOME + log_TRAVTIME + log_TIF + log_MVR_PTS + 
    cat_KIDSDRIV + cat_CLM_FREQ, training_AMT)
round(coef(summary(mallowscp_AMT)), 6)
##                 Estimate Std. Error   t value Pr(>|t|)
## (Intercept)   762.784417 183.151125  4.164782 0.000031
## PARENT1       640.018132 177.569345  3.604328 0.000315
## MSTATUS      -586.551921 119.723526 -4.899220 0.000001
## CAR_USE      -785.707613 111.900046 -7.021513 0.000000
## REVOKED       470.841555 155.143313  3.034881 0.002414
## URBANICITY   1505.218992 136.333458 11.040716 0.000000
## HS            470.034608 120.940482  3.886495 0.000103
## NOHS          657.777887 153.639789  4.281299 0.000019
## MANAGER      -851.651499 162.435546 -5.243012 0.000000
## MINIVAN      -471.449326 121.477290 -3.880967 0.000105
## SPORTS        322.212013 168.353540  1.913901 0.055668
## log_INCOME     46.035671  25.544003  1.802210 0.071549
## log_TRAVTIME  -18.636693  11.105620 -1.678132 0.093360
## log_TIF         6.989195   2.130538  3.280483 0.001041
## log_MVR_PTS    -2.773052   0.757597 -3.660325 0.000253
## cat_KIDSDRIV  691.794967 162.331388  4.261622 0.000021
## cat_CLM_FREQ  552.008131 114.513722  4.820454 0.000001

At a significance level of \(\alpha=0.05\), the \(C_p\)-selected variables yield a TARGET_AMT MLR model with three insignificant variables: SPORTS, log_INCOME, and log_TRAVTIME. Removing those three insignificant variables yields a model in which all variables are significant.

mallowscp_AMT <- lm(TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE + 
    REVOKED + URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF + 
    log_MVR_PTS + cat_KIDSDRIV + cat_CLM_FREQ, training_AMT)
summary(mallowscp_AMT)
## 
## Call:
## lm(formula = TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE + REVOKED + 
##     URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF + log_MVR_PTS + 
##     cat_KIDSDRIV + cat_CLM_FREQ, data = training_AMT)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4997  -1694   -819    354 104984 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   774.2430   182.3082   4.247 2.19e-05 ***
## PARENT1       655.1922   177.4678   3.692 0.000224 ***
## MSTATUS      -581.7175   119.7388  -4.858 1.21e-06 ***
## CAR_USE      -730.5026   109.4617  -6.674 2.66e-11 ***
## REVOKED       473.8986   155.1903   3.054 0.002268 ** 
## URBANICITY   1469.2228   135.8413  10.816  < 2e-16 ***
## HS            496.2580   120.3904   4.122 3.79e-05 ***
## NOHS          676.2812   153.2939   4.412 1.04e-05 ***
## MANAGER      -884.6343   162.0158  -5.460 4.90e-08 ***
## MINIVAN      -540.8132   117.7404  -4.593 4.43e-06 ***
## log_TIF         7.1657     2.1302   3.364 0.000772 ***
## log_MVR_PTS    -2.8086     0.7578  -3.706 0.000212 ***
## cat_KIDSDRIV  682.5059   162.3736   4.203 2.66e-05 ***
## cat_CLM_FREQ  571.3661   114.3887   4.995 6.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4560 on 8147 degrees of freedom
## Multiple R-squared:  0.0617, Adjusted R-squared:  0.0602 
## F-statistic: 41.21 on 13 and 8147 DF,  p-value: < 2.2e-16

This model is also identical to the model derived using Backward Elimination.

Binary Logistic Regression

The minimum Mallows \(C_p\) of 19.2555192 for the model predicting TARGET_FLAG is reached when the model contains 24 variables.

model_sum_FLAG$which[which.min(model_sum_FLAG$cp), ]
##  (Intercept)          AGE      PARENT1      MSTATUS      CAR_USE 
##         TRUE        FALSE         TRUE         TRUE         TRUE 
##      REVOKED   URBANICITY          PHD      MASTERS    BACHELORS 
##         TRUE         TRUE         TRUE         TRUE         TRUE 
##           HS         NOHS     CLERICAL      MANAGER         PROF 
##        FALSE        FALSE         TRUE         TRUE        FALSE 
##      STUDENT  BLUE_COLLAR      MINIVAN       PICKUP       SPORTS 
##         TRUE         TRUE         TRUE         TRUE         TRUE 
##          SUV   log_INCOME log_TRAVTIME log_BLUEBOOK      log_TIF 
##         TRUE         TRUE         TRUE         TRUE         TRUE 
##  log_MVR_PTS cat_KIDSDRIV cat_HOMEKIDS      cat_YOJ cat_OLDCLAIM 
##         TRUE         TRUE         TRUE        FALSE         TRUE 
## cat_CLM_FREQ  cat_CAR_AGE 
##        FALSE        FALSE
mallowscp_FLAG <- glm(TARGET_FLAG ~ 1 + PARENT1 + MSTATUS + CAR_USE +
    REVOKED + URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL +
    MANAGER + STUDENT + BLUE_COLLAR + MINIVAN + PICKUP + SPORTS +
    SUV + log_INCOME + log_TRAVTIME + log_BLUEBOOK + log_TIF + 
    log_MVR_PTS + cat_KIDSDRIV + cat_HOMEKIDS + cat_CLM_FREQ, 
    family = binomial(link = "logit"), training_FLAG)
round(coef(summary(mallowscp_FLAG)), 6)
##               Estimate Std. Error    z value Pr(>|z|)
## (Intercept)  -2.554923   0.160423 -15.926206 0.000000
## PARENT1       0.217757   0.119008   1.829758 0.067286
## MSTATUS      -0.694028   0.074400  -9.328352 0.000000
## CAR_USE      -0.688893   0.080778  -8.528279 0.000000
## REVOKED       0.733365   0.079366   9.240333 0.000000
## URBANICITY    2.249695   0.111388  20.196878 0.000000
## PHD          -0.791338   0.133765  -5.915870 0.000000
## MASTERS      -0.508002   0.105529  -4.813852 0.000001
## BACHELORS    -0.485927   0.077640  -6.258734 0.000000
## CLERICAL      0.382111   0.106668   3.582245 0.000341
## MANAGER      -0.690384   0.109689  -6.293993 0.000000
## STUDENT       0.184715   0.126539   1.459743 0.144361
## BLUE_COLLAR   0.250666   0.103458   2.422876 0.015398
## MINIVAN      -0.435240   0.105575  -4.122578 0.000037
## PICKUP        0.219309   0.096226   2.279105 0.022661
## SPORTS        0.616909   0.118944   5.186530 0.000000
## SUV           0.364174   0.099838   3.647647 0.000265
## log_INCOME    0.100999   0.015677   6.442715 0.000000
## log_TRAVTIME -0.028743   0.007072  -4.064134 0.000048
## log_BLUEBOOK  0.059508   0.024005   2.479025 0.013174
## log_TIF       0.007729   0.001190   6.493015 0.000000
## log_MVR_PTS  -0.002189   0.000437  -5.008679 0.000001
## cat_KIDSDRIV  0.542256   0.094714   5.725173 0.000000
## cat_HOMEKIDS  0.252943   0.087625   2.886646 0.003894
## cat_CLM_FREQ  0.557205   0.060751   9.171972 0.000000

At a significance level of \(\alpha=0.05\), the \(C_p\)-selected variables yield a TARGET_FLAG BLR model with two insignificant variables: PARENT1 and STUDENT. Removing those two insignificant variables yields a model in which all variables are significant.

mallowscp_FLAG <- glm(TARGET_FLAG ~ 1 + MSTATUS + CAR_USE + 
    REVOKED + URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL +
    MANAGER + BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV +
    log_INCOME + log_TRAVTIME + log_BLUEBOOK + log_TIF +
    log_MVR_PTS + cat_KIDSDRIV + cat_HOMEKIDS + cat_CLM_FREQ, 
    family = binomial(link = "logit"), training_FLAG)
summary(mallowscp_FLAG)
## 
## Call:
## glm(formula = TARGET_FLAG ~ 1 + MSTATUS + CAR_USE + REVOKED + 
##     URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL + MANAGER + 
##     BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME + 
##     log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV + 
##     cat_HOMEKIDS + cat_CLM_FREQ, family = binomial(link = "logit"), 
##     data = training_FLAG)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2016  -0.7231  -0.4212   0.6891   3.1351  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.4312711  0.1486779 -16.353  < 2e-16 ***
## MSTATUS      -0.7777777  0.0586439 -13.263  < 2e-16 ***
## CAR_USE      -0.7155003  0.0787814  -9.082  < 2e-16 ***
## REVOKED       0.7343615  0.0793348   9.256  < 2e-16 ***
## URBANICITY    2.2363335  0.1109238  20.161  < 2e-16 ***
## PHD          -0.8497653  0.1275975  -6.660 2.74e-11 ***
## MASTERS      -0.5639725  0.0979155  -5.760 8.42e-09 ***
## BACHELORS    -0.5101645  0.0752438  -6.780 1.20e-11 ***
## CLERICAL      0.3256558  0.0990172   3.289  0.00101 ** 
## MANAGER      -0.7098209  0.1082712  -6.556 5.53e-11 ***
## BLUE_COLLAR   0.1852830  0.0928333   1.996  0.04595 *  
## MINIVAN      -0.4147417  0.1048748  -3.955 7.67e-05 ***
## PICKUP        0.2317109  0.0958995   2.416  0.01568 *  
## SPORTS        0.6331507  0.1186074   5.338 9.39e-08 ***
## SUV           0.3802718  0.0993833   3.826  0.00013 ***
## log_INCOME    0.1067877  0.0151806   7.034 2.00e-12 ***
## log_TRAVTIME -0.0285046  0.0070522  -4.042 5.30e-05 ***
## log_BLUEBOOK  0.0604612  0.0239746   2.522  0.01167 *  
## log_TIF       0.0077168  0.0011898   6.486 8.82e-11 ***
## log_MVR_PTS  -0.0021850  0.0004367  -5.004 5.63e-07 ***
## cat_KIDSDRIV  0.5297128  0.0942272   5.622 1.89e-08 ***
## cat_HOMEKIDS  0.3609836  0.0685862   5.263 1.42e-07 ***
## cat_CLM_FREQ  0.5555339  0.0607168   9.150  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7446.2  on 8138  degrees of freedom
## AIC: 7492.2
## 
## Number of Fisher Scoring iterations: 5

This model is also identical to the model derived using Backward Elimination.

Select Models

Multiple Linear Regression Model

Decide on the criteria for selecting the best multiple linear regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.

For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots.

sum1 <- summary(forward_AMT)
sum2 <- summary(backward_AMT)
sum3 <- summary(adjustedr2_AMT)
sum4 <- summary(mallowscp_AMT)

Multicollinearity

library(lmtest)
dwtest(forward_AMT)
## 
##  Durbin-Watson test
## 
## data:  forward_AMT
## DW = 1.988, p-value = 0.294
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(backward_AMT)
## 
##  Durbin-Watson test
## 
## data:  backward_AMT
## DW = 1.988, p-value = 0.2932
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(adjustedr2_AMT)
## 
##  Durbin-Watson test
## 
## data:  adjustedr2_AMT
## DW = 1.988, p-value = 0.2932
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(mallowscp_AMT)
## 
##  Durbin-Watson test
## 
## data:  mallowscp_AMT
## DW = 1.988, p-value = 0.2932
## alternative hypothesis: true autocorrelation is greater than 0

The Durbin-Watson test's null hypothesis is that the residuals are not autocorrelated. Since the p-values are large, we fail to reject the null hypothesis for all four models. Note that residual autocorrelation is distinct from multicollinearity (correlation among the predictors); the latter is typically assessed with variance inflation factors.
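
As a complementary multicollinearity check, the sketch below computes variance inflation factors for the forward-selected MLR model; this assumes the car package is available, which was not used elsewhere in this analysis.

library(car)
# Variance inflation factors; values above roughly 5-10 flag collinear predictors
vif(forward_AMT)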

Mean Squared Error and RMSE

summary(M[1:n, "TARGET_AMT"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    1504    1036  107600
data.frame("MODEL" = c("forward_AMT", "backward_AMT", "adjustedr2_AMT", "mallowscp_AMT"),
  "MSE" = c(sum1$sigma^2, sum2$sigma^2, sum3$sigma^2, sum4$sigma^2),
  "RMSE" = c(sum1$sigma, sum2$sigma, sum3$sigma, sum4$sigma))
##            MODEL      MSE     RMSE
## 1    forward_AMT 20795666 4560.227
## 2   backward_AMT 20795771 4560.238
## 3 adjustedr2_AMT 20795771 4560.238
## 4  mallowscp_AMT 20795771 4560.238

The Mean Squared Error is the square of the RMSE. The benefit of using the RMSE is that it is expressed in the same units as the target variable. For these models, the residual standard error (RMSE) is fairly large relative to the target variable: the standard deviation of the unexplained variation in TARGET_AMT is in the neighborhood of 4560, which is a large deviation from the 1504.325 average claim encountered in the data. These models can likely be improved upon by adding, combining, or transforming variables that were filtered out early in this analysis.
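
As a sanity check, the RMSE can be recomputed directly from the residuals. A minimal sketch using forward_AMT from above; note that summary()'s sigma differs slightly because it divides by the residual degrees of freedom rather than by n.

# RMSE from the raw residuals of the forward-selected model
sqrt(mean(residuals(forward_AMT)^2))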

\(R^2\) and Adjusted \(R^2\)

data.frame("MODEL" = c("forward_AMT", "backward_AMT", "adjustedr2_AMT", "mallowscp_AMT"),
  "R.SQUARED" = c(sum1$r.squared, sum2$r.squared, sum3$r.squared, sum4$r.squared),
  "ADJ.R.SQUARED" = c(sum1$adj.r.squared, sum2$adj.r.squared, sum3$adj.r.squared, sum4$adj.r.squared))
##            MODEL  R.SQUARED ADJ.R.SQUARED
## 1    forward_AMT 0.14876730    0.14730452
## 2   backward_AMT 0.06169727    0.06020004
## 3 adjustedr2_AMT 0.06169727    0.06020004
## 4  mallowscp_AMT 0.06169727    0.06020004

\(R^2\) represents the proportion of the variation in \(Y\) explained by the predictor variables, and it is fairly low for these models. \(R^2\), however, is not an adequate performance measure on its own when models have multiple variables. Adjusted \(R^2\) is more appropriate because it incorporates a penalty to account for the decrease in degrees of freedom from additional variables. The penalty does not improve the evaluation here: Adjusted \(R^2\) is less than the already low \(R^2\). Again, it is very likely that these models can be improved upon.
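
For reference, with \(n\) observations and \(p\) predictors the penalty takes the form \(R_{adj}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), so each added predictor must explain enough additional variation to offset the lost degree of freedom.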

\(F\)-statistic

data.frame("MODEL" = c("forward_AMT", "backward_AMT", "adjustedr2_AMT", "mallowscp_AMT"),
           rbind(sum1$fstatistic, sum2$fstatistic, sum3$fstatistic, sum4$fstatistic))
##            MODEL    value numdf dendf
## 1    forward_AMT 101.7019    14  8147
## 2   backward_AMT  41.2076    13  8147
## 3 adjustedr2_AMT  41.2076    13  8147
## 4  mallowscp_AMT  41.2076    13  8147

The \(F\)-test evaluates the null hypothesis that all regression coefficients are equal to zero against the alternative that at least one is not. At \(\alpha=0.01\), the \(F\)-statistics are all significant, which indicates that the effect of each model is not "a spurious result of oddities in the data set."
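
The corresponding p-value can be recovered from the stored statistic. A minimal sketch using the forward_AMT summary (sum1) from above:

# Upper-tail p-value of the F-statistic for forward_AMT
f <- sum1$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)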

Examine Residuals

par(mfrow = c(2,2))
plot(forward_AMT)

par(mfrow = c(2,2))
plot(backward_AMT)

par(mfrow = c(2,2))
plot(adjustedr2_AMT)

par(mfrow = c(2,2))
plot(mallowscp_AMT)

The Residuals vs Fitted plot shows that the residuals do not follow a linear pattern. The Normal Q-Q plot shows that the residuals are also not normally distributed. The Scale-Location plot appears to show some heteroscedasticity, since the line is not horizontal and the points are not equally (randomly) spread. The Residuals vs Leverage plot shows point 7691 taking extreme values outside the Cook's distance (dashed curve) that influence the (solid) regression line. Point 7691 also stands out in the other plots, along with points 5389 and 7072.
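
Those flagged observations can be examined directly. A sketch, assuming the labels 5389, 7072, and 7691 printed in the diagnostic plots are row names of the model frame (which is how plot.lm labels points):

# Cook's distances for the observations flagged in the diagnostic plots
cooks.distance(forward_AMT)[c("5389", "7072", "7691")]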

Model Selection

The model derived using Forward Selection has the lowest RMSE, although only by a minimal amount. When it comes to the \(F\)-statistic and Adjusted \(R^2\), however, the forward_AMT model has substantially higher values. The chosen model for TARGET_AMT is therefore forward_AMT.

Binary Logistic Regression Model

Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.

For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.

Confusion Matrix

library(caret)
training_FLAG[ ,"probability.forward"] <- predict(forward_FLAG, training_FLAG, type="response")
training_FLAG[ ,"class.forward"] <- ifelse(training_FLAG$probability.forward < 0.5, 0, 1)
(cm1 <- confusionMatrix(training_FLAG$class.forward, training_FLAG$TARGET_FLAG, positive = "1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5533 1297
##          1  475  856
##                                           
##                Accuracy : 0.7829          
##                  95% CI : (0.7738, 0.7918)
##     No Information Rate : 0.7362          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.363           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.3976          
##             Specificity : 0.9209          
##          Pos Pred Value : 0.6431          
##          Neg Pred Value : 0.8101          
##              Prevalence : 0.2638          
##          Detection Rate : 0.1049          
##    Detection Prevalence : 0.1631          
##       Balanced Accuracy : 0.6593          
##                                           
##        'Positive' Class : 1               
## 

The model derived using Forward Selection has the following performance metrics: Accuracy of 0.7828697, Error Rate of 0.2171303, Precision of 0.6431255, Sensitivity of 0.3975848, Specificity of 0.9209387, and \(F_1\) Score of 0.4913892.
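
The derived metrics above come straight from the stored caret object. A minimal sketch using cm1; the same pattern applies to cm2 through cm4:

# Error rate and F1 score from the stored confusion-matrix statistics
acc  <- cm1$overall["Accuracy"]
prec <- cm1$byClass["Pos Pred Value"]  # precision
sens <- cm1$byClass["Sensitivity"]
c(error.rate = 1 - acc, F1 = 2 * prec * sens / (prec + sens))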

training_FLAG[ ,"probability.backward"] <- predict(backward_FLAG, training_FLAG, type="response")
training_FLAG[ ,"class.backward"] <- ifelse(training_FLAG$probability.backward < 0.5, 0, 1)
(cm2 <- confusionMatrix(training_FLAG$class.backward, training_FLAG$TARGET_FLAG, positive = "1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5536 1295
##          1  472  858
##                                           
##                Accuracy : 0.7835          
##                  95% CI : (0.7744, 0.7924)
##     No Information Rate : 0.7362          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3647          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.3985          
##             Specificity : 0.9214          
##          Pos Pred Value : 0.6451          
##          Neg Pred Value : 0.8104          
##              Prevalence : 0.2638          
##          Detection Rate : 0.1051          
##    Detection Prevalence : 0.1630          
##       Balanced Accuracy : 0.6600          
##                                           
##        'Positive' Class : 1               
## 

The model derived using Backward Elimination has the following performance metrics: Accuracy of 0.7834824, Error Rate of 0.2165176, Precision of 0.6451128, Sensitivity of 0.3985137, Specificity of 0.9214381, and \(F_1\) Score of 0.4926787.

training_FLAG[ ,"probability.adjustedr2"] <- predict(adjustedr2_FLAG, training_FLAG, type="response")
training_FLAG[ ,"class.adjustedr2"] <- ifelse(training_FLAG$probability.adjustedr2 < 0.5, 0, 1)
(cm3 <- confusionMatrix(training_FLAG$class.adjustedr2, training_FLAG$TARGET_FLAG, positive = "1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5536 1295
##          1  472  858
##                                           
##                Accuracy : 0.7835          
##                  95% CI : (0.7744, 0.7924)
##     No Information Rate : 0.7362          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3647          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.3985          
##             Specificity : 0.9214          
##          Pos Pred Value : 0.6451          
##          Neg Pred Value : 0.8104          
##              Prevalence : 0.2638          
##          Detection Rate : 0.1051          
##    Detection Prevalence : 0.1630          
##       Balanced Accuracy : 0.6600          
##                                           
##        'Positive' Class : 1               
## 

The model derived using Adjusted \(R^2\) has the following performance metrics: Accuracy of 0.7834824, Error Rate of 0.2165176, Precision of 0.6451128, Sensitivity of 0.3985137, Specificity of 0.9214381, and \(F_1\) Score of 0.4926787. These metrics are identical to those from Backward Elimination since, as previously mentioned, both models are identical.

training_FLAG[ ,"probability.mallowscp"] <- predict(mallowscp_FLAG, training_FLAG, type="response")
training_FLAG[ ,"class.mallowscp"] <- ifelse(training_FLAG$probability.mallowscp < 0.5, 0, 1)
(cm4 <- confusionMatrix(training_FLAG$class.mallowscp, training_FLAG$TARGET_FLAG, positive = "1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5536 1295
##          1  472  858
##                                           
##                Accuracy : 0.7835          
##                  95% CI : (0.7744, 0.7924)
##     No Information Rate : 0.7362          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3647          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.3985          
##             Specificity : 0.9214          
##          Pos Pred Value : 0.6451          
##          Neg Pred Value : 0.8104          
##              Prevalence : 0.2638          
##          Detection Rate : 0.1051          
##    Detection Prevalence : 0.1630          
##       Balanced Accuracy : 0.6600          
##                                           
##        'Positive' Class : 1               
## 

The model derived using Mallows \(C_p\) has the following performance metrics: Accuracy of 0.7834824, Error Rate of 0.2165176, Precision of 0.6451128, Sensitivity of 0.3985137, Specificity of 0.9214381, and \(F_1\) Score of 0.4926787. These metrics are identical to those from Backward Elimination since, as previously mentioned, both models are identical.

ROC Curve

library(pROC)
par(mfrow=c(2,2))
plot(roc(training_FLAG$TARGET_FLAG, training_FLAG$class.forward, smooth=F), print.auc=TRUE)
plot(roc(training_FLAG$TARGET_FLAG, training_FLAG$class.backward, smooth=F), print.auc=TRUE)
plot(roc(training_FLAG$TARGET_FLAG, training_FLAG$probability.adjustedr2, smooth=F), print.auc=TRUE)
plot(roc(training_FLAG$TARGET_FLAG, training_FLAG$probability.mallowscp, smooth=F), print.auc=TRUE)

The models derived using Backward Elimination, Adjusted \(R^2\), and Mallows \(C_p\) yielded identical results and the highest Accuracy. The greatest Area Under the ROC Curve, however, belongs to the adjustedr2_FLAG and mallowscp_FLAG models, whose curves above are built from predicted probabilities rather than 0/1 classifications. As such, backward_FLAG is eliminated, and there is indifference between using the adjustedr2_FLAG and mallowscp_FLAG models for TARGET_FLAG.
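
The AUC values can also be extracted numerically for a side-by-side comparison. A minimal sketch using pROC (already loaded above); the same call applies to the other stored probability and class columns:

# Numeric AUC for the probability-based ROC of the selected model
auc(roc(training_FLAG$TARGET_FLAG, training_FLAG$probability.adjustedr2))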

Predictions

Make predictions using the evaluation data set.

validation <- M[(1+n):(m+n),]
probability <- predict(adjustedr2_FLAG, validation, type="response")
predict_FLAG <- ifelse(probability >= .5, 1, 0)
predict_AMT <- predict(forward_AMT, validation)
predict_AMT[predict_FLAG == 0] <- 0
predictions <- data.frame("predict_FLAG" = predict_FLAG, "predict_AMT" = predict_AMT)
display(predictions)
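
For submission, the predictions can also be written out to disk. A sketch; the file name is an assumption:

# Persist evaluation-set predictions (probability threshold 0.5 applied above)
write.csv(predictions, "insurance_evaluation_predictions.csv", row.names = FALSE)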

Prevalence

n <- sum(training_FLAG$TARGET_FLAG)
N <- nrow(training_FLAG)
m <- sum(predict_FLAG)
M <- length(predict_FLAG)
binom.test(m, M, n / N)
## 
##  Exact binomial test
## 
## data:  m and M
## number of successes = 383, number of trials = 2141, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.2638157
## 95 percent confidence interval:
##  0.1628687 0.1957921
## sample estimates:
## probability of success 
##              0.1788884

The prevalence of the positive condition is 26.38% in the training data and 17.89% in the evaluation-data predictions. The above binomial test shows that this difference is statistically significant at \(\alpha = 0.05\) (p-value < 2.2e-16). It is, however, consistent with the in-sample Detection Prevalence of roughly 16.3% seen in the confusion matrices above: at the 0.5 threshold, the models assign the positive class less often than it actually occurs.

References

https://rpubs.com/josezuniga/262383

https://rpubs.com/josezuniga/253955

http://data.princeton.edu/R/glms.html

http://www.statmethods.net/advstats/glm.html

http://www.theanalysisfactor.com/what-is-logit-function/

http://stats.stackexchange.com/questions/186081/is-multicollinearity-an-issue-when-doing-stepwise-logistic-regression-using-aic