In this homework assignment, you will explore, analyze, and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0: a "1" means that the person was in a car crash, and a "0" means that the person was not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did crash, it is a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
VARIABLE | DESCRIPTION | THEORETICAL EFFECT |
---|---|---|
INDEX | Identification Variable (do not use) | None |
TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | None |
TARGET_AMT | If car was in a crash, what was the cost | None |
AGE | Age of Driver | Very young people tend to be risky; maybe very old people also |
BLUEBOOK | Value of Vehicle | Unknown effect, but probably affects the payout if there is a crash |
CAR_AGE | Vehicle Age | Unknown effect, but probably affects the payout if there is a crash |
CAR_TYPE | Type of Car | Unknown effect, but probably affects the payout if there is a crash |
CAR_USE | Vehicle Use | Commercial vehicles are driven more, so might increase probability of collision |
CLM_FREQ | # Claims (Past 5 Years) | The more claims filed in the past, the more likely to file in the future |
EDUCATION | Max Education Level | Unknown effect, but in theory more educated people tend to drive more safely |
HOMEKIDS | # Children at Home | Unknown effect |
HOME_VAL | Home Value | In theory, home owners tend to drive more responsibly |
INCOME | Income | In theory, rich people tend to get into fewer crashes |
JOB | Job Category | In theory, white collar jobs tend to be safer |
KIDSDRIV | # Driving Children | When teenagers drive a car, they are more likely to get into crashes |
MSTATUS | Marital Status | In theory, married people drive more safely |
MVR_PTS | Motor Vehicle Record Points | Drivers with lots of traffic tickets tend to get into more crashes |
OLDCLAIM | Total Claims Past 5 Years | A high total payout over the past five years suggests high future payouts |
PARENT1 | Single Parent | Unknown effect |
RED_CAR | A Red Car | Urban legend says that red cars (especially red sports cars) are more risky |
REVOKED | License Revoked Past 7 Years | If license revoked in past 7 years, driver is probably more risky |
SEX | Gender | Urban legend says that women have fewer crashes than men |
TIF | Time in Force | People who have been customers for a long time are usually more safe |
TRAVTIME | Distance to Work | Long drives to work usually suggest greater risk |
URBANICITY | Home/Work Area | Unknown |
YOJ | Years on Job | People who stay at a job for a long time are usually more safe |
A write-up submitted in PDF format. Your write-up should have four sections, each of which is described below. You may assume you are addressing me as a fellow data scientist, so you need not shy away from technical details. Assign predictions (probabilities, classifications, cost) to the evaluation data set, using a 0.5 classification threshold.
training <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
"SPS/master/DATA%20621/insurance_training_data.csv"), na.strings = "")
evaluation <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
"SPS/master/DATA%20621/insurance-evaluation-data.csv"), na.strings = "")
M <- rbind(training, evaluation) # Merged
n <- nrow(training)   # training is M[1:n, ]; predictor columns are 4:26
m <- nrow(evaluation) # evaluation is M[(1+n):(m+n), ]
# X tracks which columns of M remain eligible predictors for each response
X <- data.frame("TARGET_FLAG" = rep(T, ncol(M)),
"TARGET_AMT" = rep(T, ncol(M)))
X[match(c("INDEX", "TARGET_AMT"), names(M)), "TARGET_FLAG"] <- F
X[match(c("INDEX", "TARGET_FLAG"), names(M)), "TARGET_AMT"] <- F
quantitative <- c(4:8, 10, 15, 17, 18, 21, 22, 24, 25)
names(M[quantitative])
## [1] "KIDSDRIV" "AGE" "HOMEKIDS" "YOJ" "INCOME" "HOME_VAL"
## [7] "TRAVTIME" "BLUEBOOK" "TIF" "OLDCLAIM" "CLM_FREQ" "MVR_PTS"
## [13] "CAR_AGE"
categorical <- c(13, 14, 19)
names(M[categorical])
## [1] "EDUCATION" "JOB" "CAR_TYPE"
binary <- c(9, 11, 12, 16, 20, 23, 26)
names(M[binary])
## [1] "PARENT1" "MSTATUS" "SEX" "CAR_USE" "RED_CAR"
## [6] "REVOKED" "URBANICITY"
Currency_Convert <- function(Field) {
  # Strip "$" and "," from currency strings, then coerce to numeric
  as.numeric(gsub("\\$|,", "", Field))
}
Binary_Convert <- function(Field, Neg, Pos) {
  # Map the Neg level to 0 and the Pos level to 1, returning a numeric vector
  Field <- as.character(Field)
  Field[which(Field == Neg)] <- 0
  Field[which(Field == Pos)] <- 1
  as.numeric(Field)
}
M$INCOME <- Currency_Convert(M$INCOME)
M$PARENT1 <- Binary_Convert(M$PARENT1, "No", "Yes")
M$HOME_VAL <- Currency_Convert(M$HOME_VAL)
M$MSTATUS <- Binary_Convert(M$MSTATUS, "z_No", "Yes")
M$SEX <- Binary_Convert(M$SEX, "M", "z_F")
M$CAR_USE <- Binary_Convert(M$CAR_USE, "Commercial", "Private")
M$BLUEBOOK <- Currency_Convert(M$BLUEBOOK)
M$RED_CAR <- Binary_Convert(M$RED_CAR, "no", "yes")
M$OLDCLAIM <- Currency_Convert(M$OLDCLAIM)
M$REVOKED <- Binary_Convert(M$REVOKED, "No", "Yes")
M$URBANICITY <- Binary_Convert(M$URBANICITY, "z_Highly Rural/ Rural", "Highly Urban/ Urban")
M$CAR_AGE[which(M$CAR_AGE < 0)] <- NA
M$HOME_VAL[which(M$HOME_VAL == 0)] <- NA
Several variables use currency formatting or are categorical in nature; these were converted to numeric and binary, respectively. In addition, negative CAR_AGE values and HOME_VAL values of zero were set to missing as apparent data errors.
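As a quick sanity check (a sketch, not part of the original cleaning), the converted currency fields should now be numeric and the converted binary fields should contain only 0, 1, or NA:
# Spot-check the conversions
str(M[, c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")])
sapply(M[, c("PARENT1", "MSTATUS", "SEX", "CAR_USE", "RED_CAR", "REVOKED", "URBANICITY")],
       function(x) sort(unique(x)))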
Describe the size and the variables in the insurance training data set. Keep in mind that too much detail will cause a manager to lose interest, while too little detail will make the manager think you aren't doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss; these are just ideas.
library("DT")
display <- function(data) {
datatable(data, options = list(
searching = TRUE,
pageLength = 5,
lengthMenu = c(5, nrow(data))
), rownames = FALSE)
}
all(complete.cases(M[1:n, -1]))
## [1] FALSE
summary(M[1:n, -1])
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE
## Min. :0.0000 Min. : 0 Min. :0.0000 Min. :16.00
## 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00
## Median :0.0000 Median : 0 Median :0.0000 Median :45.00
## Mean :0.2638 Mean : 1504 Mean :0.1711 Mean :44.79
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00
## Max. :1.0000 Max. :107586 Max. :4.0000 Max. :81.00
## NA's :6
## HOMEKIDS YOJ INCOME PARENT1
## Min. :0.0000 Min. : 0.0 Min. : 0 Min. :0.000
## 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28097 1st Qu.:0.000
## Median :0.0000 Median :11.0 Median : 54028 Median :0.000
## Mean :0.7212 Mean :10.5 Mean : 61898 Mean :0.132
## 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85986 3rd Qu.:0.000
## Max. :5.0000 Max. :23.0 Max. :367030 Max. :1.000
## NA's :454 NA's :445
## HOME_VAL MSTATUS SEX EDUCATION
## Min. : 50223 Min. :0.0000 Min. :0.0000 <High School :1203
## 1st Qu.:153074 1st Qu.:0.0000 1st Qu.:0.0000 Bachelors :2242
## Median :206692 Median :1.0000 Median :1.0000 Masters :1658
## Mean :220621 Mean :0.5997 Mean :0.5361 PhD : 728
## 3rd Qu.:270023 3rd Qu.:1.0000 3rd Qu.:1.0000 z_High School:2330
## Max. :885282 Max. :1.0000 Max. :1.0000
## NA's :2758
## JOB TRAVTIME CAR_USE BLUEBOOK
## z_Blue Collar:1825 Min. : 5.00 Min. :0.0000 Min. : 1500
## Clerical :1271 1st Qu.: 22.00 1st Qu.:0.0000 1st Qu.: 9280
## Professional :1117 Median : 33.00 Median :1.0000 Median :14440
## Manager : 988 Mean : 33.49 Mean :0.6288 Mean :15710
## Lawyer : 835 3rd Qu.: 44.00 3rd Qu.:1.0000 3rd Qu.:20850
## (Other) :1599 Max. :142.00 Max. :1.0000 Max. :69740
## NA's : 526
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 Min. :0.0000 Min. : 0
## 1st Qu.: 1.000 Panel Truck: 676 1st Qu.:0.0000 1st Qu.: 0
## Median : 4.000 Pickup :1389 Median :0.0000 Median : 0
## Mean : 5.351 Sports Car : 907 Mean :0.2914 Mean : 4037
## 3rd Qu.: 7.000 Van : 750 3rd Qu.:1.0000 3rd Qu.: 4636
## Max. :25.000 z_SUV :2294 Max. :1.0000 Max. :57037
##
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.00
## Median :0.0000 Median :0.0000 Median : 1.000 Median : 8.00
## Mean :0.7986 Mean :0.1225 Mean : 1.696 Mean : 8.33
## 3rd Qu.:2.0000 3rd Qu.:0.0000 3rd Qu.: 3.000 3rd Qu.:12.00
## Max. :5.0000 Max. :1.0000 Max. :13.000 Max. :28.00
## NA's :511
## URBANICITY
## Min. :0.0000
## 1st Qu.:1.0000
## Median :1.0000
## Mean :0.7955
## 3rd Qu.:1.0000
## Max. :1.0000
##
Looking at the data summaries for the training data set, we can see that several variables have notable numbers of NAs, with HOME_VAL by far the highest, followed by JOB, CAR_AGE, YOJ, and INCOME. We can also see multiple skewed variables with large maximums relative to their means and medians.
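Skewness can also be quantified rather than eyeballed; below is a minimal sketch using the standardized third central moment (base R only, so no additional packages are assumed):
# Sample skewness: standardized third central moment
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
sort(sapply(M[1:n, quantitative], skewness), decreasing = TRUE)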
par(mfrow = c(4,4), cex=.4)
for (i in c(quantitative, categorical)) {
plot(M[1:n, i], main = names(M[i]))
}
Not many of the nonbinary variables appear to approach normality.
plot(M[1:n, c(quantitative, categorical)])
Although it is somewhat hard to see here, relationships exist between many of the quantitative variables.
par(mfrow = c(2,4))
for (i in c(quantitative, categorical)) {
if(is.numeric(M[1:n, i])) {
hist(M[1:n, i], xlab = names(M[i]), main = names(M[i]))
d <- density(M[1:n, i], na.rm=T)
}
else {
plot(M[1:n, i], xlab = names(M[i]), main = names(M[i]))
d <- density(as.numeric(M[1:n, i]), na.rm=T)
}
plot(d, main = names(M[i]))
polygon(d, col="red")
}
The histograms and density plots give a better understanding of how the nonbinary data are distributed. The distribution of the variable AGE could pass for some variation of a Gaussian or Student-\(t\) distribution. The variables INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, and MVR_PTS show heavy skewing. The variables KIDSDRIV, HOMEKIDS, YOJ, OLDCLAIM, CLM_FREQ, and CAR_AGE have multimodal distributions.
library(ggplot2)
library(reshape2)
ggplot(data = melt(abs(cor(sapply(na.omit(training), as.numeric)))), aes(x=Var1, y=Var2, fill=value)) +
scale_fill_gradient(low = 'black', high = 'red', name = "Absolute Value") +
geom_tile() + labs(title = "Correlation Heatmap") +
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(hjust = 0.5))
There is some correlation visible between the predictor variables and the response variables, but not very much. The notable correlations are those among the predictor variables themselves: MVR_PTS is correlated with CLM_FREQ and OLDCLAIM; RED_CAR with SEX and CAR_TYPE; CAR_USE with JOB; MSTATUS with PARENT1 and HOME_VAL; and HOMEKIDS with AGE and PARENT1.
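The strongest pairs can also be extracted numerically rather than read off the heatmap; a sketch reusing the same melted correlation matrix (each pair appears twice because the matrix is symmetric):
corr_long <- melt(abs(cor(sapply(na.omit(training), as.numeric))))
corr_long <- corr_long[corr_long$Var1 != corr_long$Var2, ] # drop the diagonal
head(corr_long[order(-corr_long$value), ], 10)             # strongest pairs, duplicated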
# Run PCA on centered, scaled data; print component summaries and draw diagnostics
PCA <- function(X) {
Xpca <- prcomp(na.omit(X), center = T, scale. = T)
M <- as.matrix(na.omit(X)); R <- as.matrix(Xpca$rotation); score <- M %*% R
print(list("Importance of Components" = summary(Xpca)$importance[ ,1:5],
"Rotation (Variable Loadings)" = Xpca$rotation[ ,1:5],
"Correlation between X and PC" = cor(na.omit(X), score)[ ,1:5]))
par(mfrow=c(2,3))
barplot(Xpca$sdev^2, ylab = "Component Variance")
barplot(cor(cbind(X)), ylab = "Correlations")
barplot(Xpca$rotation, ylab = "Loadings")
biplot(Xpca); barplot(M); barplot(score)
}
PCA(M[1:n, quantitative])
## $`Importance of Components`
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.645244 1.334995 1.270056 1.012738 0.9938183
## Proportion of Variance 0.208220 0.137090 0.124080 0.078900 0.0759700
## Cumulative Proportion 0.208220 0.345310 0.469390 0.548290 0.6242600
##
## $`Rotation (Variable Loadings)`
## PC1 PC2 PC3 PC4 PC5
## KIDSDRIV 0.10774567 -0.17584383 0.50530195 -0.07331372 0.083069165
## AGE -0.26771011 0.02800193 -0.33747972 0.09204957 0.398329669
## HOMEKIDS 0.22780686 -0.15800486 0.61071654 -0.03742552 -0.070548997
## YOJ -0.14018772 -0.15622254 0.30537124 0.16303338 0.482290631
## INCOME -0.53361654 -0.15978248 0.16457623 -0.05878797 -0.111702800
## HOME_VAL -0.54258303 -0.15297470 0.13345360 -0.05556924 -0.070094419
## TRAVTIME 0.03412860 0.01807301 -0.03461120 -0.66695073 0.612187440
## BLUEBOOK -0.34153090 -0.09792299 0.09498366 -0.05047749 0.005257043
## TIF 0.01113475 0.06077082 0.08346442 0.70417523 0.407479904
## OLDCLAIM 0.10846886 -0.53950381 -0.18816292 0.08137372 0.027275343
## CLM_FREQ 0.12435306 -0.57054584 -0.21337557 0.02350619 0.041684584
## MVR_PTS 0.12635248 -0.48204768 -0.13797155 -0.02119042 -0.055156727
## CAR_AGE -0.32388952 -0.08617776 -0.04999174 0.02775244 -0.181923195
##
## $`Correlation between X and PC`
## PC1 PC2 PC3 PC4 PC5
## KIDSDRIV 0.03279334 0.026553825 -0.03415518 0.03430623 0.03304671
## AGE -0.21530718 -0.207800075 0.21108534 -0.21467167 -0.20309336
## HOMEKIDS 0.14930487 0.138457016 -0.14945042 0.15094329 0.14520501
## YOJ -0.20724065 -0.209543450 0.20479944 -0.20430965 -0.20661069
## INCOME -0.98252529 -0.961368639 0.98208694 -0.97955171 -0.98781092
## HOME_VAL -0.99539680 -0.973818412 0.99092055 -0.99071401 -0.99286141
## TRAVTIME 0.03450769 0.037350720 -0.03374552 0.03207236 0.03588214
## BLUEBOOK -0.43933895 -0.428786579 0.44044267 -0.45038023 -0.40712199
## TIF 0.02086626 0.026174257 -0.01892478 0.01915871 0.02069475
## OLDCLAIM 0.06486343 -0.155449228 -0.13116239 0.13800364 0.07289718
## CLM_FREQ 0.07562867 -0.034894080 -0.10831923 0.11178715 0.07912375
## MVR_PTS 0.07488695 0.009564432 -0.09370134 0.09613452 0.07548462
## CAR_AGE -0.37196326 -0.364430487 0.37128491 -0.37062402 -0.37290793
The correlation matrix between the original variables and the principal components indicates that the majority of the variance in the data comes from, in order of significance, HOME_VAL and INCOME. To a much lesser extent, BLUEBOOK and CAR_AGE also contribute to the variance.
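A cumulative scree plot makes the question of how many components carry most of the variance concrete; a short sketch re-running the same PCA:
Xpca <- prcomp(na.omit(M[1:n, quantitative]), center = TRUE, scale. = TRUE)
cumvar <- cumsum(Xpca$sdev^2) / sum(Xpca$sdev^2)
plot(cumvar, type = "b", xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance")
abline(h = 0.9, lty = 2) # reference: 90% of total variance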
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
library(VIM)
all(complete.cases(training))
## [1] FALSE
aggr(M[1:n, 4:26], bars=F, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## HOME_VAL 0.337948781
## JOB 0.064452886
## CAR_AGE 0.062614876
## YOJ 0.055630437
## INCOME 0.054527631
## AGE 0.000735204
## KIDSDRIV 0.000000000
## HOMEKIDS 0.000000000
## PARENT1 0.000000000
## MSTATUS 0.000000000
## SEX 0.000000000
## EDUCATION 0.000000000
## TRAVTIME 0.000000000
## CAR_USE 0.000000000
## BLUEBOOK 0.000000000
## TIF 0.000000000
## CAR_TYPE 0.000000000
## RED_CAR 0.000000000
## OLDCLAIM 0.000000000
## CLM_FREQ 0.000000000
## REVOKED 0.000000000
## MVR_PTS 0.000000000
## URBANICITY 0.000000000
We have notable amounts of missing values in the HOME_VAL, JOB, CAR_AGE, YOJ, and INCOME variables. There is also a small number of missing values in AGE.
Likely_Value <- function(Field_1, Field_2, Value) {
# Mode for Field_1 for given Value of Field_2
frequencies <- table(Field_1[which(Field_2 == Value)])
most_frequent <- names(sort(frequencies, decreasing = TRUE)[1])
return(most_frequent)
}
M$JOB[(is.na(M$JOB) & M$EDUCATION == "PhD")] <- Likely_Value(M$JOB, M$EDUCATION, "PhD")
M$JOB[(is.na(M$JOB) & M$EDUCATION == "Masters")] <- Likely_Value(M$JOB, M$EDUCATION, "Masters")
M$JOB[(is.na(M$JOB) & M$EDUCATION == "Bachelors")] <- Likely_Value(M$JOB, M$EDUCATION, "Bachelors")
M$JOB[(is.na(M$JOB) & M$EDUCATION == "z_High School")] <- Likely_Value(M$JOB, M$EDUCATION, "z_High School")
M$JOB[(is.na(M$JOB) & M$EDUCATION == "<High School")] <- Likely_Value(M$JOB, M$EDUCATION, "<High School")
Assuming that education level can serve as a reasonable proxy for a person's job, missing JOB values are imputed with the most frequent (modal) JOB for each EDUCATION level. For example, if most people with education level E hold job J, then a record with education level E and a missing JOB value is assigned job J.
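As a quick check (a sketch): EDUCATION has no missing values, so no JOB value should remain NA after the mode imputation, and each education level should now lean toward its modal job:
sum(is.na(M$JOB))                  # expect 0 after mode imputation
table(M$EDUCATION, M$JOB)["PhD", ] # PhD row dominated by its modal job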
library(mice)
## Warning: package 'mice' was built under R version 3.3.3
MICE <- mice(M[1:n, 4:26], predictorMatrix = quickpred(M[1:n, 4:26]), method = "mean", printFlag = F)
M[1:n, 4:26] <- complete(MICE, action = 1)
MICE <- mice(M[(1+n):(m+n), ], predictorMatrix = quickpred(M[(1+n):(m+n), ]), method = "mean", printFlag = F)
M[(1+n):(m+n), ] <- complete(MICE, action = 1)
M$CAR_AGE <- as.integer(M$CAR_AGE)
aggr(M[1:n, 4:26], bars=F, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## KIDSDRIV 0
## AGE 0
## HOMEKIDS 0
## YOJ 0
## INCOME 0
## PARENT1 0
## HOME_VAL 0
## MSTATUS 0
## SEX 0
## EDUCATION 0
## JOB 0
## TRAVTIME 0
## CAR_USE 0
## BLUEBOOK 0
## TIF 0
## CAR_TYPE 0
## RED_CAR 0
## OLDCLAIM 0
## CLM_FREQ 0
## REVOKED 0
## MVR_PTS 0
## CAR_AGE 0
## URBANICITY 0
Remaining missing values were replaced with mean values using the mice package (Multivariate Imputation by Chained Equations) with its "mean" imputation method.
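Confirming the result (a sketch): no missing values should remain in either partition after imputation.
all(complete.cases(M[1:n, 4:26]))         # training rows
all(complete.cases(M[(1+n):(m+n), 4:26])) # evaluation rows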
M$PHD <- ifelse(M$EDUCATION == "PhD", 1, 0)
M$MASTERS <- ifelse(M$EDUCATION == "Masters", 1, 0)
M$BACHELORS <- ifelse(M$EDUCATION == "Bachelors", 1, 0)
M$HS <- ifelse(M$EDUCATION == "z_High School", 1, 0)
M$NOHS <- ifelse(M$EDUCATION == "<High School", 1, 0)
M$CLERICAL <- ifelse(M$JOB == "Clerical", 1, 0)
M$DOCTOR <- ifelse(M$JOB == "Doctor", 1, 0)
M$HOME_MAKER <- ifelse(M$JOB == "Home Maker", 1, 0)
M$LAWYER <- ifelse(M$JOB == "Lawyer", 1, 0)
M$MANAGER <- ifelse(M$JOB == "Manager", 1, 0)
M$PROF <- ifelse(M$JOB == "Professional", 1, 0)
M$STUDENT <- ifelse(M$JOB == "Student", 1, 0)
M$BLUE_COLLAR <- ifelse(M$JOB == "z_Blue Collar", 1, 0)
M$MINIVAN <- ifelse(M$CAR_TYPE == "Minivan", 1, 0)
M$TRUCK <- ifelse(M$CAR_TYPE == "Panel Truck", 1, 0)
M$PICKUP <- ifelse(M$CAR_TYPE == "Pickup", 1, 0)
M$SPORTS <- ifelse(M$CAR_TYPE == "Sports Car", 1, 0)
M$VAN <- ifelse(M$CAR_TYPE == "Van", 1, 0)
M$SUV <- ifelse(M$CAR_TYPE == "z_SUV", 1, 0)
remove <- c("EDUCATION", "JOB", "CAR_TYPE")
X <- rbind(X, data.frame("TARGET_FLAG" = rep(T, ncol(M)-nrow(X)),
"TARGET_AMT" = rep(T, ncol(M)-nrow(X))))
X[match(remove, names(M)), ] <- F
The categorical variables EDUCATION, JOB, and CAR_TYPE were expanded into binary indicator variables.
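For reference, the same indicator expansion can be produced automatically with base R's model.matrix; a sketch (column names will differ from the hand-built dummies above, and with -1 only the first factor keeps all of its levels):
dummies <- model.matrix(~ EDUCATION + JOB + CAR_TYPE - 1, data = M)
head(colnames(dummies))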
library(reshape2)
# Correlate each column of X with the single-column response Y,
# recording r, its p-value, and a significance flag at the 0.05 level
Corr_XY <- function(X, Y) {
corr <- data.frame(array(NA, dim = c(ncol(X), 5)))
colnames(corr) <- c("Y", "X", "r","p","<0.05")
for (i in 1:ncol(X)) {
r <- cor.test(Y[, 1], X[, i])
corr[i, 1] <- names(Y)
corr[i, 2] <- names(X[i])
corr[i, 3] <- r$estimate
corr[i, 4] <- r$p.value
corr[i, 5] <- corr[i, 4] < 0.05
}
return(corr)
}
# Pairwise correlations among the columns of X, flagging pairs with |r| >= threshold
Corr_XX <- function(X, threshold) {
corr <- data.frame(array(NA, dim = c(choose(ncol(X), 2), 5)))
colnames(corr) <- c("X1", "X2", "r","p","<0.05"); k = 1
for (i in 1:(ncol(X) - 1)) {
for (j in (i+1):ncol(X)) {
r <- cor.test(X[,i], X[,j])
corr[k, 1] <- names(X[i])
corr[k, 2] <- names(X[j])
corr[k, 3] <- r$estimate
corr[k, 4] <- r$p.value
corr[k, 5] <- corr[k, 4] < 0.05
k = k + 1
}
}
least <- corr[corr[,"<0.05"] == F, ]
most <- corr[abs(corr[,"r"]) >= threshold, ]
result <- list("Correlations" = corr, "Least_Correlated"= least, "Most_Correlated" = most)
return(result)
}
The specification M[1:n, -c(1:3, categorical)] creates a data frame excluding the INDEX, TARGET_FLAG, TARGET_AMT, and categorical variables. The specification M[1:n, 3, drop = FALSE] creates a data frame with the \(Y\) of interest, TARGET_AMT, and retains the column name.
correlations <- Corr_XY(M[1:n, -c(1:3, categorical)], M[1:n, 3, drop = FALSE])
display(correlations)
The predictor variables SEX, BLUEBOOK, RED_CAR, BACHELORS, CLERICAL, HOME_MAKER, PROF, LAWYER, YOJ, and SUV do not have statistically significant correlations with the response variable and are therefore not considered for the model. The variable YOJ sits at the threshold of statistical significance and is removed along with them.
remove <- c("SEX", "BLUEBOOK", "RED_CAR", "BACHELORS", "CLERICAL", "HOME_MAKER", "PROF", "LAWYER", "YOJ", "SUV")
X[match(remove, names(M)), "TARGET_AMT"] <- F
The specification M[1:n, -c(1:3, categorical)] creates a data frame excluding the INDEX, TARGET_FLAG, TARGET_AMT, and categorical variables. The specification M[1:n, 2, drop = FALSE] creates a data frame with the \(Y\) of interest, TARGET_FLAG, and retains the column name.
correlations <- Corr_XY(M[1:n, -c(1:3, categorical)], M[1:n, 2, drop = FALSE])
display(correlations)
The predictor variables SEX, RED_CAR, HOME_MAKER, TRUCK, and VAN do not have statistically significant correlations with the response variable and are therefore not considered for the model.
remove <- c("SEX", "RED_CAR", "HOME_MAKER", "TRUCK", "VAN")
X[match(remove, names(M)), "TARGET_FLAG"] <- F
correlations <- Corr_XX(M[1:n, (X[,"TARGET_AMT"] & X[,"TARGET_FLAG"])], 0.50)
display(correlations$Least_Correlated)
display(correlations$Most_Correlated)
The specification M[1:n, (X[,"TARGET_AMT"] & X[,"TARGET_FLAG"])] creates a data frame excluding INDEX, TARGET_FLAG, TARGET_AMT, and the variables previously marked for removal for lacking statistically significant correlations with the response variables. There are strong, statistically significant correlations between HOME_VAL & INCOME, PHD & DOCTOR, and MASTERS & LAWYER. From these correlated pairs, we find that HOME_VAL, DOCTOR, and LAWYER are the least correlated with both TARGET_FLAG and TARGET_AMT. These three variables will therefore not be considered for the model. It is worth noting that the high correlation between PHD & DOCTOR and between MASTERS & LAWYER is likely an artifact of the earlier mode imputation of JOB from EDUCATION.
remove <- c("HOME_VAL", "DOCTOR", "LAWYER")
X[match(remove, names(M)), ] <- F
library(MASS)
columns <- c("INCOME", "HOME_VAL", "TRAVTIME", "BLUEBOOK", "TIF", "MVR_PTS")
# Fit an exponential distribution to each listed field (shifted to be strictly
# positive) and overlay a simulated fitted density on each histogram
fit_exp <- function(X, fields) {
potential <- match(fields, names(X))
lambda <- numeric(ncol(X))
par(mfrow=c(2,3))
for (i in potential) {
shifted <- X[, i] - min(X[, i]) + 1e-32
fit_exp <- fitdistr(shifted, "Exponential")
lambda[i] <- fit_exp$estimate
exp <- rexp(1000, lambda[i])
hist(X[, i], prob=TRUE, col="grey", main =names(X[i]),
xlab=paste("Lambda =",fractions(lambda[i])))
lines(density(exp), col="blue", lwd=2)
}
lambda <- data.frame("VARIABLE"=fields, "LAMBDA"=lambda[potential])
return(lambda)
}
lambda <- fit_exp(M[1:n, ], columns)
lambda
## VARIABLE LAMBDA
## 1 INCOME 1.615559e-05
## 2 HOME_VAL 5.868625e-06
## 3 TRAVTIME 3.510530e-02
## 4 BLUEBOOK 7.037347e-05
## 5 TIF 2.298161e-01
## 6 MVR_PTS 5.897955e-01
Five of the six potential variables lend themselves to modeling with an exponential distribution. Each variable was shifted to sit just above zero by subtracting its minimum value and adding \(10^{-32}\). This shift also handles data with a negative minimum, since subtracting a negative minimum is equivalent to adding its absolute value.
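To see the shift in action, here is a tiny sketch on hypothetical values with a negative minimum:
x <- c(-5, 0, 12)             # hypothetical values, negative minimum
shifted <- x - min(x) + 1e-32 # the same shift used above
min(shifted)                  # 1e-32: strictly positive, as fitdistr("Exponential") requires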
M[, "log_INCOME"] <- log(M[, "INCOME"] - min(M[, "INCOME"]) + 1e-32, lambda[1,2])
M[, "log_TRAVTIME"] <- log(M[, "TRAVTIME"] - min(M[, "TRAVTIME"]) + 1e-32, lambda[3,2])
M[, "log_BLUEBOOK"] <- log(M[, "BLUEBOOK"] - min(M[, "BLUEBOOK"]) + 1e-32, lambda[4,2])
M[, "log_TIF"] <- log(M[, "TIF"] - min(M[, "TIF"]) + 1e-32, lambda[5,2])
M[, "log_MVR_PTS"] <- log(M[, "MVR_PTS"] - min(M[, "MVR_PTS"]) + 1e-32, lambda[6,2])
remove <- c("INCOME", "TRAVTIME", "BLUEBOOK", "TIF", "MVR_PTS")
X <- rbind(X, data.frame("TARGET_FLAG" = rep(T, ncol(M)-nrow(X)),
"TARGET_AMT" = rep(T, ncol(M)-nrow(X))))
X[match(remove, names(M)), ] <- F
par(mfrow=c(2,3))
smoothScatter(M[1:n, "KIDSDRIV"], ylab = "KIDSDRIV")
smoothScatter(M[1:n, "HOMEKIDS"], ylab = "HOMEKIDS")
smoothScatter(M[1:n, "YOJ"], ylab = "YOJ")
smoothScatter(M[1:n, "OLDCLAIM"], ylab = "OLDCLAIM")
smoothScatter(M[1:n, "CLM_FREQ"], ylab = "CLM_FREQ")
smoothScatter(M[1:n, "CAR_AGE"], ylab = "CAR_AGE")
The variables KIDSDRIV, HOMEKIDS, YOJ, OLDCLAIM, CLM_FREQ, and CAR_AGE have bimodal distributions. There are clear lines of demarcation in the values that we can use to bifurcate the variables into categories. The defining value for each of these variables is zero, so each variable is recoded as zero when its value equals zero and one otherwise.
M[,"cat_KIDSDRIV"] <- ifelse(M$KIDSDRIV == 0, 0, 1)
M[,"cat_HOMEKIDS"] <- ifelse(M$HOMEKIDS == 0, 0, 1)
M[,"cat_YOJ"] <- ifelse(M$YOJ == 0, 0, 1)
M[,"cat_OLDCLAIM"] <- ifelse(M$OLDCLAIM == 0, 0, 1)
M[,"cat_CLM_FREQ"] <- ifelse(M$CLM_FREQ == 0, 0, 1)
M[,"cat_CAR_AGE"] <- ifelse(M$CAR_AGE == 0, 0, 1)
remove <- c("KIDSDRIV", "HOMEKIDS", "YOJ", "OLDCLAIM", "CLM_FREQ", "CAR_AGE")
X <- rbind(X, data.frame("TARGET_FLAG" = rep(T, ncol(M)-nrow(X)),
"TARGET_AMT" = rep(T, ncol(M)-nrow(X))))
X[match(remove, names(M)), ] <- F
library(car)
potential <- match(c("AGE"), names(M))
box.cox.powers <- powerTransform(M[1:n, potential], family="bcPower")
summary(box.cox.powers)
## bcPower Transformation to Normality
## Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
## M[1:n, potential] 1.0391 0.0454 0.95 1.1282
##
## Likelihood ratio tests about transformation parameters
## LRT df pval
## LR test, lambda = (0) 559.6780078 1 0.000000
## LR test, lambda = (1) 0.7421689 1 0.388967
The only unexamined variable to which the Box-Cox transformation could reasonably be applied is the quantitative variable AGE. However, upon examination, AGE returns an estimated power close to one, which indicates that no transformation is necessary. This is further supported by the Wald confidence bounds, which include the value of one.
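For reference, the Box-Cox family used here is \(x^{(\lambda)} = (x^\lambda - 1)/\lambda\), which at \(\lambda = 1\) reduces to \(x - 1\), a shift that leaves the shape of the distribution unchanged. A sketch applying car::bcPower at the estimated power, had a transformation been warranted:
lambda_hat <- 1.0391 # estimated power from the summary above
head(cbind(AGE = M[1:n, "AGE"], AGE_bc = bcPower(M[1:n, "AGE"], lambda_hat)))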
Using the training data set, build at least two different multiple linear regression (MLR) models and three different binary logistic regression (BLR) models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion in or exclusion from the model, indicate why this was done.
Discuss the coefficients in the models: do they make sense? For example, if a person has a lot of traffic tickets, you would reasonably expect that person to have more car crashes. If the coefficient is negative (suggesting that the person is a safer driver), then that needs to be discussed. Are you keeping the model even though it is counterintuitive? Why? The boss needs to know.
training_AMT <- M[1:n, X[,"TARGET_AMT"]]
training_FLAG <- M[1:n, X[,"TARGET_FLAG"]]
Forward stepwise subset selection based on \(AIC\). Using \(k = 2\) degrees of freedom for the penalty gives the genuine \(AIC\); using \(k = \log(n)\) gives what is sometimes referred to as BIC or SBC.
null <- lm(TARGET_AMT ~ 0, training_AMT)
full <- lm(TARGET_AMT ~ ., training_AMT)
aic_steps <- step(null, scope=list(lower=null, upper=full), direction="forward", k = 2, trace=F)
aic_steps$call
## lm(formula = TARGET_AMT ~ URBANICITY + cat_OLDCLAIM + PARENT1 +
## MANAGER + CAR_USE + cat_CAR_AGE + MINIVAN + log_MVR_PTS +
## MSTATUS + cat_KIDSDRIV + log_TIF + NOHS + HS + REVOKED +
## SPORTS + log_INCOME + log_TRAVTIME - 1, data = training_AMT)
The above model has the lowest AIC.
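For comparison, re-running the same search with \(k = \log(n)\) would apply the BIC penalty; a sketch (not used for the models below), which would generally select a smaller model:
bic_steps <- step(null, scope = list(lower = null, upper = full),
                  direction = "forward", k = log(n), trace = FALSE)
bic_steps$call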
forward_AMT <- lm(aic_steps$call, training_AMT)
round(coef(summary(forward_AMT)), 6)
## Estimate Std. Error t value Pr(>|t|)
## URBANICITY 1505.085315 136.313068 11.041387 0.000000
## cat_OLDCLAIM 552.793025 114.474396 4.828967 0.000001
## PARENT1 640.172763 177.545247 3.605688 0.000313
## MANAGER -851.066664 162.431235 -5.239551 0.000000
## CAR_USE -785.064181 111.841871 -7.019412 0.000000
## cat_CAR_AGE 761.221081 182.601816 4.168749 0.000031
## MINIVAN -470.879470 121.461053 -3.876794 0.000107
## log_MVR_PTS -2.769052 0.757178 -3.657068 0.000257
## MSTATUS -585.805201 119.630701 -4.896780 0.000001
## cat_KIDSDRIV 692.326953 162.330292 4.264928 0.000020
## log_TIF 6.991248 2.130387 3.281680 0.001036
## NOHS 657.591808 153.642482 4.280013 0.000019
## HS 470.297182 120.908400 3.889698 0.000101
## REVOKED 471.513352 155.133315 3.039407 0.002378
## SPORTS 321.992205 168.356009 1.912567 0.055839
## log_INCOME 45.990011 25.543071 1.800489 0.071820
## log_TRAVTIME -18.650708 11.105457 -1.679418 0.093109
At a significance level of \(\alpha=0.05\), the 17 forward-selected variables yield a TARGET_AMT MLR model with three insignificant variables: SPORTS, log_INCOME, and log_TRAVTIME. Removing those three variables yields a model in which all remaining variables are significant.
forward_AMT <- lm(TARGET_AMT ~ URBANICITY + cat_OLDCLAIM + PARENT1 +
MANAGER + CAR_USE + MINIVAN + log_MVR_PTS + MSTATUS + cat_CAR_AGE +
cat_KIDSDRIV + log_TIF + NOHS + HS + REVOKED - 1, training_AMT)
summary(forward_AMT)
##
## Call:
## lm(formula = TARGET_AMT ~ URBANICITY + cat_OLDCLAIM + PARENT1 +
## MANAGER + CAR_USE + MINIVAN + log_MVR_PTS + MSTATUS + cat_CAR_AGE +
## cat_KIDSDRIV + log_TIF + NOHS + HS + REVOKED - 1, data = training_AMT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4997 -1694 -818 354 104984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## URBANICITY 1469.0432 135.8182 10.816 < 2e-16 ***
## cat_OLDCLAIM 572.1325 114.3497 5.003 5.75e-07 ***
## PARENT1 655.2860 177.4451 3.693 0.000223 ***
## MANAGER -884.0253 162.0108 -5.457 5.00e-08 ***
## CAR_USE -729.9259 109.4039 -6.672 2.69e-11 ***
## MINIVAN -540.1926 117.7244 -4.589 4.53e-06 ***
## log_MVR_PTS -2.8047 0.7574 -3.703 0.000214 ***
## MSTATUS -581.0082 119.6464 -4.856 1.22e-06 ***
## cat_CAR_AGE 772.8248 181.7666 4.252 2.14e-05 ***
## cat_KIDSDRIV 683.0515 162.3725 4.207 2.62e-05 ***
## log_TIF 7.1675 2.1301 3.365 0.000769 ***
## NOHS 676.0476 153.2975 4.410 1.05e-05 ***
## HS 496.4631 120.3608 4.125 3.75e-05 ***
## REVOKED 474.5687 155.1805 3.058 0.002234 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4560 on 8147 degrees of freedom
## Multiple R-squared: 0.1488, Adjusted R-squared: 0.1473
## F-statistic: 101.7 on 14 and 8147 DF, p-value: < 2.2e-16
null <- glm(TARGET_FLAG ~ 0, family = binomial(link = "logit"), training_FLAG)
full <- glm(TARGET_FLAG ~ ., family = binomial(link = "logit"), training_FLAG)
aic_steps <- step(null, scope=list(lower=null, upper=full), direction="forward", k = 2, trace=F)
aic_steps$aic
## [1] 7500.651
aic_steps$formula
## TARGET_FLAG ~ AGE + URBANICITY + cat_OLDCLAIM + CAR_USE + cat_YOJ +
## MSTATUS + MANAGER + MINIVAN + REVOKED + cat_KIDSDRIV + HS +
## NOHS + cat_CAR_AGE + log_TIF + log_MVR_PTS + cat_HOMEKIDS +
## SPORTS + log_TRAVTIME + SUV + PHD + log_INCOME + CLERICAL +
## PICKUP + log_BLUEBOOK + BLUE_COLLAR + PARENT1 - 1
The above model has the lowest AIC.
forward_FLAG <- glm(aic_steps$formula, family = binomial(link = "logit"), training_FLAG)
round(coef(summary(forward_FLAG)), 6)
## Estimate Std. Error z value Pr(>|z|)
## AGE -0.003424 0.003925 -0.872196 0.383101
## URBANICITY 2.224240 0.110518 20.125582 0.000000
## cat_OLDCLAIM 0.553906 0.060688 9.127162 0.000000
## CAR_USE -0.729315 0.081633 -8.934078 0.000000
## cat_YOJ -0.000101 0.283764 -0.000355 0.999717
## MSTATUS -0.692176 0.074923 -9.238513 0.000000
## MANAGER -0.712284 0.107968 -6.597173 0.000000
## MINIVAN -0.418051 0.105548 -3.960787 0.000075
## REVOKED 0.729757 0.079316 9.200623 0.000000
## cat_KIDSDRIV 0.556026 0.096696 5.750221 0.000000
## HS 0.513970 0.075100 6.843797 0.000000
## NOHS 0.546576 0.097037 5.632669 0.000000
## cat_CAR_AGE -2.817303 0.329517 -8.549807 0.000000
## log_TIF 0.007683 0.001190 6.459280 0.000000
## log_MVR_PTS -0.002206 0.000437 -5.052844 0.000000
## cat_HOMEKIDS 0.224665 0.096231 2.334635 0.019563
## SPORTS 0.636551 0.119359 5.333084 0.000000
## log_TRAVTIME -0.028562 0.007067 -4.041575 0.000053
## SUV 0.384536 0.100159 3.839250 0.000123
## PHD -0.314215 0.115890 -2.711326 0.006701
## log_INCOME 0.106774 0.037340 2.859542 0.004243
## CLERICAL 0.325682 0.097022 3.356792 0.000789
## PICKUP 0.223960 0.095602 2.342627 0.019149
## log_BLUEBOOK 0.061436 0.023980 2.561970 0.010408
## BLUE_COLLAR 0.188304 0.090050 2.091115 0.036518
## PARENT1 0.215201 0.118922 1.809608 0.070357
At a significance level of \(\alpha=0.05\), the 26 forward-selected variables yield a TARGET_FLAG BLR model with three insignificant variables: AGE, cat_YOJ, and PARENT1. Removing those three variables yields a model in which all remaining variables are significant.
forward_FLAG <- glm(TARGET_FLAG ~ URBANICITY + cat_OLDCLAIM + CAR_USE +
MSTATUS + MANAGER + MINIVAN + REVOKED + cat_KIDSDRIV + HS +
NOHS + cat_CAR_AGE + log_TIF + log_MVR_PTS + cat_HOMEKIDS +
SPORTS + log_TRAVTIME + SUV + PHD + log_INCOME + CLERICAL +
PICKUP + log_BLUEBOOK + BLUE_COLLAR - 1, family = binomial(link = "logit"), training_FLAG)
summary(forward_FLAG)
##
## Call:
## glm(formula = TARGET_FLAG ~ URBANICITY + cat_OLDCLAIM + CAR_USE +
## MSTATUS + MANAGER + MINIVAN + REVOKED + cat_KIDSDRIV + HS +
## NOHS + cat_CAR_AGE + log_TIF + log_MVR_PTS + cat_HOMEKIDS +
## SPORTS + log_TRAVTIME + SUV + PHD + log_INCOME + CLERICAL +
## PICKUP + log_BLUEBOOK + BLUE_COLLAR - 1, family = binomial(link = "logit"),
## data = training_FLAG)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1946 -0.7273 -0.4224 0.6889 3.1402
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## URBANICITY 2.2185007 0.1101897 20.133 < 2e-16 ***
## cat_OLDCLAIM 0.5511070 0.0606483 9.087 < 2e-16 ***
## CAR_USE -0.7312157 0.0815656 -8.965 < 2e-16 ***
## MSTATUS -0.7827185 0.0585965 -13.358 < 2e-16 ***
## MANAGER -0.7090159 0.1077946 -6.577 4.79e-11 ***
## MINIVAN -0.4156174 0.1055195 -3.939 8.19e-05 ***
## REVOKED 0.7302799 0.0793086 9.208 < 2e-16 ***
## cat_KIDSDRIV 0.5302165 0.0941368 5.632 1.78e-08 ***
## HS 0.5146327 0.0749761 6.864 6.70e-12 ***
## NOHS 0.5475820 0.0968129 5.656 1.55e-08 ***
## cat_CAR_AGE -2.9259596 0.1395232 -20.971 < 2e-16 ***
## log_TIF 0.0076797 0.0011890 6.459 1.06e-10 ***
## log_MVR_PTS -0.0022271 0.0004362 -5.106 3.29e-07 ***
## cat_HOMEKIDS 0.3568313 0.0683715 5.219 1.80e-07 ***
## SPORTS 0.6357416 0.1191917 5.334 9.62e-08 ***
## log_TRAVTIME -0.0283289 0.0070487 -4.019 5.84e-05 ***
## SUV 0.3844424 0.1001565 3.838 0.000124 ***
## PHD -0.3247954 0.1155157 -2.812 0.004928 **
## log_INCOME 0.1076736 0.0150112 7.173 7.34e-13 ***
## CLERICAL 0.3327295 0.0966097 3.444 0.000573 ***
## PICKUP 0.2252018 0.0956307 2.355 0.018527 *
## log_BLUEBOOK 0.0618683 0.0239369 2.585 0.009748 **
## BLUE_COLLAR 0.1911011 0.0899125 2.125 0.033552 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11313.5 on 8161 degrees of freedom
## Residual deviance: 7452.6 on 8138 degrees of freedom
## AIC: 7498.6
##
## Number of Fisher Scoring iterations: 5
Backward stepwise subset elimination based on \(AIC\). As above, using \(k = 2\) degrees of freedom for the penalty gives the genuine \(AIC\); using \(k = \log(n)\) gives what is sometimes referred to as BIC or SBC.
null <- lm(TARGET_AMT ~ 0, training_AMT)
full <- lm(TARGET_AMT ~ ., training_AMT)
aic_steps <- step(full, scope=list(lower=null, upper=full), direction="backward", k = 2, trace=F)
aic_steps$call
## lm(formula = TARGET_AMT ~ PARENT1 + MSTATUS + CAR_USE + REVOKED +
## URBANICITY + HS + NOHS + MANAGER + MINIVAN + SPORTS + log_INCOME +
## log_TRAVTIME + log_TIF + log_MVR_PTS + cat_KIDSDRIV + cat_OLDCLAIM,
## data = training_AMT)
The above model has the lowest AIC.
backward_AMT <- lm(aic_steps$call, training_AMT)
round(coef(summary(backward_AMT)), 6)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 762.784417 183.151125 4.164782 0.000031
## PARENT1 640.018132 177.569345 3.604328 0.000315
## MSTATUS -586.551921 119.723526 -4.899220 0.000001
## CAR_USE -785.707613 111.900046 -7.021513 0.000000
## REVOKED 470.841555 155.143313 3.034881 0.002414
## URBANICITY 1505.218992 136.333458 11.040716 0.000000
## HS 470.034608 120.940482 3.886495 0.000103
## NOHS 657.777887 153.639789 4.281299 0.000019
## MANAGER -851.651499 162.435546 -5.243012 0.000000
## MINIVAN -471.449326 121.477290 -3.880967 0.000105
## SPORTS 322.212013 168.353540 1.913901 0.055668
## log_INCOME 46.035671 25.544003 1.802210 0.071549
## log_TRAVTIME -18.636693 11.105620 -1.678132 0.093360
## log_TIF 6.989195 2.130538 3.280483 0.001041
## log_MVR_PTS -2.773052 0.757597 -3.660325 0.000253
## cat_KIDSDRIV 691.794967 162.331388 4.261622 0.000021
## cat_OLDCLAIM 552.008131 114.513722 4.820454 0.000001
At a significance level of \(\alpha=0.05\), the 16 backward-selected variables yield a TARGET_AMT MLR model with three insignificant variables: SPORTS, log_INCOME, and log_TRAVTIME. Removing those three variables yields a model in which all remaining variables are significant.
backward_AMT <- lm(TARGET_AMT ~ PARENT1 + MSTATUS + CAR_USE + REVOKED +
URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF +
log_MVR_PTS + cat_KIDSDRIV + cat_OLDCLAIM, training_AMT)
summary(backward_AMT)
##
## Call:
## lm(formula = TARGET_AMT ~ PARENT1 + MSTATUS + CAR_USE + REVOKED +
## URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF + log_MVR_PTS +
## cat_KIDSDRIV + cat_OLDCLAIM, data = training_AMT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4997 -1694 -819 354 104984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 774.2430 182.3082 4.247 2.19e-05 ***
## PARENT1 655.1922 177.4678 3.692 0.000224 ***
## MSTATUS -581.7175 119.7388 -4.858 1.21e-06 ***
## CAR_USE -730.5026 109.4617 -6.674 2.66e-11 ***
## REVOKED 473.8986 155.1903 3.054 0.002268 **
## URBANICITY 1469.2228 135.8413 10.816 < 2e-16 ***
## HS 496.2580 120.3904 4.122 3.79e-05 ***
## NOHS 676.2812 153.2939 4.412 1.04e-05 ***
## MANAGER -884.6343 162.0158 -5.460 4.90e-08 ***
## MINIVAN -540.8132 117.7404 -4.593 4.43e-06 ***
## log_TIF 7.1657 2.1302 3.364 0.000772 ***
## log_MVR_PTS -2.8086 0.7578 -3.706 0.000212 ***
## cat_KIDSDRIV 682.5059 162.3736 4.203 2.66e-05 ***
## cat_OLDCLAIM 571.3661 114.3887 4.995 6.01e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4560 on 8147 degrees of freedom
## Multiple R-squared: 0.0617, Adjusted R-squared: 0.0602
## F-statistic: 41.21 on 13 and 8147 DF, p-value: < 2.2e-16
null <- glm(TARGET_FLAG ~ 0, family = binomial(link = "logit"), training_FLAG)
full <- glm(TARGET_FLAG ~ ., family = binomial(link = "logit"), training_FLAG)
aic_steps <- step(full, scope=list(lower=null, upper=full), direction="backward", k = 2, trace=F)
aic_steps$formula
## TARGET_FLAG ~ PARENT1 + MSTATUS + CAR_USE + REVOKED + URBANICITY +
## PHD + MASTERS + BACHELORS + CLERICAL + MANAGER + STUDENT +
## BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME +
## log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV +
## cat_HOMEKIDS + cat_OLDCLAIM
The above model has the lowest AIC.
backward_FLAG <- glm(aic_steps$formula, family = binomial(link = "logit"), training_FLAG)
round(coef(summary(backward_FLAG)), 6)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.554923 0.160423 -15.926206 0.000000
## PARENT1 0.217757 0.119008 1.829758 0.067286
## MSTATUS -0.694028 0.074400 -9.328352 0.000000
## CAR_USE -0.688893 0.080778 -8.528279 0.000000
## REVOKED 0.733365 0.079366 9.240333 0.000000
## URBANICITY 2.249695 0.111388 20.196878 0.000000
## PHD -0.791338 0.133765 -5.915870 0.000000
## MASTERS -0.508002 0.105529 -4.813852 0.000001
## BACHELORS -0.485927 0.077640 -6.258734 0.000000
## CLERICAL 0.382111 0.106668 3.582245 0.000341
## MANAGER -0.690384 0.109689 -6.293993 0.000000
## STUDENT 0.184715 0.126539 1.459743 0.144361
## BLUE_COLLAR 0.250666 0.103458 2.422876 0.015398
## MINIVAN -0.435240 0.105575 -4.122578 0.000037
## PICKUP 0.219309 0.096226 2.279105 0.022661
## SPORTS 0.616909 0.118944 5.186530 0.000000
## SUV 0.364174 0.099838 3.647647 0.000265
## log_INCOME 0.100999 0.015677 6.442715 0.000000
## log_TRAVTIME -0.028743 0.007072 -4.064134 0.000048
## log_BLUEBOOK 0.059508 0.024005 2.479025 0.013174
## log_TIF 0.007729 0.001190 6.493015 0.000000
## log_MVR_PTS -0.002189 0.000437 -5.008679 0.000001
## cat_KIDSDRIV 0.542256 0.094714 5.725173 0.000000
## cat_HOMEKIDS 0.252943 0.087625 2.886646 0.003894
## cat_OLDCLAIM 0.557205 0.060751 9.171972 0.000000
At a significance level of \(\alpha=0.05\), the 24 backward-selected variables yield a TARGET_FLAG BLR model with two insignificant variables: PARENT1 and STUDENT. Removing those two variables yields a model in which all remaining variables are significant.
backward_FLAG <- glm(TARGET_FLAG ~ MSTATUS + CAR_USE + REVOKED + URBANICITY +
PHD + MASTERS + BACHELORS + CLERICAL + MANAGER +
BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME +
log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV +
cat_HOMEKIDS + cat_OLDCLAIM, family = binomial(link = "logit"), training_FLAG)
summary(backward_FLAG)
##
## Call:
## glm(formula = TARGET_FLAG ~ MSTATUS + CAR_USE + REVOKED + URBANICITY +
## PHD + MASTERS + BACHELORS + CLERICAL + MANAGER + BLUE_COLLAR +
## MINIVAN + PICKUP + SPORTS + SUV + log_INCOME + log_TRAVTIME +
## log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV + cat_HOMEKIDS +
## cat_OLDCLAIM, family = binomial(link = "logit"), data = training_FLAG)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2016 -0.7231 -0.4212 0.6891 3.1351
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4312711 0.1486779 -16.353 < 2e-16 ***
## MSTATUS -0.7777777 0.0586439 -13.263 < 2e-16 ***
## CAR_USE -0.7155003 0.0787814 -9.082 < 2e-16 ***
## REVOKED 0.7343615 0.0793348 9.256 < 2e-16 ***
## URBANICITY 2.2363335 0.1109238 20.161 < 2e-16 ***
## PHD -0.8497653 0.1275975 -6.660 2.74e-11 ***
## MASTERS -0.5639725 0.0979155 -5.760 8.42e-09 ***
## BACHELORS -0.5101645 0.0752438 -6.780 1.20e-11 ***
## CLERICAL 0.3256558 0.0990172 3.289 0.00101 **
## MANAGER -0.7098209 0.1082712 -6.556 5.53e-11 ***
## BLUE_COLLAR 0.1852830 0.0928333 1.996 0.04595 *
## MINIVAN -0.4147417 0.1048748 -3.955 7.67e-05 ***
## PICKUP 0.2317109 0.0958995 2.416 0.01568 *
## SPORTS 0.6331507 0.1186074 5.338 9.39e-08 ***
## SUV 0.3802718 0.0993833 3.826 0.00013 ***
## log_INCOME 0.1067877 0.0151806 7.034 2.00e-12 ***
## log_TRAVTIME -0.0285046 0.0070522 -4.042 5.30e-05 ***
## log_BLUEBOOK 0.0604612 0.0239746 2.522 0.01167 *
## log_TIF 0.0077168 0.0011898 6.486 8.82e-11 ***
## log_MVR_PTS -0.0021850 0.0004367 -5.004 5.63e-07 ***
## cat_KIDSDRIV 0.5297128 0.0942272 5.622 1.89e-08 ***
## cat_HOMEKIDS 0.3609836 0.0685862 5.263 1.42e-07 ***
## cat_OLDCLAIM 0.5555339 0.0607168 9.150 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7446.2 on 8138 degrees of freedom
## AIC: 7492.2
##
## Number of Fisher Scoring iterations: 5
library(leaps)
model_sum_AMT <- summary(regsubsets(TARGET_AMT ~ ., training_AMT, nvmax=ncol(training_AMT)))
## Reordering variables and trying again:
model_sum_FLAG <- summary(regsubsets(TARGET_FLAG ~ ., training_FLAG, nvmax=ncol(training_FLAG)))
## Reordering variables and trying again:
par(mfrow=c(1,2))
plot(model_sum_AMT$adjr2, xlab = "Number of Variables", ylab = "Adj R-squared", main="TARGET_AMT")
plot(model_sum_FLAG$adjr2, xlab = "Number of Variables", ylab = "Adj R-squared", main="TARGET_FLAG")
cbind(max(model_sum_AMT$adjr2), which.max(model_sum_AMT$adjr2))
## [,1] [,2]
## [1,] 0.06123634 19
cbind(max(model_sum_FLAG$adjr2), which.max(model_sum_FLAG$adjr2))
## [,1] [,2]
## [1,] 0.211793 24
The maximum Adjusted \(R^2\) of 0.0612363 for the model predicting TARGET_AMT is reached when the model contains 19 variables.
model_sum_AMT$which[which.max(model_sum_AMT$adjr2), ]
## (Intercept) AGE PARENT1 MSTATUS CAR_USE
## TRUE FALSE TRUE TRUE TRUE
## REVOKED URBANICITY PHD MASTERS HS
## TRUE TRUE TRUE FALSE TRUE
## NOHS MANAGER STUDENT BLUE_COLLAR MINIVAN
## TRUE TRUE FALSE FALSE TRUE
## TRUCK PICKUP SPORTS VAN log_INCOME
## FALSE TRUE TRUE FALSE TRUE
## log_TRAVTIME log_BLUEBOOK log_TIF log_MVR_PTS cat_KIDSDRIV
## TRUE FALSE TRUE TRUE TRUE
## cat_HOMEKIDS cat_YOJ cat_OLDCLAIM cat_CLM_FREQ cat_CAR_AGE
## TRUE FALSE FALSE TRUE FALSE
adjustedr2_AMT <- lm(TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE +
REVOKED + URBANICITY + PHD + HS + NOHS + MANAGER + MINIVAN + PICKUP +
SPORTS + log_INCOME + log_TRAVTIME + log_TIF + log_MVR_PTS +
cat_KIDSDRIV + cat_HOMEKIDS + cat_CLM_FREQ, training_AMT)
round(coef(summary(adjustedr2_AMT)), 6)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 855.350195 190.724913 4.484732 0.000007
## PARENT1 477.258229 217.200288 2.197319 0.028026
## MSTATUS -647.893956 127.782128 -5.070302 0.000000
## CAR_USE -803.418566 113.162687 -7.099677 0.000000
## REVOKED 464.971703 155.173002 2.996473 0.002739
## URBANICITY 1524.033915 136.731529 11.146178 0.000000
## PHD -252.479571 185.347297 -1.362197 0.173173
## HS 425.653740 124.349425 3.423046 0.000622
## NOHS 616.658877 157.387319 3.918098 0.000090
## MANAGER -843.520474 162.559687 -5.188989 0.000000
## MINIVAN -514.629399 126.582870 -4.065553 0.000048
## PICKUP -165.005969 145.517170 -1.133928 0.256858
## SPORTS 275.376716 171.928269 1.601695 0.109262
## log_INCOME 45.399087 25.548734 1.776960 0.075612
## log_TRAVTIME -18.536784 11.106373 -1.669022 0.095151
## log_TIF 7.028143 2.130404 3.298971 0.000975
## log_MVR_PTS -2.790571 0.757820 -3.682366 0.000233
## cat_KIDSDRIV 601.792137 177.410599 3.392087 0.000697
## cat_HOMEKIDS 179.571417 150.040877 1.196817 0.231413
## cat_CLM_FREQ 540.063343 114.635246 4.711146 0.000003
At a significance level of \(\alpha=0.05\), the \(R_{adj}^2\)-selected variables yield a TARGET_AMT MLR model with six insignificant variables: PHD, PICKUP, SPORTS, log_INCOME, log_TRAVTIME, and cat_HOMEKIDS. Removing those six variables yields a model in which all remaining variables are significant.
adjustedr2_AMT <- lm(TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE +
REVOKED + URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF +
log_MVR_PTS + cat_KIDSDRIV + cat_CLM_FREQ, training_AMT)
summary(adjustedr2_AMT)
##
## Call:
## lm(formula = TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE + REVOKED +
## URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF + log_MVR_PTS +
## cat_KIDSDRIV + cat_CLM_FREQ, data = training_AMT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4997 -1694 -819 354 104984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 774.2430 182.3082 4.247 2.19e-05 ***
## PARENT1 655.1922 177.4678 3.692 0.000224 ***
## MSTATUS -581.7175 119.7388 -4.858 1.21e-06 ***
## CAR_USE -730.5026 109.4617 -6.674 2.66e-11 ***
## REVOKED 473.8986 155.1903 3.054 0.002268 **
## URBANICITY 1469.2228 135.8413 10.816 < 2e-16 ***
## HS 496.2580 120.3904 4.122 3.79e-05 ***
## NOHS 676.2812 153.2939 4.412 1.04e-05 ***
## MANAGER -884.6343 162.0158 -5.460 4.90e-08 ***
## MINIVAN -540.8132 117.7404 -4.593 4.43e-06 ***
## log_TIF 7.1657 2.1302 3.364 0.000772 ***
## log_MVR_PTS -2.8086 0.7578 -3.706 0.000212 ***
## cat_KIDSDRIV 682.5059 162.3736 4.203 2.66e-05 ***
## cat_CLM_FREQ 571.3661 114.3887 4.995 6.01e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4560 on 8147 degrees of freedom
## Multiple R-squared: 0.0617, Adjusted R-squared: 0.0602
## F-statistic: 41.21 on 13 and 8147 DF, p-value: < 2.2e-16
This model is identical to the model derived using Backward Elimination.
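This can be verified directly (a sketch). The only nominal difference is cat_CLM_FREQ in place of cat_OLDCLAIM; the matching coefficient tables above suggest the two indicators coincide, since OLDCLAIM is zero exactly when CLM_FREQ is zero:
all(M$cat_OLDCLAIM == M$cat_CLM_FREQ) # the two indicators agree
all.equal(unname(coef(backward_AMT)), unname(coef(adjustedr2_AMT)))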
The maximum Adjusted \(R^2\) of 0.211793 for the model predicting TARGET_FLAG is reached when the model contains 24 variables.
model_sum_FLAG$which[which.max(model_sum_FLAG$adjr2), ]
## (Intercept) AGE PARENT1 MSTATUS CAR_USE
## TRUE FALSE TRUE TRUE TRUE
## REVOKED URBANICITY PHD MASTERS BACHELORS
## TRUE TRUE TRUE TRUE TRUE
## HS NOHS CLERICAL MANAGER PROF
## FALSE FALSE TRUE TRUE FALSE
## STUDENT BLUE_COLLAR MINIVAN PICKUP SPORTS
## TRUE TRUE TRUE TRUE TRUE
## SUV log_INCOME log_TRAVTIME log_BLUEBOOK log_TIF
## TRUE TRUE TRUE TRUE TRUE
## log_MVR_PTS cat_KIDSDRIV cat_HOMEKIDS cat_YOJ cat_OLDCLAIM
## TRUE TRUE TRUE FALSE TRUE
## cat_CLM_FREQ cat_CAR_AGE
## FALSE FALSE
adjustedr2_FLAG <- glm(TARGET_FLAG ~ 1 + PARENT1 + MSTATUS + CAR_USE +
REVOKED + URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL + MANAGER +
STUDENT + BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME +
log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV +
cat_HOMEKIDS + cat_CLM_FREQ, family = binomial(link = "logit"), training_FLAG)
round(coef(summary(adjustedr2_FLAG)), 6)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.554923 0.160423 -15.926206 0.000000
## PARENT1 0.217757 0.119008 1.829758 0.067286
## MSTATUS -0.694028 0.074400 -9.328352 0.000000
## CAR_USE -0.688893 0.080778 -8.528279 0.000000
## REVOKED 0.733365 0.079366 9.240333 0.000000
## URBANICITY 2.249695 0.111388 20.196878 0.000000
## PHD -0.791338 0.133765 -5.915870 0.000000
## MASTERS -0.508002 0.105529 -4.813852 0.000001
## BACHELORS -0.485927 0.077640 -6.258734 0.000000
## CLERICAL 0.382111 0.106668 3.582245 0.000341
## MANAGER -0.690384 0.109689 -6.293993 0.000000
## STUDENT 0.184715 0.126539 1.459743 0.144361
## BLUE_COLLAR 0.250666 0.103458 2.422876 0.015398
## MINIVAN -0.435240 0.105575 -4.122578 0.000037
## PICKUP 0.219309 0.096226 2.279105 0.022661
## SPORTS 0.616909 0.118944 5.186530 0.000000
## SUV 0.364174 0.099838 3.647647 0.000265
## log_INCOME 0.100999 0.015677 6.442715 0.000000
## log_TRAVTIME -0.028743 0.007072 -4.064134 0.000048
## log_BLUEBOOK 0.059508 0.024005 2.479025 0.013174
## log_TIF 0.007729 0.001190 6.493015 0.000000
## log_MVR_PTS -0.002189 0.000437 -5.008679 0.000001
## cat_KIDSDRIV 0.542256 0.094714 5.725173 0.000000
## cat_HOMEKIDS 0.252943 0.087625 2.886646 0.003894
## cat_CLM_FREQ 0.557205 0.060751 9.171972 0.000000
At a significance level of \(\alpha=0.05\), the \(R_{adj}^2\)-selected variables yield a TARGET_FLAG BLR model with two insignificant variables: PARENT1 and STUDENT. Removing those two variables yields a model in which all remaining variables are significant.
adjustedr2_FLAG <- glm(TARGET_FLAG ~ 1 + MSTATUS + CAR_USE + REVOKED +
URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL + MANAGER +
BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME +
log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV +
cat_HOMEKIDS + cat_CLM_FREQ, family = binomial(link = "logit"), training_FLAG)
summary(adjustedr2_FLAG)
##
## Call:
## glm(formula = TARGET_FLAG ~ 1 + MSTATUS + CAR_USE + REVOKED +
## URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL + MANAGER +
## BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME +
## log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV +
## cat_HOMEKIDS + cat_CLM_FREQ, family = binomial(link = "logit"),
## data = training_FLAG)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2016 -0.7231 -0.4212 0.6891 3.1351
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4312711 0.1486779 -16.353 < 2e-16 ***
## MSTATUS -0.7777777 0.0586439 -13.263 < 2e-16 ***
## CAR_USE -0.7155003 0.0787814 -9.082 < 2e-16 ***
## REVOKED 0.7343615 0.0793348 9.256 < 2e-16 ***
## URBANICITY 2.2363335 0.1109238 20.161 < 2e-16 ***
## PHD -0.8497653 0.1275975 -6.660 2.74e-11 ***
## MASTERS -0.5639725 0.0979155 -5.760 8.42e-09 ***
## BACHELORS -0.5101645 0.0752438 -6.780 1.20e-11 ***
## CLERICAL 0.3256558 0.0990172 3.289 0.00101 **
## MANAGER -0.7098209 0.1082712 -6.556 5.53e-11 ***
## BLUE_COLLAR 0.1852830 0.0928333 1.996 0.04595 *
## MINIVAN -0.4147417 0.1048748 -3.955 7.67e-05 ***
## PICKUP 0.2317109 0.0958995 2.416 0.01568 *
## SPORTS 0.6331507 0.1186074 5.338 9.39e-08 ***
## SUV 0.3802718 0.0993833 3.826 0.00013 ***
## log_INCOME 0.1067877 0.0151806 7.034 2.00e-12 ***
## log_TRAVTIME -0.0285046 0.0070522 -4.042 5.30e-05 ***
## log_BLUEBOOK 0.0604612 0.0239746 2.522 0.01167 *
## log_TIF 0.0077168 0.0011898 6.486 8.82e-11 ***
## log_MVR_PTS -0.0021850 0.0004367 -5.004 5.63e-07 ***
## cat_KIDSDRIV 0.5297128 0.0942272 5.622 1.89e-08 ***
## cat_HOMEKIDS 0.3609836 0.0685862 5.263 1.42e-07 ***
## cat_CLM_FREQ 0.5555339 0.0607168 9.150 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7446.2 on 8138 degrees of freedom
## AIC: 7492.2
##
## Number of Fisher Scoring iterations: 5
This model is also identical to the model derived using Backward Elimination.
par(mfrow=c(1,2))
plot(model_sum_AMT$cp, xlab = "Number of Variables", ylab = "Mallows Cp", main="TARGET_AMT")
plot(model_sum_FLAG$cp, xlab = "Number of Variables", ylab = "Mallows Cp", main="TARGET_FLAG")
cbind(min(model_sum_AMT$cp), which.min(model_sum_AMT$cp))
## [,1] [,2]
## [1,] 12.81883 16
cbind(min(model_sum_FLAG$cp), which.min(model_sum_FLAG$cp))
## [,1] [,2]
## [1,] 19.25552 24
The minimum Mallows \(C_p\) of 12.8188345 for the model predicting TARGET_AMT is reached when the model contains 16 variables.
model_sum_AMT$which[which.min(model_sum_AMT$cp), ]
## (Intercept) AGE PARENT1 MSTATUS CAR_USE
## TRUE FALSE TRUE TRUE TRUE
## REVOKED URBANICITY PHD MASTERS HS
## TRUE TRUE FALSE FALSE TRUE
## NOHS MANAGER STUDENT BLUE_COLLAR MINIVAN
## TRUE TRUE FALSE FALSE TRUE
## TRUCK PICKUP SPORTS VAN log_INCOME
## FALSE FALSE TRUE FALSE TRUE
## log_TRAVTIME log_BLUEBOOK log_TIF log_MVR_PTS cat_KIDSDRIV
## TRUE FALSE TRUE TRUE TRUE
## cat_HOMEKIDS cat_YOJ cat_OLDCLAIM cat_CLM_FREQ cat_CAR_AGE
## FALSE FALSE TRUE FALSE FALSE
mallowscp_AMT <- lm(TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE +
REVOKED + URBANICITY + HS + NOHS + MANAGER + MINIVAN + SPORTS +
log_INCOME + log_TRAVTIME + log_TIF + log_MVR_PTS +
cat_KIDSDRIV + cat_CLM_FREQ, training_AMT)
round(coef(summary(mallowscp_AMT)), 6)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 762.784417 183.151125 4.164782 0.000031
## PARENT1 640.018132 177.569345 3.604328 0.000315
## MSTATUS -586.551921 119.723526 -4.899220 0.000001
## CAR_USE -785.707613 111.900046 -7.021513 0.000000
## REVOKED 470.841555 155.143313 3.034881 0.002414
## URBANICITY 1505.218992 136.333458 11.040716 0.000000
## HS 470.034608 120.940482 3.886495 0.000103
## NOHS 657.777887 153.639789 4.281299 0.000019
## MANAGER -851.651499 162.435546 -5.243012 0.000000
## MINIVAN -471.449326 121.477290 -3.880967 0.000105
## SPORTS 322.212013 168.353540 1.913901 0.055668
## log_INCOME 46.035671 25.544003 1.802210 0.071549
## log_TRAVTIME -18.636693 11.105620 -1.678132 0.093360
## log_TIF 6.989195 2.130538 3.280483 0.001041
## log_MVR_PTS -2.773052 0.757597 -3.660325 0.000253
## cat_KIDSDRIV 691.794967 162.331388 4.261622 0.000021
## cat_CLM_FREQ 552.008131 114.513722 4.820454 0.000001
At a significance level of \(\alpha=0.05\), the Mallows \(C_p\)-selected variables yield a TARGET_AMT MLR model with three insignificant variables: SPORTS, log_INCOME, and log_TRAVTIME. Removing those three variables yields a model in which all remaining variables are significant.
mallowscp_AMT <- lm(TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE +
REVOKED + URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF +
log_MVR_PTS + cat_KIDSDRIV + cat_CLM_FREQ, training_AMT)
summary(mallowscp_AMT)
##
## Call:
## lm(formula = TARGET_AMT ~ 1 + PARENT1 + MSTATUS + CAR_USE + REVOKED +
## URBANICITY + HS + NOHS + MANAGER + MINIVAN + log_TIF + log_MVR_PTS +
## cat_KIDSDRIV + cat_CLM_FREQ, data = training_AMT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4997 -1694 -819 354 104984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 774.2430 182.3082 4.247 2.19e-05 ***
## PARENT1 655.1922 177.4678 3.692 0.000224 ***
## MSTATUS -581.7175 119.7388 -4.858 1.21e-06 ***
## CAR_USE -730.5026 109.4617 -6.674 2.66e-11 ***
## REVOKED 473.8986 155.1903 3.054 0.002268 **
## URBANICITY 1469.2228 135.8413 10.816 < 2e-16 ***
## HS 496.2580 120.3904 4.122 3.79e-05 ***
## NOHS 676.2812 153.2939 4.412 1.04e-05 ***
## MANAGER -884.6343 162.0158 -5.460 4.90e-08 ***
## MINIVAN -540.8132 117.7404 -4.593 4.43e-06 ***
## log_TIF 7.1657 2.1302 3.364 0.000772 ***
## log_MVR_PTS -2.8086 0.7578 -3.706 0.000212 ***
## cat_KIDSDRIV 682.5059 162.3736 4.203 2.66e-05 ***
## cat_CLM_FREQ 571.3661 114.3887 4.995 6.01e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4560 on 8147 degrees of freedom
## Multiple R-squared: 0.0617, Adjusted R-squared: 0.0602
## F-statistic: 41.21 on 13 and 8147 DF, p-value: < 2.2e-16
This model is also identical to the model derived using Backward Elimination.
The minimum Mallows \(C_p\) of 19.2555192 for the model predicting TARGET_FLAG is reached when the model contains 24 variables.
model_sum_FLAG$which[which.min(model_sum_FLAG$cp), ]
## (Intercept) AGE PARENT1 MSTATUS CAR_USE
## TRUE FALSE TRUE TRUE TRUE
## REVOKED URBANICITY PHD MASTERS BACHELORS
## TRUE TRUE TRUE TRUE TRUE
## HS NOHS CLERICAL MANAGER PROF
## FALSE FALSE TRUE TRUE FALSE
## STUDENT BLUE_COLLAR MINIVAN PICKUP SPORTS
## TRUE TRUE TRUE TRUE TRUE
## SUV log_INCOME log_TRAVTIME log_BLUEBOOK log_TIF
## TRUE TRUE TRUE TRUE TRUE
## log_MVR_PTS cat_KIDSDRIV cat_HOMEKIDS cat_YOJ cat_OLDCLAIM
## TRUE TRUE TRUE FALSE TRUE
## cat_CLM_FREQ cat_CAR_AGE
## FALSE FALSE
mallowscp_FLAG <- glm(TARGET_FLAG ~ 1 + PARENT1 + MSTATUS + CAR_USE +
REVOKED + URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL +
MANAGER + STUDENT + BLUE_COLLAR + MINIVAN + PICKUP + SPORTS +
SUV + log_INCOME + log_TRAVTIME + log_BLUEBOOK + log_TIF +
log_MVR_PTS + cat_KIDSDRIV + cat_HOMEKIDS + cat_CLM_FREQ,
family = binomial(link = "logit"), training_FLAG)
round(coef(summary(mallowscp_FLAG)), 6)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.554923 0.160423 -15.926206 0.000000
## PARENT1 0.217757 0.119008 1.829758 0.067286
## MSTATUS -0.694028 0.074400 -9.328352 0.000000
## CAR_USE -0.688893 0.080778 -8.528279 0.000000
## REVOKED 0.733365 0.079366 9.240333 0.000000
## URBANICITY 2.249695 0.111388 20.196878 0.000000
## PHD -0.791338 0.133765 -5.915870 0.000000
## MASTERS -0.508002 0.105529 -4.813852 0.000001
## BACHELORS -0.485927 0.077640 -6.258734 0.000000
## CLERICAL 0.382111 0.106668 3.582245 0.000341
## MANAGER -0.690384 0.109689 -6.293993 0.000000
## STUDENT 0.184715 0.126539 1.459743 0.144361
## BLUE_COLLAR 0.250666 0.103458 2.422876 0.015398
## MINIVAN -0.435240 0.105575 -4.122578 0.000037
## PICKUP 0.219309 0.096226 2.279105 0.022661
## SPORTS 0.616909 0.118944 5.186530 0.000000
## SUV 0.364174 0.099838 3.647647 0.000265
## log_INCOME 0.100999 0.015677 6.442715 0.000000
## log_TRAVTIME -0.028743 0.007072 -4.064134 0.000048
## log_BLUEBOOK 0.059508 0.024005 2.479025 0.013174
## log_TIF 0.007729 0.001190 6.493015 0.000000
## log_MVR_PTS -0.002189 0.000437 -5.008679 0.000001
## cat_KIDSDRIV 0.542256 0.094714 5.725173 0.000000
## cat_HOMEKIDS 0.252943 0.087625 2.886646 0.003894
## cat_CLM_FREQ 0.557205 0.060751 9.171972 0.000000
At a significance level of \(\alpha=0.05\), the Mallows \(C_p\) selected variables yield a TARGET_FLAG BLR model with two insignificant variables: PARENT1 and STUDENT. Removing those two variables yields a model in which all remaining variables are significant.
mallowscp_FLAG <- glm(TARGET_FLAG ~ 1 + MSTATUS + CAR_USE +
REVOKED + URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL +
MANAGER + BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV +
log_INCOME + log_TRAVTIME + log_BLUEBOOK + log_TIF +
log_MVR_PTS + cat_KIDSDRIV + cat_HOMEKIDS + cat_CLM_FREQ,
family = binomial(link = "logit"), training_FLAG)
summary(mallowscp_FLAG)
##
## Call:
## glm(formula = TARGET_FLAG ~ 1 + MSTATUS + CAR_USE + REVOKED +
## URBANICITY + PHD + MASTERS + BACHELORS + CLERICAL + MANAGER +
## BLUE_COLLAR + MINIVAN + PICKUP + SPORTS + SUV + log_INCOME +
## log_TRAVTIME + log_BLUEBOOK + log_TIF + log_MVR_PTS + cat_KIDSDRIV +
## cat_HOMEKIDS + cat_CLM_FREQ, family = binomial(link = "logit"),
## data = training_FLAG)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2016 -0.7231 -0.4212 0.6891 3.1351
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4312711 0.1486779 -16.353 < 2e-16 ***
## MSTATUS -0.7777777 0.0586439 -13.263 < 2e-16 ***
## CAR_USE -0.7155003 0.0787814 -9.082 < 2e-16 ***
## REVOKED 0.7343615 0.0793348 9.256 < 2e-16 ***
## URBANICITY 2.2363335 0.1109238 20.161 < 2e-16 ***
## PHD -0.8497653 0.1275975 -6.660 2.74e-11 ***
## MASTERS -0.5639725 0.0979155 -5.760 8.42e-09 ***
## BACHELORS -0.5101645 0.0752438 -6.780 1.20e-11 ***
## CLERICAL 0.3256558 0.0990172 3.289 0.00101 **
## MANAGER -0.7098209 0.1082712 -6.556 5.53e-11 ***
## BLUE_COLLAR 0.1852830 0.0928333 1.996 0.04595 *
## MINIVAN -0.4147417 0.1048748 -3.955 7.67e-05 ***
## PICKUP 0.2317109 0.0958995 2.416 0.01568 *
## SPORTS 0.6331507 0.1186074 5.338 9.39e-08 ***
## SUV 0.3802718 0.0993833 3.826 0.00013 ***
## log_INCOME 0.1067877 0.0151806 7.034 2.00e-12 ***
## log_TRAVTIME -0.0285046 0.0070522 -4.042 5.30e-05 ***
## log_BLUEBOOK 0.0604612 0.0239746 2.522 0.01167 *
## log_TIF 0.0077168 0.0011898 6.486 8.82e-11 ***
## log_MVR_PTS -0.0021850 0.0004367 -5.004 5.63e-07 ***
## cat_KIDSDRIV 0.5297128 0.0942272 5.622 1.89e-08 ***
## cat_HOMEKIDS 0.3609836 0.0685862 5.263 1.42e-07 ***
## cat_CLM_FREQ 0.5555339 0.0607168 9.150 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7446.2 on 8138 degrees of freedom
## AIC: 7492.2
##
## Number of Fisher Scoring iterations: 5
This model is also identical to the model derived using Backward Elimination.
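As a quick sanity check on these identity claims, the variable sets retained by the two selection methods can be compared directly (a minimal sketch, assuming the backward_AMT and backward_FLAG fits from the Backward Elimination step are still in the workspace):
setequal(names(coef(mallowscp_AMT)), names(coef(backward_AMT)))    # TRUE if same terms retained
setequal(names(coef(mallowscp_FLAG)), names(coef(backward_FLAG)))  # TRUE if same terms retained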
Decide on the criteria for selecting the best multiple linear regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.
For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots.
sum1 <- summary(forward_AMT)
sum2 <- summary(backward_AMT)
sum3 <- summary(adjustedr2_AMT)
sum4 <- summary(mallowscp_AMT)
library(lmtest)
dwtest(forward_AMT)
##
## Durbin-Watson test
##
## data: forward_AMT
## DW = 1.988, p-value = 0.294
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(backward_AMT)
##
## Durbin-Watson test
##
## data: backward_AMT
## DW = 1.988, p-value = 0.2932
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(adjustedr2_AMT)
##
## Durbin-Watson test
##
## data: adjustedr2_AMT
## DW = 1.988, p-value = 0.2932
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(mallowscp_AMT)
##
## Durbin-Watson test
##
## data: mallowscp_AMT
## DW = 1.988, p-value = 0.2932
## alternative hypothesis: true autocorrelation is greater than 0
The Durbin-Watson null hypothesis is that the residuals are not autocorrelated. Since the p-values are large, we fail to reject the null hypothesis for any of the four models. Note that the Durbin-Watson test addresses autocorrelation of the residuals, not multicollinearity among the predictors; multicollinearity is better assessed with variance inflation factors.
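A variance inflation factor check along the following lines would speak to multicollinearity directly (a minimal sketch, assuming the car package is available; a common rule of thumb flags VIFs above 5 to 10):
library(car)
# VIFs for the Mallows Cp TARGET_AMT model; values near 1 indicate
# little linear dependence among the predictors
vif(mallowscp_AMT)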
summary(M[1:n, "TARGET_AMT"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 1504 1036 107600
data.frame("MODEL" = c("forward_AMT", "backward_AMT", "adjustedr2_AMT", "mallowscp_AMT"),
"MSE" = c(sum1$sigma^2, sum2$sigma^2, sum3$sigma^2, sum4$sigma^2),
"RMSE" = c(sum1$sigma, sum2$sigma, sum3$sigma, sum4$sigma))
## MODEL MSE RMSE
## 1 forward_AMT 20795666 4560.227
## 2 backward_AMT 20795771 4560.238
## 3 adjustedr2_AMT 20795771 4560.238
## 4 mallowscp_AMT 20795771 4560.238
The Mean Squared Error is the square of the RMSE. The benefit of the RMSE is that it is expressed in the same units as the target variable. For these models, the residual standard error (RMSE) is quite large relative to the target variable: the standard deviation of the unexplained variation in TARGET_AMT is in the neighborhood of 4560, far exceeding the 1504.325 average claim encountered in the data. These models could likely be improved by adding, combining, or transforming variables that were filtered out earlier in this analysis.
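Note that sigma from summary.lm() is the residual standard error, whose square divides the residual sum of squares by the residual degrees of freedom rather than by \(n\), so it sits slightly above the raw training-set MSE; the two can be compared directly (a minimal sketch using the forward_AMT fit):
# df-adjusted residual variance vs. raw mean squared residual
c(sigma2 = sum1$sigma^2, raw_mse = mean(residuals(forward_AMT)^2))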
data.frame("MODEL" = c("forward_AMT", "backward_AMT", "adjustedr2_AMT", "mallowscp_AMT"),
"R.SQUARED" = c(sum1$r.squared, sum2$r.squared, sum3$r.squared, sum4$r.squared),
"ADJ.R.SQUARED" = c(sum1$adj.r.squared, sum2$adj.r.squared, sum3$adj.r.squared, sum4$adj.r.squared))
## MODEL R.SQUARED ADJ.R.SQUARED
## 1 forward_AMT 0.14876730 0.14730452
## 2 backward_AMT 0.06169727 0.06020004
## 3 adjustedr2_AMT 0.06169727 0.06020004
## 4 mallowscp_AMT 0.06169727 0.06020004
\(R^2\) represents the proportion of the variance in \(Y\) explained by the predictor variables. \(R^2\) is fairly low for these models. \(R^2\), however, is not an adequate performance measure on its own; Adjusted \(R^2\) is more appropriate when models have multiple variables, since it incorporates a penalty for the degrees of freedom consumed by additional variables. The penalty does not improve the evaluation here: Adjusted \(R^2\) is below the already low \(R^2\). Again, it is very likely that these models can be improved upon.
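For reference, the penalty takes the usual form, with \(n\) observations and \(p\) predictors:
\[R_{adj}^2 = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}\]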
data.frame("MODEL" = c("forward_AMT", "backward_AMT", "adjustedr2_AMT", "mallowscp_AMT"),
rbind(sum1$fstatistic, sum2$fstatistic, sum3$fstatistic, sum4$fstatistic))
## MODEL value numdf dendf
## 1 forward_AMT 101.7019 14 8147
## 2 backward_AMT 41.2076 13 8147
## 3 adjustedr2_AMT 41.2076 13 8147
## 4 mallowscp_AMT 41.2076 13 8147
The \(F\)-test evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative that at least one is not. At \(\alpha=0.01\), every \(F\)-statistic above is highly significant, which indicates that the effect of each model is not "a spurious result of oddities in the data set."
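summary.lm() stores the \(F\)-statistic and its degrees of freedom but not the p-value itself; the p-values can be recovered from the \(F\) distribution (a minimal sketch using the stored components):
# upper-tail probability of each model's overall F-statistic
sapply(list(sum1, sum2, sum3, sum4), function(s)
  pf(s$fstatistic["value"], s$fstatistic["numdf"], s$fstatistic["dendf"],
     lower.tail = FALSE))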
# 2x2 diagnostic panels for each candidate model:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
for (fit in list(forward_AMT, backward_AMT, adjustedr2_AMT, mallowscp_AMT)) {
  par(mfrow = c(2, 2))
  plot(fit)
}
The Residuals vs Fitted plots show that the residuals do not follow a linear pattern. The Normal Q-Q plots show that the residuals are also not normally distributed. The Scale-Location plots appear to show some heteroscedasticity, since the line is not horizontal and the points are not equally (randomly) spread. The Residuals vs Leverage plots show that point 7691 has extreme values beyond the Cook's distance bands (dashed curves) that influence the (solid) regression line. Point 7691 also stands out in the other plots, along with points 5389 and 7072.
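The influential observations flagged in the plots can also be listed numerically (a minimal sketch on the forward_AMT fit; the points called out above are expected to appear near the top):
# largest Cook's distances, in decreasing order
head(sort(cooks.distance(forward_AMT), decreasing = TRUE))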
The model derived using Forward Selection has the lowest RMSE, although only by a minimal amount. When it comes to the \(F\)-statistic and Adjusted \(R^2\), however, the forward_AMT model has values that are substantially higher. The chosen model for TARGET_AMT is therefore forward_AMT.
Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.
For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.
library(caret)
training_FLAG[ ,"probability.forward"] <- predict(forward_FLAG, training_FLAG, type="response")
training_FLAG[ ,"class.forward"] <- ifelse(training_FLAG$probability.forward < 0.5, 0, 1)
(cm1 <- confusionMatrix(training_FLAG$class.forward, training_FLAG$TARGET_FLAG, positive = "1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5533 1297
## 1 475 856
##
## Accuracy : 0.7829
## 95% CI : (0.7738, 0.7918)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.363
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3976
## Specificity : 0.9209
## Pos Pred Value : 0.6431
## Neg Pred Value : 0.8101
## Prevalence : 0.2638
## Detection Rate : 0.1049
## Detection Prevalence : 0.1631
## Balanced Accuracy : 0.6593
##
## 'Positive' Class : 1
##
The model derived using Forward Selection has the following performance metrics: Accuracy of 0.7828697, Error Rate of 0.2171303, Precision of 0.6431255, Sensitivity of 0.3975848, Specificity of 0.9209387, and \(F_1\) Score of 0.4913892.
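Each of these figures can be read off the caret object rather than computed by hand; the \(F_1\) score is the harmonic mean of precision and sensitivity (a minimal sketch using cm1):
prec <- cm1$byClass["Pos Pred Value"]  # precision
sens <- cm1$byClass["Sensitivity"]     # recall
2 * prec * sens / (prec + sens)        # F1 score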
training_FLAG[ ,"probability.backward"] <- predict(backward_FLAG, training_FLAG, type="response")
training_FLAG[ ,"class.backward"] <- ifelse(training_FLAG$probability.backward < 0.5, 0, 1)
(cm2 <- confusionMatrix(training_FLAG$class.backward, training_FLAG$TARGET_FLAG, positive = "1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5536 1295
## 1 472 858
##
## Accuracy : 0.7835
## 95% CI : (0.7744, 0.7924)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3647
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3985
## Specificity : 0.9214
## Pos Pred Value : 0.6451
## Neg Pred Value : 0.8104
## Prevalence : 0.2638
## Detection Rate : 0.1051
## Detection Prevalence : 0.1630
## Balanced Accuracy : 0.6600
##
## 'Positive' Class : 1
##
The model derived using Backward Elimination has the following performance metrics: Accuracy of 0.7834824, Error Rate of 0.2165176, Precision of 0.6451128, Sensitivity of 0.3985137, Specificity of 0.9214381, and \(F_1\) Score of 0.4926787.
training_FLAG[ ,"probability.adjustedr2"] <- predict(adjustedr2_FLAG, training_FLAG, type="response")
training_FLAG[ ,"class.adjustedr2"] <- ifelse(training_FLAG$probability.adjustedr2 < 0.5, 0, 1)
(cm3 <- confusionMatrix(training_FLAG$class.adjustedr2, training_FLAG$TARGET_FLAG, positive = "1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5536 1295
## 1 472 858
##
## Accuracy : 0.7835
## 95% CI : (0.7744, 0.7924)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3647
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3985
## Specificity : 0.9214
## Pos Pred Value : 0.6451
## Neg Pred Value : 0.8104
## Prevalence : 0.2638
## Detection Rate : 0.1051
## Detection Prevalence : 0.1630
## Balanced Accuracy : 0.6600
##
## 'Positive' Class : 1
##
The model derived using Adjusted \(R^2\) has the following performance metrics: Accuracy of 0.7834824, Error Rate of 0.2165176, Precision of 0.6451128, Sensitivity of 0.3985137, Specificity of 0.9214381, and \(F_1\) Score of 0.4926787. These metrics are identical to those from Backward Elimination since, as previously mentioned, the two models are identical.
training_FLAG[ ,"probability.mallowscp"] <- predict(mallowscp_FLAG, training_FLAG, type="response")
training_FLAG[ ,"class.mallowscp"] <- ifelse(training_FLAG$probability.mallowscp < 0.5, 0, 1)
(cm4 <- confusionMatrix(training_FLAG$class.mallowscp, training_FLAG$TARGET_FLAG, positive = "1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5536 1295
## 1 472 858
##
## Accuracy : 0.7835
## 95% CI : (0.7744, 0.7924)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3647
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3985
## Specificity : 0.9214
## Pos Pred Value : 0.6451
## Neg Pred Value : 0.8104
## Prevalence : 0.2638
## Detection Rate : 0.1051
## Detection Prevalence : 0.1630
## Balanced Accuracy : 0.6600
##
## 'Positive' Class : 1
##
The model derived using Mallows \(C_p\) has the following performance metrics: Accuracy of 0.7834824, Error Rate of 0.2165176, Precision of 0.6451128, Sensitivity of 0.3985137, Specificity of 0.9214381, and \(F_1\) Score of 0.4926787. These metrics are identical to those from Backward Elimination since, as previously mentioned, the two models are identical.
library(pROC)
par(mfrow=c(2,2))
# Note: the first two curves are built from hard 0/1 classifications while the
# last two use predicted probabilities; probabilities are the more appropriate
# input to roc(), and class-based curves understate the AUC.
plot(roc(training_FLAG$TARGET_FLAG, training_FLAG$class.forward, smooth=F), print.auc=TRUE)
plot(roc(training_FLAG$TARGET_FLAG, training_FLAG$class.backward, smooth=F), print.auc=TRUE)
plot(roc(training_FLAG$TARGET_FLAG, training_FLAG$probability.adjustedr2, smooth=F), print.auc=TRUE)
plot(roc(training_FLAG$TARGET_FLAG, training_FLAG$probability.mallowscp, smooth=F), print.auc=TRUE)
The models with the highest Accuracy were those derived using Backward Elimination, Adjusted \(R^2\), and Mallows \(C_p\), all of which yielded identical results. The greatest Area Under the ROC Curve, however, belonged to the adjustedr2_FLAG and mallowscp_FLAG models (their curves were computed from predicted probabilities, whereas the forward_FLAG and backward_FLAG curves were computed from hard classifications, which depresses the apparent AUC). As such, backward_FLAG is eliminated, and there is indifference between the adjustedr2_FLAG and mallowscp_FLAG models for TARGET_FLAG.
Make predictions using the evaluation data set.
# evaluation rows sit after the n training rows in the combined matrix M
validation <- M[(1+n):(m+n),]
probability <- predict(adjustedr2_FLAG, validation, type="response")
predict_FLAG <- ifelse(probability >= .5, 1, 0)  # classify at the 0.5 threshold
predict_AMT <- predict(forward_AMT, validation)
predict_AMT[predict_FLAG == 0] <- 0              # no predicted crash, no predicted cost
predictions <- data.frame("predict_FLAG" = predict_FLAG, "predict_AMT" = predict_AMT)
display(predictions)
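For submission, the assigned predictions can also be written out to disk (a minimal sketch; the file name is an assumption):
# save the evaluation-set classifications and costs
write.csv(predictions, "insurance_predictions.csv", row.names = FALSE)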
# compare the predicted positive rate against the training prevalence
# (note: n, m, and M are reused here, overwriting the earlier objects)
n <- sum(training_FLAG$TARGET_FLAG)  # training positives
N <- nrow(training_FLAG)             # training size
m <- sum(predict_FLAG)               # predicted positives in evaluation
M <- length(predict_FLAG)            # evaluation size
binom.test(m, M, n / N)
##
## Exact binomial test
##
## data: m and M
## number of successes = 383, number of trials = 2141, p-value <
## 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.2638157
## 95 percent confidence interval:
## 0.1628687 0.1957921
## sample estimates:
## probability of success
## 0.1788884
The prevalence of the positive condition is 26.38% in the training data versus 17.89% among the evaluation-data predictions. Contrary to what the point estimates might suggest at a glance, this difference is statistically significant at \(\alpha = 0.05\): the exact Binomial test above rejects the null hypothesis that the predicted positive rate equals the training prevalence (p < 2.2e-16), and the 95% confidence interval (0.163, 0.196) excludes 0.2638. This is consistent with the 0.5 threshold classifying relatively few positives given the model's modest sensitivity.