In this homework assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or, variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
Variable | Description | Type |
---|---|---|
zn | Proportion of residential land zoned for large lots (over 25000 square feet) | Predictor |
indus | Proportion of non-retail business acres per suburb | Predictor |
chas | Dummy variable for whether the suburb borders the Charles River (1) or not (0) | Predictor |
nox | Nitrogen oxides concentration (parts per 10 million) | Predictor |
rm | Average number of rooms per dwelling | Predictor |
age | Proportion of owner-occupied units built prior to 1940 | Predictor |
dis | Weighted mean of distances to five Boston employment centers | Predictor |
rad | Index of accessibility to radial highways | Predictor |
tax | Full-value property-tax rate per $10,000 | Predictor |
ptratio | Pupil-teacher ratio by town | Predictor |
black | \(1000(\textrm{Bk} - 0.63)^2\) where Bk is the proportion of blacks by town | Predictor |
lstat | Lower status of the population (percent) | Predictor |
medv | Median value of owner-occupied homes in $1000s | Predictor |
target | Whether the crime rate is above the median crime rate (1) or not (0) | Response |
Deliverable: a write-up submitted in PDF format, with four sections, each described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details. Also include the assigned predictions (probabilities and classifications) for the evaluation data set, using a 0.5 threshold.
training <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
"SPS/master/DATA%20621/crime-training-data.csv"))
evaluation <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
"SPS/master/DATA%20621/crime-evaluation-data.csv"))
X <- training[ , -14]    # training predictors
Y <- training[ , 14]     # training response (target)
Z <- evaluation[ , -14]  # evaluation predictors
nonbinary <- c(1:2, 4:13)  # column indices of the non-binary predictors (all except chas)
Describe the size and the variables in the crime training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.
summary(training, digits = 2)
## zn indus chas nox
## Min. : 0 Min. : 0.46 Min. :0.000 Min. :0.39
## 1st Qu.: 0 1st Qu.: 5.14 1st Qu.:0.000 1st Qu.:0.45
## Median : 0 Median : 9.69 Median :0.000 Median :0.54
## Mean : 12 Mean :11.11 Mean :0.071 Mean :0.55
## 3rd Qu.: 16 3rd Qu.:18.10 3rd Qu.:0.000 3rd Qu.:0.62
## Max. :100 Max. :27.74 Max. :1.000 Max. :0.87
## rm age dis rad tax
## Min. :3.9 Min. : 2.9 Min. : 1.1 Min. : 1.0 Min. :187
## 1st Qu.:5.9 1st Qu.: 43.9 1st Qu.: 2.1 1st Qu.: 4.0 1st Qu.:281
## Median :6.2 Median : 77.2 Median : 3.2 Median : 5.0 Median :334
## Mean :6.3 Mean : 68.4 Mean : 3.8 Mean : 9.5 Mean :410
## 3rd Qu.:6.6 3rd Qu.: 94.1 3rd Qu.: 5.2 3rd Qu.:24.0 3rd Qu.:666
## Max. :8.8 Max. :100.0 Max. :12.1 Max. :24.0 Max. :711
## ptratio black lstat medv target
## Min. :13 Min. : 0.32 Min. : 1.7 Min. : 5 Min. :0.00
## 1st Qu.:17 1st Qu.:375.61 1st Qu.: 7.0 1st Qu.:17 1st Qu.:0.00
## Median :19 Median :391.34 Median :11.3 Median :21 Median :0.00
## Mean :18 Mean :357.12 Mean :12.6 Mean :23 Mean :0.49
## 3rd Qu.:20 3rd Qu.:396.24 3rd Qu.:16.9 3rd Qu.:25 3rd Qu.:1.00
## Max. :22 Max. :396.90 Max. :38.0 Max. :50 Max. :1.00
all(complete.cases(training))
## [1] TRUE
Looking at the data summaries for the training data set, we can see that no variables have missing data; every row is a complete case. Several variables are heavily skewed, with maximums that are large relative to their means and medians, while other variables have nearly identical means and medians.
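A quick numeric check of this skew can complement the summaries (a minimal sketch; the statistic is just the standardized gap between mean and median, not a formal skewness estimator):
# Rough skew indicator for each nonbinary predictor: (mean - median) / sd.
# Values far from zero flag variables whose means are pulled by long tails.
skew_check <- sapply(X[, nonbinary], function(v) (mean(v) - median(v)) / sd(v))
round(sort(skew_check, decreasing = TRUE), 2)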
par(mfrow = c(3,4))
for (i in nonbinary) {
plot(X[,i], main = names(X[i]))
}
The index plots suggest that most of the nonbinary variables are neither normally distributed nor randomly scattered.
plot(X[,nonbinary])
The scatterplot matrix appears to indicate the existence of relationships among many of the nonbinary variables.
par(mfrow = c(3,4))
for (i in nonbinary) {
hist(X[ ,i], xlab = names(X[i]), main = names(X[i]))
d <- density(X[,i])
plot(d, main = names(X[i]))
polygon(d, col="red")
}
The histograms and density plots give a better understanding of how the nonbinary data are distributed. The distribution of the variable rm could pass for some variation of a Gaussian or Student-\(t\) distribution. The variables zn, age, dis, ptratio, black, and lstat show heavy skewing. The variables indus, rad, and tax have multimodal distributions.
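To back up the visual impression that rm is roughly Gaussian, a formal check could be run (a minimal sketch using base stats functions; the result is only suggestive given the test's sensitivity at this sample size):
shapiro.test(X[, "rm"])               # Shapiro-Wilk test of normality for rm
qqnorm(X[, "rm"]); qqline(X[, "rm"])  # normal Q-Q plot with reference line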
library(ggplot2)
library(reshape2)
ggplot(data = melt(abs(cor(training))), aes(x=Var1, y=Var2, fill=value)) +
scale_fill_gradient(low = 'black', high = 'red', name = "Absolute Value") +
geom_tile() + labs(title = "Correlation Heatmap") +
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
plot.title = element_text(hjust = 0.5))
There are high levels of correlation between multiple predictor variables and the response variable. The variables chas, rm, medv, and ptratio appear to have weak correlations with the response variable.
PCA <- function(X) {
Xpca <- prcomp(X, center = T, scale. = T)
M <- as.matrix(X); R <- as.matrix(Xpca$rotation); score <- M %*% R
print(list("Importance of Components" = summary(Xpca)$importance[ ,1:5],
"Rotation (Variable Loadings)" = Xpca$rotation[ ,1:5],
"Correlation between X and PC" = cor(X, score)[ ,1:5]))
par(mfrow=c(2,3))
barplot(Xpca$sdev^2, ylab = "Component Variance")
barplot(cor(cbind(X)), ylab = "Correlations")
barplot(Xpca$rotation, ylab = "Loadings")
biplot(Xpca); barplot(M); barplot(score)
}
PCA(X[, nonbinary])
## $`Importance of Components`
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 2.496896 1.236621 1.089458 0.9232161 0.8084919
## Proportion of Variance 0.519540 0.127440 0.098910 0.0710300 0.0544700
## Cumulative Proportion 0.519540 0.646980 0.745890 0.8169100 0.8713900
##
## $`Rotation (Variable Loadings)`
## PC1 PC2 PC3 PC4 PC5
## zn 0.2617869 0.10179109 -0.42024848 0.32932260 -0.44849584
## indus -0.3445800 -0.10979620 0.01624666 -0.02562960 -0.14882012
## nox -0.3331737 -0.26875570 0.07856781 0.20688255 -0.16935178
## rm 0.2130233 -0.53607535 -0.26097007 -0.19937642 0.13225427
## age -0.3082582 -0.26412180 0.24864741 0.04212749 -0.01974695
## dis 0.3083629 0.37071984 -0.22879321 0.05489440 -0.05789147
## rad -0.3031698 -0.07819297 -0.47705930 -0.17008639 -0.21060993
## tax -0.3288806 -0.04433671 -0.41547848 -0.10233784 -0.28935174
## ptratio -0.2148019 0.33932419 -0.15768029 -0.67741655 0.23936038
## black 0.1959990 0.04073157 0.43670630 -0.45832596 -0.72554324
## lstat -0.3232014 0.23280836 0.12543285 0.27134698 -0.11713364
## medv 0.2777568 -0.48135626 -0.05621664 -0.14684284 -0.03583048
##
## $`Correlation between X and PC`
## PC1 PC2 PC3 PC4 PC5
## zn 0.4573489 0.53867205 0.1746846 0.125418514 -0.08577190
## indus -0.7858273 -0.73097510 -0.6210287 -0.020502887 -0.13891401
## nox -0.7351319 -0.76768467 -0.5654169 0.052428758 -0.05776395
## rm 0.3483940 0.06549015 0.2288614 0.009029718 0.05338448
## age -0.6375446 -0.81029567 -0.3794115 -0.002478123 -0.04751827
## dis 0.6350597 0.78248995 0.4167167 0.026349654 0.02748995
## rad -0.8824983 -0.74300713 -0.8517591 0.011500273 -0.21941612
## tax -0.9579570 -0.78307294 -0.9110464 -0.024360439 -0.29499715
## ptratio -0.4874748 -0.27110504 -0.3950028 -0.105792376 -0.12579812
## black 0.6207010 0.53932350 0.7456097 -0.870090627 -0.71581134
## lstat -0.6602411 -0.45765729 -0.4962276 0.105613824 -0.02746172
## medv 0.5679821 0.21017878 0.4512497 -0.113777300 -0.00807784
Detailed above are the first five Principal Components, which account for the majority of the variation, as indicated by the Component Variance bar chart. The absolute magnitudes of the correlations between the variables and the first Principal Component suggest that the main underlying mechanism explaining the variance of the data is driven, in order of magnitude, by tax, rad, and indus, which focus on property values. The second, less important, Principal Component draws most of its variation, in order of magnitude, from age, tax, and dis, which hone in on the socioeconomic status of the community residents.
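If a retention cutoff is needed, the cumulative proportions above can be queried directly (a minimal sketch that refits the PCA outside the function, assuming a 90% variance target):
# Smallest number of principal components explaining at least 90% of the variance.
Xpca <- prcomp(X[, nonbinary], center = TRUE, scale. = TRUE)
which(summary(Xpca)$importance["Cumulative Proportion", ] >= 0.90)[1]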
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
all(complete.cases(training))
## [1] TRUE
No variables have missing data; each row of the training data set is a complete case. As a result, we do not need to develop imputation methods going forward.
corr_XY <- data.frame(array(NA, dim = c(ncol(X), 5)))
colnames(corr_XY) <- c("Y", "X", "r","p","<0.05")
for (i in 1:ncol(X)) {
r <- cor.test(Y, X[,i])
corr_XY[i, 1] <- "target"
corr_XY[i, 2] <- names(X[i])
corr_XY[i, 3] <- r$estimate
corr_XY[i, 4] <- r$p.value
corr_XY[i, 5] <- corr_XY[i, 4] < 0.05
}; corr_XY
## Y X r p <0.05
## 1 target zn -0.43168176 1.415603e-22 TRUE
## 2 target indus 0.60485074 7.845651e-48 TRUE
## 3 target chas 0.08004187 8.434811e-02 FALSE
## 4 target nox 0.72610622 1.680131e-77 TRUE
## 5 target rm -0.15255334 9.542249e-04 TRUE
## 6 target age 0.63010625 6.245736e-53 TRUE
## 7 target dis -0.61867312 1.450176e-50 TRUE
## 8 target rad 0.62810492 1.647597e-52 TRUE
## 9 target tax 0.61111331 4.709566e-49 TRUE
## 10 target ptratio 0.25084892 4.052730e-08 TRUE
## 11 target black -0.35295680 4.062541e-15 TRUE
## 12 target lstat 0.46912702 7.071431e-27 TRUE
## 13 target medv -0.27055071 2.924931e-09 TRUE
There is significant correlation between the response variable and nearly all of the predictor variables at an \(\alpha=0.05\) significance level. The predictor variable chas does not have a statistically significant correlation with the response variable and is therefore not considered for the model.
X <- X[ , -which(names(X) %in% "chas")]
Z <- Z[ , -which(names(Z) %in% "chas")]
corr_XX <- data.frame(array(NA, dim = c(choose(ncol(X), 2), 5)))
colnames(corr_XX) <- c("X1", "X2", "r","p","<0.05"); k = 1
for (i in 1:(ncol(X) - 1)) {
for (j in (i+1):ncol(X)) {
r <- cor.test(X[,i], X[,j])
corr_XX[k, 1] <- names(X[i])
corr_XX[k, 2] <- names(X[j])
corr_XX[k, 3] <- r$estimate
corr_XX[k, 4] <- r$p.value
corr_XX[k, 5] <- corr_XX[k, 4] < 0.05
k = k + 1
}
}; head(corr_XX); all(corr_XX[,5])
## X1 X2 r p <0.05
## 1 zn indus -0.5382664 2.319176e-36 TRUE
## 2 zn nox -0.5170452 3.238655e-33 TRUE
## 3 zn rm 0.3198141 1.528592e-12 TRUE
## 4 zn age -0.5725805 6.040999e-42 TRUE
## 5 zn dis 0.6601243 1.220397e-59 TRUE
## 6 zn rad -0.3154812 3.151098e-12 TRUE
## [1] TRUE
There is significant correlation between nearly all of the remaining predictor variables at an \(\alpha=0.05\) significance level. These correlations range from weak to moderate to strong.
correlations <- function(data, threshold) {
columns <- ncol(data)
cors <- round(cor(data[,-columns]), 5)
cors[upper.tri(cors, diag = T)] <- NA
cors <- na.omit(melt(cors))
abs_cors <- abs(cors["value"])
cors[abs_cors < 1 & abs_cors > threshold, ]
}
correlations(X, 0.7)
## Var1 Var2 value
## 14 nox indus 0.75963
## 17 dis indus -0.70362
## 19 tax indus 0.73223
## 27 age nox 0.73513
## 28 dis nox -0.76888
## 50 dis age -0.75090
## 74 tax rad 0.90646
The variables nox, indus, dis, age, and medv have strong correlations with multiple variables, but the highest correlation, 0.9064632, is between tax and rad. This correlation is so high that it makes sense to exclude one of the two variables from the model. Therefore, given that \(cor(\textrm{target},\textrm{rad})=0.6281049 > cor(\textrm{target},\textrm{tax})=0.6111133\), the predictor variable tax, being the one less correlated with the response variable, will not be considered for the model.
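Variance inflation factors give a complementary view of this collinearity before tax is dropped (a minimal sketch; it fits a throwaway logistic model on the current predictors and uses car::vif, loading car here even though it is attached again later):
library(car)
# VIF values above roughly 5-10 are a common rule of thumb for problematic collinearity.
vif_fit <- glm(Y ~ ., data = X, family = binomial(link = "logit"))
sort(vif(vif_fit), decreasing = TRUE)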
X <- X[ , -which(names(X) %in% "tax")]
Z <- Z[ , -which(names(Z) %in% "tax")]
depend <- data.frame(array(NA, dim = c(ncol(X), 5)))
colnames(depend) <- c("Y", "X", "chi-stat", "Chi-test","<0.05")
for (i in 1:ncol(X)) {
chi <- chisq.test(table(Y, X[,i]), simulate.p.value = TRUE)
depend[i, 1] <- "target"
depend[i, 2] <- names(X[i])
depend[i, 3] <- chi$statistic
depend[i, 4] <- chi$p.value
depend[i, 5] <- depend[i, 4] < 0.05
}; depend
## Y X chi-stat Chi-test <0.05
## 1 target zn 124.2333 0.0004997501 TRUE
## 2 target indus 423.2214 0.0004997501 TRUE
## 3 target nox 418.0914 0.0004997501 TRUE
## 4 target rm 433.3237 0.0194902549 TRUE
## 5 target age 405.7441 0.0004997501 TRUE
## 6 target dis 455.9971 0.0004997501 TRUE
## 7 target rad 229.9859 0.0004997501 TRUE
## 8 target ptratio 360.8258 0.0004997501 TRUE
## 9 target black 342.6660 0.0064967516 TRUE
## 10 target lstat 415.9853 0.9260369815 FALSE
## 11 target medv 295.4259 0.0004997501 TRUE
All but one of the \(X\) variables show evidence of statistically significant dependence with \(Y\). The variable lstat, which is not statistically significantly dependent, will not be included in the model. It is worth noting that the function simulated p-values because very small expected cell counts could produce incorrect approximations of the p-value. This simulation led to duplicate p-values in some cases, but the \(\chi^2\) test statistics remained unique.
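Because those p-values are Monte Carlo estimates, they will shift slightly from run to run; seeding the random number generator before the loop would make the table reproducible (a minimal sketch for a single variable, with an arbitrary seed and the default number of replicates spelled out via B):
set.seed(621)  # arbitrary fixed seed for reproducible simulated p-values
chi <- chisq.test(table(Y, X[, "rad"]), simulate.p.value = TRUE, B = 2000)
chi$p.value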
X <- X[ , -which(names(X) %in% "lstat")]
Z <- Z[ , -which(names(Z) %in% "lstat")]
library(MASS)
potential <- match(c("zn","dis","medv"), names(X))
lambda <- numeric(length(potential))
par(mfrow=c(1,3))
for (j in seq_along(potential)) {
  i <- potential[j]
  shifted <- X[,i] - min(X[,i]) + 1e-32        # shift values to be strictly positive
  fit_exp <- fitdistr(shifted, "Exponential")  # maximum-likelihood estimate of the rate
  lambda[j] <- fit_exp$estimate                # store rates positionally: zn, dis, medv
  exp_sample <- rexp(1000, lambda[j])
  hist(X[,i], prob=TRUE, col="grey", main=names(X[i]),
       xlab=paste("Lambda =", fractions(lambda[j])))
  lines(density(exp_sample), col="blue", lwd=2)
}
Only two of the three potential variables lend themselves toward modeling with an exponential distribution. The variables were shifted to be strictly positive by subtracting the minimum value and then adding \(10^{-32}\). This shift also works for data with a negative minimum, since subtracting a negative minimum is equivalent to adding its absolute value.
X[, "log_zn"] <- log(X[, "zn"] - min(X[, "zn"]) + 1e-32, lambda[1])
X[, "log_dis"] <- log(X[, "dis"] - min(X[, "dis"]) + 1e-32, lambda[2])
Z[, "log_zn"] <- log(Z[, "zn"] - min(Z[, "zn"]) + 1e-32, lambda[1])
Z[, "log_dis"] <- log(Z[, "dis"] - min(Z[, "dis"]) + 1e-32, lambda[2])
par(mfrow=c(1,2))
plot(X[,"indus"]); abline(h = 15, col="red"); abline(h = 18, col="blue")
plot(X[,"rad"]); abline(h = 8, col="red"); abline(h = 24, col="blue")
The indus and rad predictor variables have bimodal distributions. There are clear lines of demarcation in their values that we can use to bifurcate each variable into two categories.
X[,"cat_indus"] <- ifelse(X[,"indus"] < (15 + 18) / 2, 0, 1)
X[,"cat_rad"] <- ifelse(X[,"rad"] < (8 + 24) / 2, 0, 1)
Z[,"cat_indus"] <- ifelse(Z[,"indus"] < (15 + 18) / 2, 0, 1)
Z[,"cat_rad"] <- ifelse(Z[,"rad"] < (8 + 24) / 2, 0, 1)
library(car)
potential <- match(c("nox","rm","age","ptratio","black","medv"), names(X))
box.cox.powers <- powerTransform(X[, potential], family="bcPower")
summary(box.cox.powers)
## bcPower Transformations to Multinormality
## Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
## nox -1.4797 0.1966 -1.8650 -1.0944
## rm 0.9578 0.2409 0.4857 1.4299
## age 1.4414 0.0895 1.2658 1.6169
## ptratio 4.3789 0.4112 3.5729 5.1849
## black 3.5877 0.1767 3.2414 3.9341
## medv 0.3632 0.0684 0.2291 0.4974
##
## Likelihood ratio tests about transformation parameters
## LRT df pval
## LR test, lambda = (0 0 0 0 0 0) 2010.4335224 6 0.0000000
## LR test, lambda = (1 1 1 1 1 1) 770.6317985 6 0.0000000
## LR test, lambda = (-1.48 1 1.44 4.38 3.59 0.33) 0.3049445 6 0.9994728
Five of the six potential variables lend themselves toward a Box-Cox transformation. The variable rm has an estimated power very near one, which indicates no transformation is needed; this is further supported by its Wald lower and upper bounds, which straddle one.
X[, "box_nox"] <- X[, "nox"]^(box.cox.powers$lambda[1])
X[, "box_age"] <- X[, "age"]^(box.cox.powers$lambda[3])
X[, "box_ptratio"] <- X[, "ptratio"]^(box.cox.powers$lambda[4])
X[, "box_black"] <- X[, "black"]^(box.cox.powers$lambda[5])
X[, "box_medv"] <- X[, "medv"]^(box.cox.powers$lambda[6])
Z[, "box_nox"] <- Z[, "nox"]^(box.cox.powers$lambda[1])
Z[, "box_age"] <- Z[, "age"]^(box.cox.powers$lambda[3])
Z[, "box_ptratio"] <- Z[, "ptratio"]^(box.cox.powers$lambda[4])
Z[, "box_black"] <- Z[, "black"]^(box.cox.powers$lambda[5])
Z[, "box_medv"] <- Z[, "medv"]^(box.cox.powers$lambda[6])
Using the training data, build at least three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.
Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.
Forward stepwise subset selection based on \(AIC\). Using \(k = 2\) degrees of freedom for the penalty gives the genuine \(AIC\); using \(k = \log(n)\) is sometimes referred to as \(BIC\) or \(SBC\).
training <- data.frame(X[,c(4, 11:19)], "target" = Y)
null <- glm(target ~ 0, family = binomial(link = "logit"), training)
full <- glm(target ~ ., family = binomial(link = "logit"), training)
aic_steps <- step(null, scope=list(lower=null, upper=full), direction="forward", k = 2)
## Start: AIC=646.01
## target ~ 0
##
## Df Deviance AIC
## + cat_rad 1 478.27 480.27
## + cat_indus 1 488.09 490.09
## + box_age 1 607.52 609.52
## + log_zn 1 619.68 621.68
## + box_nox 1 622.82 624.82
## + box_black 1 635.03 637.03
## + box_ptratio 1 639.78 641.78
## <none> 646.01 646.01
## + box_medv 1 644.03 646.03
## + rm 1 645.47 647.47
##
## Step: AIC=480.27
## target ~ cat_rad - 1
##
## Df Deviance AIC
## + box_nox 1 388.35 392.35
## + box_black 1 412.54 416.54
## + box_medv 1 427.18 431.18
## + rm 1 428.57 432.57
## + box_ptratio 1 441.94 445.94
## + cat_indus 1 459.85 463.85
## + log_zn 1 474.90 478.90
## <none> 478.27 480.27
## + box_age 1 476.99 480.99
##
## Step: AIC=392.35
## target ~ cat_rad + box_nox - 1
##
## Df Deviance AIC
## + rm 1 259.19 265.19
## + box_medv 1 266.72 272.72
## + box_age 1 278.80 284.80
## + log_zn 1 328.91 334.91
## + cat_indus 1 338.87 344.87
## + box_black 1 362.75 368.75
## + box_ptratio 1 364.81 370.81
## <none> 388.35 392.35
##
## Step: AIC=265.19
## target ~ cat_rad + box_nox + rm - 1
##
## Df Deviance AIC
## + box_age 1 251.68 259.68
## + box_ptratio 1 251.90 259.90
## + log_zn 1 252.43 260.43
## + box_black 1 255.18 263.18
## <none> 259.19 265.19
## + box_medv 1 258.76 266.76
## + cat_indus 1 259.18 267.18
##
## Step: AIC=259.68
## target ~ cat_rad + box_nox + rm + box_age - 1
##
## Df Deviance AIC
## + box_black 1 245.84 255.84
## + box_ptratio 1 248.56 258.56
## + log_zn 1 248.93 258.93
## + box_medv 1 249.06 259.06
## <none> 251.68 259.68
## + cat_indus 1 250.82 260.82
##
## Step: AIC=255.83
## target ~ cat_rad + box_nox + rm + box_age + box_black - 1
##
## Df Deviance AIC
## + box_medv 1 240.95 252.95
## + box_ptratio 1 241.32 253.32
## + log_zn 1 242.41 254.41
## <none> 245.84 255.84
## + cat_indus 1 244.70 256.70
##
## Step: AIC=252.95
## target ~ cat_rad + box_nox + rm + box_age + box_black + box_medv -
## 1
##
## Df Deviance AIC
## + box_ptratio 1 231.56 245.56
## + log_zn 1 237.83 251.83
## <none> 240.95 252.95
## + cat_indus 1 239.36 253.36
##
## Step: AIC=245.55
## target ~ cat_rad + box_nox + rm + box_age + box_black + box_medv +
## box_ptratio - 1
##
## Df Deviance AIC
## + cat_indus 1 227.67 243.67
## <none> 231.56 245.56
## + log_zn 1 231.47 247.47
##
## Step: AIC=243.67
## target ~ cat_rad + box_nox + rm + box_age + box_black + box_medv +
## box_ptratio + cat_indus - 1
##
## Df Deviance AIC
## <none> 227.67 243.67
## + log_zn 1 227.57 245.57
The model target ~ cat_rad + box_nox + rm + box_age + box_black + box_medv + box_ptratio + cat_indus - 1 has the lowest AIC of 243.67.
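For comparison, the heavier BIC-style penalty mentioned above is a one-argument change to the same forward search (a sketch only; its selected model is not reported here):
# Forward selection with k = log(n), i.e. the BIC/SBC penalty; trace = 0 suppresses the step log.
bic_steps <- step(null, scope = list(lower = null, upper = full),
                  direction = "forward", k = log(nrow(training)), trace = 0)
formula(bic_steps)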
forward <- glm(target ~ cat_rad + box_nox + rm + box_age + box_black + box_medv +
box_ptratio + cat_indus - 1, family = binomial(link = "logit"), training)
summary(forward)
##
## Call:
## glm(formula = target ~ cat_rad + box_nox + rm + box_age + box_black +
## box_medv + box_ptratio + cat_indus - 1, family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.53887 -0.28220 -0.03420 0.00018 2.94797
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## cat_rad 1.882e+01 8.365e+02 0.022 0.982050
## box_nox -3.580e+00 5.755e-01 -6.221 4.93e-10 ***
## rm -2.151e-01 5.025e-01 -0.428 0.668554
## box_age 3.141e-03 1.007e-03 3.120 0.001810 **
## box_black -1.747e-09 5.458e-10 -3.201 0.001368 **
## box_medv 3.404e+00 1.025e+00 3.320 0.000899 ***
## box_ptratio 3.704e-06 1.131e-06 3.276 0.001053 **
## cat_indus -1.074e+00 5.457e-01 -1.967 0.049133 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 646.01 on 466 degrees of freedom
## Residual deviance: 227.67 on 458 degrees of freedom
## AIC: 243.67
##
## Number of Fisher Scoring iterations: 18
Using the 8 forward-selected variables leads to a model with two insignificant variables: cat_rad and rm. Removing those two insignificant variables causes cat_indus to be significant only at a level of \(\alpha=0.1\). If cat_indus is also removed, all remaining variables become significant at a level of \(\alpha=0.01\) or lower.
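The reduced model described above can be obtained directly from the forward fit (a minimal sketch using update; its coefficient table is omitted here):
# Drop the two insignificant terms from the forward-selected model and refit.
forward_reduced <- update(forward, . ~ . - cat_rad - rm)
summary(forward_reduced)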
Backward stepwise subset elimination based on \(AIC\). Using \(k = 2\) degrees of freedom for the penalty gives the genuine \(AIC\); using \(k = \log(n)\) is sometimes referred to as \(BIC\) or \(SBC\).
aic_steps <- step(full, scope=list(lower=null, upper=full), direction="backward", k = 2)
## Start: AIC=246.2
## target ~ rm + log_zn + log_dis + cat_indus + cat_rad + box_nox +
## box_age + box_ptratio + box_black + box_medv
##
##
## Step: AIC=246.2
## target ~ rm + log_zn + cat_indus + cat_rad + box_nox + box_age +
## box_ptratio + box_black + box_medv
##
## Df Deviance AIC
## - log_zn 1 226.20 244.20
## - rm 1 226.36 244.36
## <none> 226.20 246.20
## - box_age 1 231.13 249.13
## - cat_indus 1 231.25 249.25
## - box_ptratio 1 232.02 250.02
## - box_medv 1 233.86 251.86
## - box_black 1 240.63 258.63
## - cat_rad 1 258.81 276.81
## - box_nox 1 280.14 298.14
##
## Step: AIC=244.2
## target ~ rm + cat_indus + cat_rad + box_nox + box_age + box_ptratio +
## box_black + box_medv
##
## Df Deviance AIC
## - rm 1 226.36 242.36
## <none> 226.20 244.20
## - box_age 1 231.24 247.24
## - cat_indus 1 231.29 247.29
## - box_ptratio 1 233.09 249.09
## - box_medv 1 234.11 250.11
## - box_black 1 240.86 256.86
## - cat_rad 1 258.83 274.83
## - box_nox 1 282.45 298.45
##
## Step: AIC=242.36
## target ~ cat_indus + cat_rad + box_nox + box_age + box_ptratio +
## box_black + box_medv
##
## Df Deviance AIC
## <none> 226.36 242.36
## - cat_indus 1 231.43 245.43
## - box_age 1 231.49 245.49
## - box_ptratio 1 233.16 247.16
## - box_black 1 240.86 254.86
## - box_medv 1 247.76 261.76
## - cat_rad 1 259.85 273.85
## - box_nox 1 284.52 298.52
The model target ~ cat_indus + cat_rad + box_nox + box_age + box_ptratio + box_black + box_medv has the lowest AIC of 242.36.
backward <- glm(target ~ cat_indus + cat_rad + box_nox + box_age + box_ptratio +
box_black + box_medv, family = binomial(link = "logit"), training)
summary(backward)
##
## Call:
## glm(formula = target ~ cat_indus + cat_rad + box_nox + box_age +
## box_ptratio + box_black + box_medv, family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.69233 -0.26840 -0.03364 0.00017 2.98810
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.140e+00 2.604e+00 1.206 0.22799
## cat_indus -1.299e+00 5.814e-01 -2.234 0.02547 *
## cat_rad 1.865e+01 8.349e+02 0.022 0.98218
## box_nox -3.927e+00 6.446e-01 -6.091 1.12e-09 ***
## box_age 2.346e-03 1.050e-03 2.235 0.02541 *
## box_ptratio 3.022e-06 1.176e-06 2.569 0.01020 *
## box_black -1.965e-09 6.169e-10 -3.185 0.00145 **
## box_medv 2.610e+00 6.155e-01 4.241 2.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 226.36 on 458 degrees of freedom
## AIC: 242.36
##
## Number of Fisher Scoring iterations: 18
Using the 8 backward-eliminated variables leads to a model with two insignificant terms: cat_rad and the intercept. Removing those two insignificant terms causes cat_indus to be significant only at a level of \(\alpha=0.1\). If cat_indus is also removed, all remaining variables become significant at a level of \(\alpha=0.01\) or lower. The model obtained after culling the insignificant variables from Backward Elimination is the same model obtained by culling the insignificant variables from Forward Selection.
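Placing the two stepwise candidates side by side on the AIC scale makes the comparison explicit (a minimal sketch; lower AIC is better and df counts the estimated coefficients):
# Compare the forward-selected and backward-eliminated fits by AIC.
AIC(forward, backward)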
library(leaps)
(model_sum <- summary(regsubsets(target ~ ., training)))
## Reordering variables and trying again:
## Subset selection object
## Call: regsubsets.formula(target ~ ., training)
## 10 Variables (and intercept)
## Forced in Forced out
## rm FALSE FALSE
## log_zn FALSE FALSE
## cat_indus FALSE FALSE
## cat_rad FALSE FALSE
## box_nox FALSE FALSE
## box_age FALSE FALSE
## box_ptratio FALSE FALSE
## box_black FALSE FALSE
## box_medv FALSE FALSE
## log_dis FALSE FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
## rm log_zn log_dis cat_indus cat_rad box_nox box_age box_ptratio
## 1 ( 1 ) " " " " " " " " " " "*" " " " "
## 2 ( 1 ) " " " " " " " " "*" "*" " " " "
## 3 ( 1 ) " " " " " " " " "*" "*" " " " "
## 4 ( 1 ) " " " " " " " " "*" "*" "*" " "
## 5 ( 1 ) " " " " " " " " "*" "*" "*" " "
## 6 ( 1 ) " " " " " " " " "*" "*" "*" "*"
## 7 ( 1 ) " " " " " " "*" "*" "*" "*" "*"
## 8 ( 1 ) "*" " " " " "*" "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*"
## box_black box_medv
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
plot(model_sum$adjr2, xlab = "Number of Variables", ylab = "Adj R-squared")
max(model_sum$adjr2)
## [1] 0.6221241
which.max(model_sum$adjr2)
## [1] 7
The maximum Adjusted \(R^2\) of 0.6221241 is reached when the model contains 7 variables. The variables in this model are nearly identical to those in the model derived using Backward Elimination; the sole difference is that this model explicitly includes the intercept.
model_sum$which[which.max(model_sum$adjr2), ]
## (Intercept) rm log_zn log_dis cat_indus cat_rad
## TRUE FALSE FALSE FALSE TRUE TRUE
## box_nox box_age box_ptratio box_black box_medv
## TRUE TRUE TRUE TRUE TRUE
adjustedr2 <- glm(target ~ 1 + cat_indus + cat_rad + box_nox + box_age +
box_ptratio + box_black + box_medv, family = binomial(link = "logit"), training)
summary(adjustedr2)
##
## Call:
## glm(formula = target ~ 1 + cat_indus + cat_rad + box_nox + box_age +
## box_ptratio + box_black + box_medv, family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.69233 -0.26840 -0.03364 0.00017 2.98810
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.140e+00 2.604e+00 1.206 0.22799
## cat_indus -1.299e+00 5.814e-01 -2.234 0.02547 *
## cat_rad 1.865e+01 8.349e+02 0.022 0.98218
## box_nox -3.927e+00 6.446e-01 -6.091 1.12e-09 ***
## box_age 2.346e-03 1.050e-03 2.235 0.02541 *
## box_ptratio 3.022e-06 1.176e-06 2.569 0.01020 *
## box_black -1.965e-09 6.169e-10 -3.185 0.00145 **
## box_medv 2.610e+00 6.155e-01 4.241 2.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 226.36 on 458 degrees of freedom
## AIC: 242.36
##
## Number of Fisher Scoring iterations: 18
plot(model_sum$cp, xlab = "Number of Variables", ylab = "Mallows Cp")
min(model_sum$cp)
## [1] 5.378017
which.min(model_sum$cp)
## [1] 6
The minimum Mallows \(C_p\) of 5.3780169 is reached when the model contains 6 variables.
model_sum$which[which.min(model_sum$cp), ]
## (Intercept) rm log_zn log_dis cat_indus cat_rad
## TRUE FALSE FALSE FALSE FALSE TRUE
## box_nox box_age box_ptratio box_black box_medv
## TRUE TRUE TRUE TRUE TRUE
mallowscp <- glm(target ~ 1 + cat_rad + box_nox + box_age + box_ptratio +
box_black + box_medv, family = binomial(link = "logit"), training)
summary(mallowscp)
##
## Call:
## glm(formula = target ~ 1 + cat_rad + box_nox + box_age + box_ptratio +
## box_black + box_medv, family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.43901 -0.32234 -0.05351 0.00018 2.81144
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.303e+00 2.468e+00 0.528 0.59750
## cat_rad 1.793e+01 8.530e+02 0.021 0.98323
## box_nox -3.262e+00 5.210e-01 -6.261 3.82e-10 ***
## box_age 1.987e-03 1.019e-03 1.951 0.05110 .
## box_ptratio 2.823e-06 1.184e-06 2.384 0.01713 *
## box_black -1.730e-09 5.872e-10 -2.947 0.00321 **
## box_medv 2.524e+00 6.041e-01 4.178 2.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 231.43 on 459 degrees of freedom
## AIC: 245.43
##
## Number of Fisher Scoring iterations: 18
Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.
For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision (Pos Pred Value), (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix.
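The requested metrics can be pulled from each caret confusion-matrix object with a small helper (a sketch; it recomputes the error rate and \(F_1\) score from fields confusionMatrix is known to return, e.g. metrics(cm1) after the first confusion matrix below):
# Summarize accuracy, error rate, precision, sensitivity, specificity, and F1
# from a caret::confusionMatrix object.
metrics <- function(cm) {
  acc  <- unname(cm$overall["Accuracy"])
  sens <- unname(cm$byClass["Sensitivity"])
  spec <- unname(cm$byClass["Specificity"])
  prec <- unname(cm$byClass["Pos Pred Value"])
  c(Accuracy = acc, ErrorRate = 1 - acc, Precision = prec,
    Sensitivity = sens, Specificity = spec, F1 = 2 * prec * sens / (prec + sens))
}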
library(caret)
training[ ,"probability.forward"] <- predict(forward, training, type="response")
training[ ,"class.forward"] <- ifelse(training$probability.forward < 0.5, 0, 1)
(cm1 <- confusionMatrix(training$class.forward, training$target, positive = "1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 211 32
## 1 26 197
##
## Accuracy : 0.8755
## 95% CI : (0.8421, 0.9041)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7509
## Mcnemar's Test P-Value : 0.5115
##
## Sensitivity : 0.8603
## Specificity : 0.8903
## Pos Pred Value : 0.8834
## Neg Pred Value : 0.8683
## Prevalence : 0.4914
## Detection Rate : 0.4227
## Detection Prevalence : 0.4785
## Balanced Accuracy : 0.8753
##
## 'Positive' Class : 1
##
The model derived using Forward Selection has the following performance metrics: Accuracy of 0.8755365, Error Rate of 0.1244635, Precision of 0.8834081, Sensitivity of 0.860262, Specificity of 0.8902954, and \(F_1\) Score of 0.8716814.
training[ ,"probability.backward"] <- predict(backward, training, type="response")
training[ ,"class.backward"] <- ifelse(training$probability.backward < 0.5, 0, 1)
(cm2 <- confusionMatrix(training$class.backward, training$target, positive = "1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 213 33
## 1 24 196
##
## Accuracy : 0.8777
## 95% CI : (0.8444, 0.906)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7551
## Mcnemar's Test P-Value : 0.2893
##
## Sensitivity : 0.8559
## Specificity : 0.8987
## Pos Pred Value : 0.8909
## Neg Pred Value : 0.8659
## Prevalence : 0.4914
## Detection Rate : 0.4206
## Detection Prevalence : 0.4721
## Balanced Accuracy : 0.8773
##
## 'Positive' Class : 1
##
The model derived using Backward Elimination has the following performance metrics: Accuracy of 0.8776824, Error Rate of 0.1223176, Precision of 0.8909091, Sensitivity of 0.8558952, Specificity of 0.8987342, and \(F_1\) Score of 0.8730512.
training[ ,"probability.adjustedr2"] <- predict(adjustedr2, training, type="response")
training[ ,"class.adjustedr2"] <- ifelse(training$probability.adjustedr2 < 0.5, 0, 1)
(cm3 <- confusionMatrix(training$class.adjustedr2, training$target, positive = "1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 213 33
## 1 24 196
##
## Accuracy : 0.8777
## 95% CI : (0.8444, 0.906)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7551
## Mcnemar's Test P-Value : 0.2893
##
## Sensitivity : 0.8559
## Specificity : 0.8987
## Pos Pred Value : 0.8909
## Neg Pred Value : 0.8659
## Prevalence : 0.4914
## Detection Rate : 0.4206
## Detection Prevalence : 0.4721
## Balanced Accuracy : 0.8773
##
## 'Positive' Class : 1
##
The model derived using Adjusted \(R^2\) has the following performance metrics: Accuracy of 0.8776824, Error Rate of 0.1223176, Precision of 0.8909091, Sensitivity of 0.8558952, Specificity of 0.8987342, and \(F_1\) Score of 0.8730512. These metrics are identical to those from Backward Elimination since, as previously mentioned, both models are nearly identical except for the intercept.
training[ ,"probability.mallowscp"] <- predict(mallowscp, training, type="response")
training[ ,"class.mallowscp"] <- ifelse(training$probability.mallowscp < 0.5, 0, 1)
(cm4 <- confusionMatrix(training$class.mallowscp, training$target, positive = "1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 215 36
## 1 22 193
##
## Accuracy : 0.8755
## 95% CI : (0.8421, 0.9041)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7507
## Mcnemar's Test P-Value : 0.08783
##
## Sensitivity : 0.8428
## Specificity : 0.9072
## Pos Pred Value : 0.8977
## Neg Pred Value : 0.8566
## Prevalence : 0.4914
## Detection Rate : 0.4142
## Detection Prevalence : 0.4614
## Balanced Accuracy : 0.8750
##
## 'Positive' Class : 1
##
The model derived using Mallows \(C_p\) has the following performance metrics: Accuracy of 0.8755365, Error Rate of 0.1244635, Precision of 0.8976744, Sensitivity of 0.8427948, Specificity of 0.907173, and \(F_1\) Score of 0.8693694. This model has the same Accuracy as the model derived using Forward Selection, with better Precision and Specificity, but lower Sensitivity and a slightly lower \(F_1\) Score.
library(pROC)
par(mfrow=c(2,2))
plot(roc(training$target, training$class.forward, smooth=F), print.auc=TRUE)
plot(roc(training$target, training$class.backward, smooth=F), print.auc=TRUE)
plot(roc(training$target, training$probability.adjustedr2, smooth=F), print.auc=TRUE)
plot(roc(training$target, training$probability.mallowscp, smooth=F), print.auc=TRUE)
The models with the highest Accuracy were the Backward Elimination and the Adjusted \(R^2\) models. These models are nearly identical except that the latter explicitly calls for the intercept. This small difference gives the latter a greater Area Under the ROC Curve, which is why the Adjusted \(R^2\) model was selected as the final model.
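The underlying AUC values can also be extracted numerically rather than read off the plots (a minimal sketch applying pROC's auc to the fitted probabilities of the two leading candidates):
# Numeric AUC for the Backward Elimination and Adjusted R-squared fits.
auc(training$target, training$probability.backward)
auc(training$target, training$probability.adjustedr2)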
Make predictions using the evaluation data set.
evaluation <- Z[,c(4, 11:19)]
evaluation[, "pred.prob"] <- predict(adjustedr2, evaluation, type="response")
evaluation[, "pred.class"] <- ifelse(evaluation[, 11] >= 0.5, 1, 0);
evaluation[, c("pred.prob", "pred.class")]
## pred.prob pred.class
## 1 0.163154389 0
## 2 0.629665732 1
## 3 0.629928820 1
## 4 0.929447056 1
## 5 0.088687899 0
## 6 0.025469918 0
## 7 0.055673427 0
## 8 0.038245612 0
## 9 0.020602008 0
## 10 0.020450669 0
## 11 0.535389993 1
## 12 0.545614949 1
## 13 0.567608202 1
## 14 0.964682451 1
## 15 0.674913590 1
## 16 0.180760734 0
## 17 0.253280855 0
## 18 0.814707701 1
## 19 0.017192104 0
## 20 0.001858639 0
## 21 0.001836118 0
## 22 0.106569894 0
## 23 0.295344006 0
## 24 0.128110066 0
## 25 0.087140207 0
## 26 0.190403073 0
## 27 0.007822306 0
## 28 1.000000000 1
## 29 1.000000000 1
## 30 1.000000000 1
## 31 1.000000000 1
## 32 0.999999998 1
## 33 0.999999997 1
## 34 0.999999999 1
## 35 1.000000000 1
## 36 1.000000000 1
## 37 0.999999999 1
## 38 0.999999997 1
## 39 0.541989846 1
## 40 0.851430890 1
n <- sum(training$target)
N <- nrow(training)
m <- sum(evaluation$pred.class)
M <- nrow(evaluation)
binom.test(m, M, n / N)
##
## Exact binomial test
##
## data: m and M
## number of successes = 22, number of trials = 40, p-value = 0.528
## alternative hypothesis: true probability of success is not equal to 0.4914163
## 95 percent confidence interval:
## 0.3849068 0.7074116
## sample estimates:
## probability of success
## 0.55
The prevalence of the positive condition is 49.14% in the training data and 55% in the evaluation predictions. Although these figures differ somewhat, the difference is not significant at the \(\alpha = 0.05\) level, as the binomial test above shows.