Overview

In this homework assignment, you will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Variable Description Type
zn Proportion of residential land zoned for large lots (over 25000 square feet) Predictor
indus Proportion of non-retail business acres per suburb Predictor
chas Dummy variable for whether the suburb borders the Charles River (1) or not (0) Predictor
nox Nitrogen oxides concentration (parts per 10 million) Predictor
rm Average number of rooms per dwelling Predictor
age Proportion of owner-occupied units built prior to 1940 Predictor
dis Weighted mean of distances to five Boston employment centers Predictor
rad Index of accessibility to radial highways Predictor
tax Full-value property-tax rate per $10,000 Predictor
ptratio Pupil-teacher ratio by town Predictor
black \(1000(\textrm{Bk} - 0.63)^2\) where Bk is the proportion of blacks by town Predictor
lstat Lower status of the population (percent) Predictor
medv Median value of owner-occupied homes in $1000s Predictor
target Whether the crime rate is above the median crime rate (1) or not (0) Response

Deliverables: a write-up submitted in PDF format, with the four sections described below, plus assigned predictions (probabilities and classifications) for the evaluation data set using a 0.5 threshold. You may assume you are addressing a fellow data scientist, so there is no need to shy away from technical details.

# Read the training and evaluation data sets from GitHub
training <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
                            "SPS/master/DATA%20621/crime-training-data.csv"))
evaluation <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
                              "SPS/master/DATA%20621/crime-evaluation-data.csv"))
X <- training[ , -14]      # predictors (columns 1-13)
Y <- training[ , 14]       # response (target, column 14)
Z <- evaluation[ , -14]    # evaluation-set predictors
nonbinary <- c(1:2, 4:13)  # column indices of the nonbinary predictors (excludes chas)

Data Exploration

Describe the size and the variables in the crime training data set. Consider that too much detail will cause a manager to lose interest, while too little detail will make the manager think you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss; these are just ideas.

  1. Mean / Standard Deviation / Median
  2. Bar Chart or Box Plot of the data
  3. Is the data correlated to the target variable (or to other variables?)
  4. Are any of the variables missing and need to be imputed/“fixed”?

Numerical Summaries

summary(training, digits = 2)
##        zn          indus            chas            nox      
##  Min.   :  0   Min.   : 0.46   Min.   :0.000   Min.   :0.39  
##  1st Qu.:  0   1st Qu.: 5.14   1st Qu.:0.000   1st Qu.:0.45  
##  Median :  0   Median : 9.69   Median :0.000   Median :0.54  
##  Mean   : 12   Mean   :11.11   Mean   :0.071   Mean   :0.55  
##  3rd Qu.: 16   3rd Qu.:18.10   3rd Qu.:0.000   3rd Qu.:0.62  
##  Max.   :100   Max.   :27.74   Max.   :1.000   Max.   :0.87  
##        rm           age             dis            rad            tax     
##  Min.   :3.9   Min.   :  2.9   Min.   : 1.1   Min.   : 1.0   Min.   :187  
##  1st Qu.:5.9   1st Qu.: 43.9   1st Qu.: 2.1   1st Qu.: 4.0   1st Qu.:281  
##  Median :6.2   Median : 77.2   Median : 3.2   Median : 5.0   Median :334  
##  Mean   :6.3   Mean   : 68.4   Mean   : 3.8   Mean   : 9.5   Mean   :410  
##  3rd Qu.:6.6   3rd Qu.: 94.1   3rd Qu.: 5.2   3rd Qu.:24.0   3rd Qu.:666  
##  Max.   :8.8   Max.   :100.0   Max.   :12.1   Max.   :24.0   Max.   :711  
##     ptratio       black            lstat           medv        target    
##  Min.   :13   Min.   :  0.32   Min.   : 1.7   Min.   : 5   Min.   :0.00  
##  1st Qu.:17   1st Qu.:375.61   1st Qu.: 7.0   1st Qu.:17   1st Qu.:0.00  
##  Median :19   Median :391.34   Median :11.3   Median :21   Median :0.00  
##  Mean   :18   Mean   :357.12   Mean   :12.6   Mean   :23   Mean   :0.49  
##  3rd Qu.:20   3rd Qu.:396.24   3rd Qu.:16.9   3rd Qu.:25   3rd Qu.:1.00  
##  Max.   :22   Max.   :396.90   Max.   :38.0   Max.   :50   Max.   :1.00
all(complete.cases(training))
## [1] TRUE

Looking at the data summaries for the training data set, we can see that no variables have missing data; each row contains a complete case. We can also see several skewed variables with maximums that are large relative to their means and medians, while other variables yield nearly identical means and medians.
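
As a quick numerical check on that skewness claim, the sample skewness of each nonbinary predictor can be computed directly. This is only a supplementary sketch; the e1071 package and its skewness() function are not otherwise part of this assignment.

library(e1071)
# Sample skewness of each nonbinary predictor (values far from 0 confirm the visual skew)
sapply(X[ , nonbinary], skewness)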

Variable Scatterplots

par(mfrow = c(3,4))
for (i in nonbinary) { 
  plot(X[,i], main = names(X[i])) 
}

Most of the nonbinary variables do not appear to be normally distributed or randomly scattered.
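
That visual impression can be backed by a formal (if rough) check: a Shapiro–Wilk test on each nonbinary column. This sketch is supplementary and its output is not reproduced here.

# Shapiro-Wilk p-values for each nonbinary predictor (tiny p-values reject normality)
sapply(X[ , nonbinary], function(v) shapiro.test(v)$p.value)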

Scatterplot Matrix

plot(X[,nonbinary])

The scatterplot matrix appears to indicate the existence of relationships between the nonbinary variables.

Histograms, Density Plots

par(mfrow = c(3,4))
for (i in nonbinary) {
  hist(X[ ,i], xlab = names(X[i]), main = names(X[i]))
  d <- density(X[,i])
  plot(d, main = names(X[i]))
  polygon(d, col="red")
}

The histograms and density plots give a better understanding of how the nonbinary data are distributed. The distribution of the variable rm could pass for some variation of a Gaussian or Student-\(t\) distribution. The variables zn, age, dis, ptratio, black, and lstat show heavy skewing. The variables indus, rad, and tax have multimodal distributions.

Correlation Heatmap

library(ggplot2)
library(reshape2)
ggplot(data = melt(abs(cor(training))), aes(x=Var1, y=Var2, fill=value)) + 
  scale_fill_gradient(low = 'black', high = 'red', name = "Absolute Value") +
  geom_tile() + labs(title = "Correlation Heatmap") + 
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(), 
        plot.title = element_text(hjust = 0.5))

There are high levels of correlation between multiple predictor variables and the response variable. The variables chas, rm, medv, and ptratio appear to have weak correlations with the response variable.

Principal Component Analysis

PCA <- function(X) {
  # Principal components on centered and scaled data
  Xpca <- prcomp(X, center = T, scale. = T)
  M <- as.matrix(X); R <- as.matrix(Xpca$rotation); score <- M %*% R  # scores from projecting the raw data onto the loadings
  print(list("Importance of Components" = summary(Xpca)$importance[ ,1:5], 
             "Rotation (Variable Loadings)" = Xpca$rotation[ ,1:5],
             "Correlation between X and PC" = cor(X, score)[ ,1:5]))
  par(mfrow=c(2,3))
  barplot(Xpca$sdev^2, ylab = "Component Variance")
  barplot(cor(cbind(X)), ylab = "Correlations")
  barplot(Xpca$rotation, ylab = "Loadings")  
  biplot(Xpca); barplot(M); barplot(score)
}
PCA(X[, nonbinary])
## $`Importance of Components`
##                             PC1      PC2      PC3       PC4       PC5
## Standard deviation     2.496896 1.236621 1.089458 0.9232161 0.8084919
## Proportion of Variance 0.519540 0.127440 0.098910 0.0710300 0.0544700
## Cumulative Proportion  0.519540 0.646980 0.745890 0.8169100 0.8713900
## 
## $`Rotation (Variable Loadings)`
##                PC1         PC2         PC3         PC4         PC5
## zn       0.2617869  0.10179109 -0.42024848  0.32932260 -0.44849584
## indus   -0.3445800 -0.10979620  0.01624666 -0.02562960 -0.14882012
## nox     -0.3331737 -0.26875570  0.07856781  0.20688255 -0.16935178
## rm       0.2130233 -0.53607535 -0.26097007 -0.19937642  0.13225427
## age     -0.3082582 -0.26412180  0.24864741  0.04212749 -0.01974695
## dis      0.3083629  0.37071984 -0.22879321  0.05489440 -0.05789147
## rad     -0.3031698 -0.07819297 -0.47705930 -0.17008639 -0.21060993
## tax     -0.3288806 -0.04433671 -0.41547848 -0.10233784 -0.28935174
## ptratio -0.2148019  0.33932419 -0.15768029 -0.67741655  0.23936038
## black    0.1959990  0.04073157  0.43670630 -0.45832596 -0.72554324
## lstat   -0.3232014  0.23280836  0.12543285  0.27134698 -0.11713364
## medv     0.2777568 -0.48135626 -0.05621664 -0.14684284 -0.03583048
## 
## $`Correlation between X and PC`
##                PC1         PC2        PC3          PC4         PC5
## zn       0.4573489  0.53867205  0.1746846  0.125418514 -0.08577190
## indus   -0.7858273 -0.73097510 -0.6210287 -0.020502887 -0.13891401
## nox     -0.7351319 -0.76768467 -0.5654169  0.052428758 -0.05776395
## rm       0.3483940  0.06549015  0.2288614  0.009029718  0.05338448
## age     -0.6375446 -0.81029567 -0.3794115 -0.002478123 -0.04751827
## dis      0.6350597  0.78248995  0.4167167  0.026349654  0.02748995
## rad     -0.8824983 -0.74300713 -0.8517591  0.011500273 -0.21941612
## tax     -0.9579570 -0.78307294 -0.9110464 -0.024360439 -0.29499715
## ptratio -0.4874748 -0.27110504 -0.3950028 -0.105792376 -0.12579812
## black    0.6207010  0.53932350  0.7456097 -0.870090627 -0.71581134
## lstat   -0.6602411 -0.45765729 -0.4962276  0.105613824 -0.02746172
## medv     0.5679821  0.21017878  0.4512497 -0.113777300 -0.00807784

Detailed above are the first five principal components, which account for the majority of the variation, as indicated by the Component Variance bar chart. The absolute magnitudes of the correlations between the variables and the first principal component suggest that the main underlying mechanisms explaining the variance of the data are, in order of magnitude, tax, rad, and indus, which focus on property values. The second, less important, principal component draws most of its variation, in order of magnitude, from age, tax, and dis, which hone in on the socioeconomic status of the community residents.
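
Those orderings can be verified by ranking the absolute correlations between the variables and the first two component scores; the sketch below rebuilds the scores the same way the PCA() helper above does.

# Rank variables by absolute correlation with the first two principal components
Xpca   <- prcomp(X[ , nonbinary], center = TRUE, scale. = TRUE)
score  <- as.matrix(X[ , nonbinary]) %*% Xpca$rotation
pc_cor <- abs(cor(X[ , nonbinary], score))
sort(pc_cor[ , 1], decreasing = TRUE)[1:3]  # leading variables for PC1
sort(pc_cor[ , 2], decreasing = TRUE)[1:3]  # leading variables for PC2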

Data Preparation

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.

  1. Fix missing values (maybe with a Mean or Median value)
  2. Create flags to suggest if a variable was missing
  3. Transform data by putting it into buckets
  4. Mathematical transforms such as log or square root (or, use Box-Cox)
  5. Combine variables (such as ratios or adding or multiplying) to create new variables

Missing Values

all(complete.cases(training))
## [1] TRUE

No variables have missing data. Each row of the training data set contains a complete case with no missing values. As a result, we do not need to develop methods for imputation going forward.
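
Had any values been missing, a simple median imputation would have been a reasonable first step. The helper below is a hypothetical sketch and is never actually called, since the data are complete.

# Hypothetical median-imputation helper (unused -- there are no missing values)
impute_median <- function(df) {
  for (col in names(df)) {
    miss <- is.na(df[[col]])
    if (any(miss)) df[[col]][miss] <- median(df[[col]], na.rm = TRUE)
  }
  df
}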

Correlations

Between \(Y\) and \(X\) Variables

corr_XY <- data.frame(array(NA, dim = c(ncol(X), 5)))
colnames(corr_XY) <- c("Y", "X", "r","p","<0.05")
for (i in 1:ncol(X)) {
  r <- cor.test(Y, X[,i])
  corr_XY[i, 1] <- "target"
  corr_XY[i, 2] <- names(X[i])
  corr_XY[i, 3] <- r$estimate
  corr_XY[i, 4] <- r$p.value
  corr_XY[i, 5] <- corr_XY[i, 4] < 0.05
}; corr_XY
##         Y       X           r            p <0.05
## 1  target      zn -0.43168176 1.415603e-22  TRUE
## 2  target   indus  0.60485074 7.845651e-48  TRUE
## 3  target    chas  0.08004187 8.434811e-02 FALSE
## 4  target     nox  0.72610622 1.680131e-77  TRUE
## 5  target      rm -0.15255334 9.542249e-04  TRUE
## 6  target     age  0.63010625 6.245736e-53  TRUE
## 7  target     dis -0.61867312 1.450176e-50  TRUE
## 8  target     rad  0.62810492 1.647597e-52  TRUE
## 9  target     tax  0.61111331 4.709566e-49  TRUE
## 10 target ptratio  0.25084892 4.052730e-08  TRUE
## 11 target   black -0.35295680 4.062541e-15  TRUE
## 12 target   lstat  0.46912702 7.071431e-27  TRUE
## 13 target    medv -0.27055071 2.924931e-09  TRUE

There is significant correlation between the response variable and nearly all of the predictor variables at an \(\alpha=0.05\) significance level. The predictor variable chas does not have statistically significant correlation with the response variable and is therefore not being considered for the model.

X <- X[ , -which(names(X) %in% "chas")]
Z <- Z[ , -which(names(Z) %in% "chas")]

Between \(X\) Variables

corr_XX <- data.frame(array(NA, dim = c(choose(ncol(X), 2), 5)))
colnames(corr_XX) <- c("X1", "X2", "r","p","<0.05"); k = 1
for (i in 1:(ncol(X) - 1)) {
  for (j in (i+1):ncol(X)) {
    r <- cor.test(X[,i], X[,j])
    corr_XX[k, 1] <- names(X[i])
    corr_XX[k, 2] <- names(X[j])
    corr_XX[k, 3] <- r$estimate
    corr_XX[k, 4] <- r$p.value
    corr_XX[k, 5] <- corr_XX[k, 4] < 0.05
    k = k + 1
  }
}; head(corr_XX); all(corr_XX[,5])
##   X1    X2          r            p <0.05
## 1 zn indus -0.5382664 2.319176e-36  TRUE
## 2 zn   nox -0.5170452 3.238655e-33  TRUE
## 3 zn    rm  0.3198141 1.528592e-12  TRUE
## 4 zn   age -0.5725805 6.040999e-42  TRUE
## 5 zn   dis  0.6601243 1.220397e-59  TRUE
## 6 zn   rad -0.3154812 3.151098e-12  TRUE
## [1] TRUE

There is significant correlation between nearly all of the remaining predictor variables at an \(\alpha=0.05\) significance level. These correlations range from weak to moderate to strong.

correlations <- function(data, threshold) {
  # List variable pairs whose absolute correlation exceeds the threshold
  columns <- ncol(data)
  cors <- round(cor(data[,-columns]), 5)
  cors[upper.tri(cors, diag = T)] <- NA   # keep each pair only once
  cors <- na.omit(melt(cors))
  abs_cors <- abs(cors["value"])
  cors[abs_cors < 1 & abs_cors > threshold, ]
}
correlations(X, 0.7)
##    Var1  Var2    value
## 14  nox indus  0.75963
## 17  dis indus -0.70362
## 19  tax indus  0.73223
## 27  age   nox  0.73513
## 28  dis   nox -0.76888
## 50  dis   age -0.75090
## 74  tax   rad  0.90646

The variables nox, indus, dis, age, and medv have strong correlations with multiple variables, but the highest correlation, 0.9064632, is between tax and rad. This correlation is high enough that it makes sense to exclude one of the two variables from the model. Given that \(cor(\textrm{target},\textrm{rad})=0.6281049 > cor(\textrm{target}, \textrm{tax})=0.6111133\), it is tax, the predictor less correlated with the response variable, that will not be considered for the model.

X <- X[ , -which(names(X) %in% "tax")]
Z <- Z[ , -which(names(Z) %in% "tax")]
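
As a complementary collinearity check on the predictors kept so far, variance inflation factors can be computed from a working linear fit. The sketch below uses car::vif() and is not part of the variable-selection procedure itself.

# Variance inflation factors for the remaining predictors
# (values well above 5-10 would flag lingering collinearity)
library(car)
vif(lm(Y ~ ., data = X))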

Dependence

depend <- data.frame(array(NA, dim = c(ncol(X), 5)))
colnames(depend) <- c("Y", "X", "chi-stat", "Chi-test","<0.05")
for (i in 1:ncol(X)) {
  chi <- chisq.test(table(Y, X[,i]), simulate.p.value = TRUE)
  depend[i, 1] <- "target"
  depend[i, 2] <- names(X[i])
  depend[i, 3] <- chi$statistic
  depend[i, 4] <- chi$p.value
  depend[i, 5] <- depend[i, 4] < 0.05
}; depend
##         Y       X chi-stat     Chi-test <0.05
## 1  target      zn 124.2333 0.0004997501  TRUE
## 2  target   indus 423.2214 0.0004997501  TRUE
## 3  target     nox 418.0914 0.0004997501  TRUE
## 4  target      rm 433.3237 0.0194902549  TRUE
## 5  target     age 405.7441 0.0004997501  TRUE
## 6  target     dis 455.9971 0.0004997501  TRUE
## 7  target     rad 229.9859 0.0004997501  TRUE
## 8  target ptratio 360.8258 0.0004997501  TRUE
## 9  target   black 342.6660 0.0064967516  TRUE
## 10 target   lstat 415.9853 0.9260369815 FALSE
## 11 target    medv 295.4259 0.0004997501  TRUE

All but one of the \(X\) variables show evidence of statistically significant dependence with \(Y\). The variable lstat, which is not statistically significantly dependent, will not be included in the model. It is worth noting that the function simulated p-values because very small expected cell counts posed the potential for producing incorrect approximations of p. This simulation led to duplicate p-values in some cases, but the \(\chi^2\) test statistics remained unique.

X <- X[ , -which(names(X) %in% "lstat")]
Z <- Z[ , -which(names(Z) %in% "lstat")]
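
One way to sidestep the small expected cell counts noted above is to bin a continuous predictor before testing. The quartile-based binning below is only a sketch of that alternative, shown for nox; it is not the procedure used above.

# Chi-square test on a quartile-binned predictor (raises the expected cell counts)
nox_bins <- cut(X[ , "nox"], breaks = quantile(X[ , "nox"]), include.lowest = TRUE)
chisq.test(table(Y, nox_bins))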

Logarithmic Transformation

library(MASS)
potential <- match(c("zn","dis","medv"), names(X))
lambda <- numeric(length(potential))
par(mfrow=c(1,3))
for (j in seq_along(potential)) {
  i <- potential[j]
  shifted <- X[,i] - min(X[,i]) + 1e-32       # shift values to just above zero
  fit_exp <- fitdistr(shifted, "Exponential")
  lambda[j] <- fit_exp$estimate               # estimated exponential rate for this variable
  exp_draws <- rexp(1000, lambda[j])
  hist(X[,i], prob=TRUE, col="grey", main = names(X[i]), 
       xlab=paste("Lambda =", fractions(lambda[j])))
  lines(density(exp_draws), col="blue", lwd=2)
}

Only two of the three potential variables lend themselves toward modeling with an exponential distribution. The variables were shifted to slightly above zero by subtracting the minimum value and then adding \(10^{-32}\). This also shifts data with a negative minimum in the appropriate direction, since subtracting a negative minimum is equivalent to adding its absolute value.

X[, "log_zn"] <- log(X[, "zn"] - min(X[, "zn"]) + 1e-32, lambda[1])
X[, "log_dis"] <- log(X[, "dis"] - min(X[, "dis"]) + 1e-32, lambda[2])
Z[, "log_zn"] <- log(Z[, "zn"] - min(Z[, "zn"]) + 1e-32, lambda[1])
Z[, "log_dis"] <- log(Z[, "dis"] - min(Z[, "dis"]) + 1e-32, lambda[2])

Categorization of Multimodal Data

par(mfrow=c(1,2))
plot(X[,"indus"]); abline(h = 15, col="red"); abline(h = 18, col="blue")
plot(X[,"rad"]); abline(h = 8, col="red"); abline(h = 24, col="blue")

The predictor variables indus and rad have bimodal distributions. There are clear lines of demarcation in the values that we can use to bifurcate each variable into two categories.

X[,"cat_indus"] <- ifelse(X[,"indus"] < (15 + 18) / 2, 0, 1)
X[,"cat_rad"] <- ifelse(X[,"rad"] < (8 + 24) / 2, 0, 1)
Z[,"cat_indus"] <- ifelse(Z[,"indus"] < (15 + 18) / 2, 0, 1)
Z[,"cat_rad"] <- ifelse(Z[,"rad"] < (8 + 24) / 2, 0, 1)

Box-Cox Transformation

library(car)
potential <- match(c("nox","rm","age","ptratio","black","medv"), names(X))
box.cox.powers <- powerTransform(X[, potential], family="bcPower")
summary(box.cox.powers)
## bcPower Transformations to Multinormality 
##         Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
## nox       -1.4797   0.1966          -1.8650          -1.0944
## rm         0.9578   0.2409           0.4857           1.4299
## age        1.4414   0.0895           1.2658           1.6169
## ptratio    4.3789   0.4112           3.5729           5.1849
## black      3.5877   0.1767           3.2414           3.9341
## medv       0.3632   0.0684           0.2291           0.4974
## 
## Likelihood ratio tests about transformation parameters
##                                                          LRT df      pval
## LR test, lambda = (0 0 0 0 0 0)                 2010.4335224  6 0.0000000
## LR test, lambda = (1 1 1 1 1 1)                  770.6317985  6 0.0000000
## LR test, lambda = (-1.48 1 1.44 4.38 3.59 0.33)    0.3049445  6 0.9994728

Five of the six potential variables lend themselves toward a Box-Cox transformation. The variable rm has an estimated power very near one, which indicates that no transformation is needed; this is further supported by the Wald lower and upper bounds, which hover around one.

X[, "box_nox"] <- X[, "nox"]^(box.cox.powers$lambda[1])
X[, "box_age"] <- X[, "age"]^(box.cox.powers$lambda[3])
X[, "box_ptratio"] <- X[, "ptratio"]^(box.cox.powers$lambda[4])
X[, "box_black"] <- X[, "black"]^(box.cox.powers$lambda[5])
X[, "box_medv"] <- X[, "medv"]^(box.cox.powers$lambda[6])
Z[, "box_nox"] <- Z[, "nox"]^(box.cox.powers$lambda[1])
Z[, "box_age"] <- Z[, "age"]^(box.cox.powers$lambda[3])
Z[, "box_ptratio"] <- Z[, "ptratio"]^(box.cox.powers$lambda[4])
Z[, "box_black"] <- Z[, "black"]^(box.cox.powers$lambda[5])
Z[, "box_medv"] <- Z[, "medv"]^(box.cox.powers$lambda[6])

Build Models

Using the training data, build at least three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion in or exclusion from the model, indicate why this was done.

Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

Forward Selection

Forward stepwise subset selection based on \(AIC\). Using \(k = 2\) degrees of freedom for the penalty gives the genuine \(AIC\); using \(k = \log(n)\) is sometimes referred to as BIC or SBC.

# Modeling frame: rm plus the derived variables, with the response appended
training <- data.frame(X[,c(4, 11:19)], "target" = Y)
null <- glm(target ~ 0, family = binomial(link = "logit"), training)  # intercept-free null model
full <- glm(target ~ ., family = binomial(link = "logit"), training)  # all candidate predictors
aic_steps <- step(null, scope=list(lower=null, upper=full), direction="forward", k = 2)
## Start:  AIC=646.01
## target ~ 0
## 
##               Df Deviance    AIC
## + cat_rad      1   478.27 480.27
## + cat_indus    1   488.09 490.09
## + box_age      1   607.52 609.52
## + log_zn       1   619.68 621.68
## + box_nox      1   622.82 624.82
## + box_black    1   635.03 637.03
## + box_ptratio  1   639.78 641.78
## <none>             646.01 646.01
## + box_medv     1   644.03 646.03
## + rm           1   645.47 647.47
## 
## Step:  AIC=480.27
## target ~ cat_rad - 1
## 
##               Df Deviance    AIC
## + box_nox      1   388.35 392.35
## + box_black    1   412.54 416.54
## + box_medv     1   427.18 431.18
## + rm           1   428.57 432.57
## + box_ptratio  1   441.94 445.94
## + cat_indus    1   459.85 463.85
## + log_zn       1   474.90 478.90
## <none>             478.27 480.27
## + box_age      1   476.99 480.99
## 
## Step:  AIC=392.35
## target ~ cat_rad + box_nox - 1
## 
##               Df Deviance    AIC
## + rm           1   259.19 265.19
## + box_medv     1   266.72 272.72
## + box_age      1   278.80 284.80
## + log_zn       1   328.91 334.91
## + cat_indus    1   338.87 344.87
## + box_black    1   362.75 368.75
## + box_ptratio  1   364.81 370.81
## <none>             388.35 392.35
## 
## Step:  AIC=265.19
## target ~ cat_rad + box_nox + rm - 1
## 
##               Df Deviance    AIC
## + box_age      1   251.68 259.68
## + box_ptratio  1   251.90 259.90
## + log_zn       1   252.43 260.43
## + box_black    1   255.18 263.18
## <none>             259.19 265.19
## + box_medv     1   258.76 266.76
## + cat_indus    1   259.18 267.18
## 
## Step:  AIC=259.68
## target ~ cat_rad + box_nox + rm + box_age - 1
## 
##               Df Deviance    AIC
## + box_black    1   245.84 255.84
## + box_ptratio  1   248.56 258.56
## + log_zn       1   248.93 258.93
## + box_medv     1   249.06 259.06
## <none>             251.68 259.68
## + cat_indus    1   250.82 260.82
## 
## Step:  AIC=255.83
## target ~ cat_rad + box_nox + rm + box_age + box_black - 1
## 
##               Df Deviance    AIC
## + box_medv     1   240.95 252.95
## + box_ptratio  1   241.32 253.32
## + log_zn       1   242.41 254.41
## <none>             245.84 255.84
## + cat_indus    1   244.70 256.70
## 
## Step:  AIC=252.95
## target ~ cat_rad + box_nox + rm + box_age + box_black + box_medv - 
##     1
## 
##               Df Deviance    AIC
## + box_ptratio  1   231.56 245.56
## + log_zn       1   237.83 251.83
## <none>             240.95 252.95
## + cat_indus    1   239.36 253.36
## 
## Step:  AIC=245.55
## target ~ cat_rad + box_nox + rm + box_age + box_black + box_medv + 
##     box_ptratio - 1
## 
##             Df Deviance    AIC
## + cat_indus  1   227.67 243.67
## <none>           231.56 245.56
## + log_zn     1   231.47 247.47
## 
## Step:  AIC=243.67
## target ~ cat_rad + box_nox + rm + box_age + box_black + box_medv + 
##     box_ptratio + cat_indus - 1
## 
##          Df Deviance    AIC
## <none>        227.67 243.67
## + log_zn  1   227.57 245.57

The model target ~ cat_rad + box_nox + rm + box_age + box_black + box_medv + box_ptratio + cat_indus - 1 has the lowest AIC of 243.67.
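
As a cross-check against the \(k = \log(n)\) penalty mentioned earlier, the same forward search can be rerun BIC-style. This sketch stores the result in a new object, bic_steps, and its (suppressed) trace is not reproduced here.

# BIC-flavored forward selection: heavier penalty, k = log(n) instead of k = 2
bic_steps <- step(null, scope = list(lower = null, upper = full),
                  direction = "forward", k = log(nrow(training)), trace = 0)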

forward <- glm(target ~ cat_rad + box_nox + rm + box_age + box_black + box_medv +
             box_ptratio + cat_indus - 1, family = binomial(link = "logit"), training)
summary(forward)
## 
## Call:
## glm(formula = target ~ cat_rad + box_nox + rm + box_age + box_black + 
##     box_medv + box_ptratio + cat_indus - 1, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.53887  -0.28220  -0.03420   0.00018   2.94797  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## cat_rad      1.882e+01  8.365e+02   0.022 0.982050    
## box_nox     -3.580e+00  5.755e-01  -6.221 4.93e-10 ***
## rm          -2.151e-01  5.025e-01  -0.428 0.668554    
## box_age      3.141e-03  1.007e-03   3.120 0.001810 ** 
## box_black   -1.747e-09  5.458e-10  -3.201 0.001368 ** 
## box_medv     3.404e+00  1.025e+00   3.320 0.000899 ***
## box_ptratio  3.704e-06  1.131e-06   3.276 0.001053 ** 
## cat_indus   -1.074e+00  5.457e-01  -1.967 0.049133 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 646.01  on 466  degrees of freedom
## Residual deviance: 227.67  on 458  degrees of freedom
## AIC: 243.67
## 
## Number of Fisher Scoring iterations: 18

Using the 8 forward-selected variables leads to a model with two insignificant variables: cat_rad and rm. Removing those two insignificant variables causes cat_indus to become significant only at a level of \(\alpha=0.1\). If cat_indus is also removed, all remaining variables become significant at a level of \(\alpha=0.01\) or lower.
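
For reference, the fully culled model described above would be fit as in the sketch below; its summary is not reproduced here.

# Forward model after dropping the insignificant cat_rad, rm, and cat_indus
culled <- glm(target ~ box_nox + box_age + box_black + box_medv + box_ptratio - 1,
              family = binomial(link = "logit"), training)
summary(culled)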

Backward Elimination

Backward stepwise subset elimination based on \(AIC\). Using \(k = 2\) degrees of freedom for the penalty gives the genuine \(AIC\); using \(k = \log(n)\) is sometimes referred to as BIC or SBC.

aic_steps <- step(full, scope=list(lower=null, upper=full), direction="backward", k = 2)
## Start:  AIC=246.2
## target ~ rm + log_zn + log_dis + cat_indus + cat_rad + box_nox + 
##     box_age + box_ptratio + box_black + box_medv
## 
## 
## Step:  AIC=246.2
## target ~ rm + log_zn + cat_indus + cat_rad + box_nox + box_age + 
##     box_ptratio + box_black + box_medv
## 
##               Df Deviance    AIC
## - log_zn       1   226.20 244.20
## - rm           1   226.36 244.36
## <none>             226.20 246.20
## - box_age      1   231.13 249.13
## - cat_indus    1   231.25 249.25
## - box_ptratio  1   232.02 250.02
## - box_medv     1   233.86 251.86
## - box_black    1   240.63 258.63
## - cat_rad      1   258.81 276.81
## - box_nox      1   280.14 298.14
## 
## Step:  AIC=244.2
## target ~ rm + cat_indus + cat_rad + box_nox + box_age + box_ptratio + 
##     box_black + box_medv
## 
##               Df Deviance    AIC
## - rm           1   226.36 242.36
## <none>             226.20 244.20
## - box_age      1   231.24 247.24
## - cat_indus    1   231.29 247.29
## - box_ptratio  1   233.09 249.09
## - box_medv     1   234.11 250.11
## - box_black    1   240.86 256.86
## - cat_rad      1   258.83 274.83
## - box_nox      1   282.45 298.45
## 
## Step:  AIC=242.36
## target ~ cat_indus + cat_rad + box_nox + box_age + box_ptratio + 
##     box_black + box_medv
## 
##               Df Deviance    AIC
## <none>             226.36 242.36
## - cat_indus    1   231.43 245.43
## - box_age      1   231.49 245.49
## - box_ptratio  1   233.16 247.16
## - box_black    1   240.86 254.86
## - box_medv     1   247.76 261.76
## - cat_rad      1   259.85 273.85
## - box_nox      1   284.52 298.52

The model target ~ cat_indus + cat_rad + box_nox + box_age + box_ptratio + box_black + box_medv has the lowest AIC of 242.36.

backward <- glm(target ~ cat_indus + cat_rad + box_nox + box_age + box_ptratio + 
             box_black + box_medv, family = binomial(link = "logit"), training)
summary(backward)
## 
## Call:
## glm(formula = target ~ cat_indus + cat_rad + box_nox + box_age + 
##     box_ptratio + box_black + box_medv, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.69233  -0.26840  -0.03364   0.00017   2.98810  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.140e+00  2.604e+00   1.206  0.22799    
## cat_indus   -1.299e+00  5.814e-01  -2.234  0.02547 *  
## cat_rad      1.865e+01  8.349e+02   0.022  0.98218    
## box_nox     -3.927e+00  6.446e-01  -6.091 1.12e-09 ***
## box_age      2.346e-03  1.050e-03   2.235  0.02541 *  
## box_ptratio  3.022e-06  1.176e-06   2.569  0.01020 *  
## box_black   -1.965e-09  6.169e-10  -3.185  0.00145 ** 
## box_medv     2.610e+00  6.155e-01   4.241 2.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 226.36  on 458  degrees of freedom
## AIC: 242.36
## 
## Number of Fisher Scoring iterations: 18

Using the 7 variables retained by Backward Elimination (plus the intercept) leads to a model with two insignificant terms: cat_rad and the intercept. Removing those two insignificant terms causes cat_indus to become significant only at a level of \(\alpha=0.1\). If cat_indus is also removed, all remaining variables become significant at a level of \(\alpha=0.01\) or lower. The model derived after culling insignificant variables from the Backward Elimination is the same model derived when insignificant variables are culled from the Forward Selection process.

Adjusted \(R^2\)

library(leaps)
(model_sum <- summary(regsubsets(target ~ ., training)))
## Reordering variables and trying again:
## Subset selection object
## Call: regsubsets.formula(target ~ ., training)
## 10 Variables  (and intercept)
##             Forced in Forced out
## rm              FALSE      FALSE
## log_zn          FALSE      FALSE
## cat_indus       FALSE      FALSE
## cat_rad         FALSE      FALSE
## box_nox         FALSE      FALSE
## box_age         FALSE      FALSE
## box_ptratio     FALSE      FALSE
## box_black       FALSE      FALSE
## box_medv        FALSE      FALSE
## log_dis         FALSE      FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
##          rm  log_zn log_dis cat_indus cat_rad box_nox box_age box_ptratio
## 1  ( 1 ) " " " "    " "     " "       " "     "*"     " "     " "        
## 2  ( 1 ) " " " "    " "     " "       "*"     "*"     " "     " "        
## 3  ( 1 ) " " " "    " "     " "       "*"     "*"     " "     " "        
## 4  ( 1 ) " " " "    " "     " "       "*"     "*"     "*"     " "        
## 5  ( 1 ) " " " "    " "     " "       "*"     "*"     "*"     " "        
## 6  ( 1 ) " " " "    " "     " "       "*"     "*"     "*"     "*"        
## 7  ( 1 ) " " " "    " "     "*"       "*"     "*"     "*"     "*"        
## 8  ( 1 ) "*" " "    " "     "*"       "*"     "*"     "*"     "*"        
## 9  ( 1 ) "*" "*"    " "     "*"       "*"     "*"     "*"     "*"        
##          box_black box_medv
## 1  ( 1 ) " "       " "     
## 2  ( 1 ) " "       " "     
## 3  ( 1 ) " "       "*"     
## 4  ( 1 ) " "       "*"     
## 5  ( 1 ) "*"       "*"     
## 6  ( 1 ) "*"       "*"     
## 7  ( 1 ) "*"       "*"     
## 8  ( 1 ) "*"       "*"     
## 9  ( 1 ) "*"       "*"
plot(model_sum$adjr2, xlab = "Number of Variables", ylab = "Adj R-squared")

max(model_sum$adjr2)
## [1] 0.6221241
which.max(model_sum$adjr2)
## [1] 7

The maximum Adjusted \(R^2\) of 0.6221241 is reached when the model contains 7 variables. The variables in this model are nearly identical to those in the model derived using Backward Elimination; the sole difference is that this model explicitly specifies the intercept.

model_sum$which[which.max(model_sum$adjr2), ]
## (Intercept)          rm      log_zn     log_dis   cat_indus     cat_rad 
##        TRUE       FALSE       FALSE       FALSE        TRUE        TRUE 
##     box_nox     box_age box_ptratio   box_black    box_medv 
##        TRUE        TRUE        TRUE        TRUE        TRUE
adjustedr2 <- glm(target ~ 1 + cat_indus + cat_rad + box_nox + box_age + 
            box_ptratio + box_black + box_medv, family = binomial(link = "logit"), training)
summary(adjustedr2)
## 
## Call:
## glm(formula = target ~ 1 + cat_indus + cat_rad + box_nox + box_age + 
##     box_ptratio + box_black + box_medv, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.69233  -0.26840  -0.03364   0.00017   2.98810  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.140e+00  2.604e+00   1.206  0.22799    
## cat_indus   -1.299e+00  5.814e-01  -2.234  0.02547 *  
## cat_rad      1.865e+01  8.349e+02   0.022  0.98218    
## box_nox     -3.927e+00  6.446e-01  -6.091 1.12e-09 ***
## box_age      2.346e-03  1.050e-03   2.235  0.02541 *  
## box_ptratio  3.022e-06  1.176e-06   2.569  0.01020 *  
## box_black   -1.965e-09  6.169e-10  -3.185  0.00145 ** 
## box_medv     2.610e+00  6.155e-01   4.241 2.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 226.36  on 458  degrees of freedom
## AIC: 242.36
## 
## Number of Fisher Scoring iterations: 18

Mallows \(C_p\)

plot(model_sum$cp, xlab = "Number of Variables", ylab = "Mallows Cp")

min(model_sum$cp)
## [1] 5.378017
which.min(model_sum$cp)
## [1] 6

The minimum Mallows \(C_p\) of 5.3780169 is reached when the model contains 6 variables.
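
For reference, Mallows \(C_p\) for a candidate model with \(p\) coefficients is

\[ C_p = \frac{SSE_p}{\hat{\sigma}^2} - n + 2p, \]

where \(SSE_p\) is the residual sum of squares of the candidate model and \(\hat{\sigma}^2\) is the error-variance estimate from the full model; values of \(C_p\) close to \(p\) suggest little bias.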

model_sum$which[which.min(model_sum$cp), ]
## (Intercept)          rm      log_zn     log_dis   cat_indus     cat_rad 
##        TRUE       FALSE       FALSE       FALSE       FALSE        TRUE 
##     box_nox     box_age box_ptratio   box_black    box_medv 
##        TRUE        TRUE        TRUE        TRUE        TRUE
mallowscp <- glm(target ~ 1 + cat_rad + box_nox + box_age + box_ptratio + 
             box_black + box_medv, family = binomial(link = "logit"), training)
summary(mallowscp)
## 
## Call:
## glm(formula = target ~ 1 + cat_rad + box_nox + box_age + box_ptratio + 
##     box_black + box_medv, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.43901  -0.32234  -0.05351   0.00018   2.81144  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.303e+00  2.468e+00   0.528  0.59750    
## cat_rad      1.793e+01  8.530e+02   0.021  0.98323    
## box_nox     -3.262e+00  5.210e-01  -6.261 3.82e-10 ***
## box_age      1.987e-03  1.019e-03   1.951  0.05110 .  
## box_ptratio  2.823e-06  1.184e-06   2.384  0.01713 *  
## box_black   -1.730e-09  5.872e-10  -2.947  0.00321 ** 
## box_medv     2.524e+00  6.041e-01   4.178 2.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 231.43  on 459  degrees of freedom
## AIC: 245.43
## 
## Number of Fisher Scoring iterations: 18

Select Models

Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.

For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision (Pos Pred Value), (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix.

Confusion Matrix

library(caret)
training[ ,"probability.forward"] <- predict(forward, training, type="response")
training[ ,"class.forward"] <- ifelse(training$probability.forward < 0.5, 0, 1)
(cm1 <- confusionMatrix(training$class.forward, training$target, positive = "1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 211  32
##          1  26 197
##                                           
##                Accuracy : 0.8755          
##                  95% CI : (0.8421, 0.9041)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7509          
##  Mcnemar's Test P-Value : 0.5115          
##                                           
##             Sensitivity : 0.8603          
##             Specificity : 0.8903          
##          Pos Pred Value : 0.8834          
##          Neg Pred Value : 0.8683          
##              Prevalence : 0.4914          
##          Detection Rate : 0.4227          
##    Detection Prevalence : 0.4785          
##       Balanced Accuracy : 0.8753          
##                                           
##        'Positive' Class : 1               
## 

The model derived using Forward Selection has the following performance metrics: Accuracy of 0.8755365, Error Rate of 0.1244635, Precision of 0.8834081, Sensitivity of 0.860262, Specificity of 0.8902954, and \(F_1\) Score of 0.8716814.
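
Since the caret printout above does not list the Error Rate or \(F_1\) Score directly, those two figures were derived from the confusion-matrix counts; a sketch of the arithmetic using the cm1 object:

# Derive Error Rate, Precision, Sensitivity, and F1 from the cm1 confusion matrix
TP <- cm1$table["1", "1"]; FP <- cm1$table["1", "0"]; FN <- cm1$table["0", "1"]
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
c(ErrorRate   = 1 - unname(cm1$overall["Accuracy"]),
  Precision   = precision, Sensitivity = recall,
  F1          = 2 * precision * recall / (precision + recall))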

training[ ,"probability.backward"] <- predict(backward, training, type="response")
training[ ,"class.backward"] <- ifelse(training$probability.backward < 0.5, 0, 1)
(cm2 <- confusionMatrix(training$class.backward, training$target, positive = "1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 213  33
##          1  24 196
##                                          
##                Accuracy : 0.8777         
##                  95% CI : (0.8444, 0.906)
##     No Information Rate : 0.5086         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7551         
##  Mcnemar's Test P-Value : 0.2893         
##                                          
##             Sensitivity : 0.8559         
##             Specificity : 0.8987         
##          Pos Pred Value : 0.8909         
##          Neg Pred Value : 0.8659         
##              Prevalence : 0.4914         
##          Detection Rate : 0.4206         
##    Detection Prevalence : 0.4721         
##       Balanced Accuracy : 0.8773         
##                                          
##        'Positive' Class : 1              
## 

The model derived using Backward Elimination has the following performance metrics: Accuracy of 0.8776824, Error Rate of 0.1223176, Precision of 0.8909091, Sensitivity of 0.8558952, Specificity of 0.8987342, and \(F_1\) Score of 0.8730512.

training[ ,"probability.adjustedr2"] <- predict(adjustedr2, training, type="response")
training[ ,"class.adjustedr2"] <- ifelse(training$probability.adjustedr2 < 0.5, 0, 1)
(cm3 <- confusionMatrix(training$class.adjustedr2, training$target, positive = "1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 213  33
##          1  24 196
##                                          
##                Accuracy : 0.8777         
##                  95% CI : (0.8444, 0.906)
##     No Information Rate : 0.5086         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7551         
##  Mcnemar's Test P-Value : 0.2893         
##                                          
##             Sensitivity : 0.8559         
##             Specificity : 0.8987         
##          Pos Pred Value : 0.8909         
##          Neg Pred Value : 0.8659         
##              Prevalence : 0.4914         
##          Detection Rate : 0.4206         
##    Detection Prevalence : 0.4721         
##       Balanced Accuracy : 0.8773         
##                                          
##        'Positive' Class : 1              
## 

The model derived using Adjusted \(R^2\) has the following performance metrics: Accuracy of 0.8776824, Error Rate of 0.1223176, Precision of 0.8909091, Sensitivity of 0.8558952, Specificity of 0.8987342, and \(F_1\) Score of 0.8730512. These metrics are identical to those from Backward Elimination since, as previously mentioned, both models are nearly identical except for the intercept.

training[ ,"probability.mallowscp"] <- predict(mallowscp, training, type="response")
training[ ,"class.mallowscp"] <- ifelse(training$probability.mallowscp < 0.5, 0, 1)
(cm4 <- confusionMatrix(training$class.mallowscp, training$target, positive = "1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 215  36
##          1  22 193
##                                           
##                Accuracy : 0.8755          
##                  95% CI : (0.8421, 0.9041)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7507          
##  Mcnemar's Test P-Value : 0.08783         
##                                           
##             Sensitivity : 0.8428          
##             Specificity : 0.9072          
##          Pos Pred Value : 0.8977          
##          Neg Pred Value : 0.8566          
##              Prevalence : 0.4914          
##          Detection Rate : 0.4142          
##    Detection Prevalence : 0.4614          
##       Balanced Accuracy : 0.8750          
##                                           
##        'Positive' Class : 1               
## 

The model derived using Mallows \(C_p\) has the following performance metrics: Accuracy of 0.8755365, Error Rate of 0.1244635, Precision of 0.8976744, Sensitivity of 0.8427948, Specificity of 0.907173, and \(F_1\) Score of 0.8693694. This model has the same Accuracy as the model derived using Forward Selection, with better Precision, Specificity, and \(F_1\) Score values; it does, however, have a lower Sensitivity.

ROC Curve

library(pROC)
par(mfrow=c(2,2))
plot(roc(training$target, training$class.forward, smooth=F), print.auc=TRUE)
plot(roc(training$target, training$class.backward, smooth=F), print.auc=TRUE)
plot(roc(training$target, training$probability.adjustedr2, smooth=F), print.auc=TRUE)
plot(roc(training$target, training$probability.mallowscp, smooth=F), print.auc=TRUE)

The models with the highest Accuracy were the Backward Elimination and the Adjusted \(R^2\) models. These models were nearly identical except that the latter explicitly calls for the intercept. The latter also shows a greater Area Under the ROC Curve in the plots above (its curve is built from predicted probabilities rather than hard 0/1 classifications), which was the reason for selecting it as the final model.
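
The areas under those curves can also be pulled out numerically with pROC's auc() function; a sketch using the same inputs as the plots above:

# AUC for each candidate model, using the same predictor columns as the ROC plots
c(forward    = auc(training$target, training$class.forward),
  backward   = auc(training$target, training$class.backward),
  adjustedr2 = auc(training$target, training$probability.adjustedr2),
  mallowscp  = auc(training$target, training$probability.mallowscp))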

Predictions

Make predictions using the evaluation data set.

evaluation <- Z[,c(4, 11:19)]  # evaluation-set counterpart of the modeling frame
evaluation[, "pred.prob"] <- predict(adjustedr2, evaluation, type="response")
evaluation[, "pred.class"] <- ifelse(evaluation[, "pred.prob"] >= 0.5, 1, 0)
evaluation[, c("pred.prob", "pred.class")]
##      pred.prob pred.class
## 1  0.163154389          0
## 2  0.629665732          1
## 3  0.629928820          1
## 4  0.929447056          1
## 5  0.088687899          0
## 6  0.025469918          0
## 7  0.055673427          0
## 8  0.038245612          0
## 9  0.020602008          0
## 10 0.020450669          0
## 11 0.535389993          1
## 12 0.545614949          1
## 13 0.567608202          1
## 14 0.964682451          1
## 15 0.674913590          1
## 16 0.180760734          0
## 17 0.253280855          0
## 18 0.814707701          1
## 19 0.017192104          0
## 20 0.001858639          0
## 21 0.001836118          0
## 22 0.106569894          0
## 23 0.295344006          0
## 24 0.128110066          0
## 25 0.087140207          0
## 26 0.190403073          0
## 27 0.007822306          0
## 28 1.000000000          1
## 29 1.000000000          1
## 30 1.000000000          1
## 31 1.000000000          1
## 32 0.999999998          1
## 33 0.999999997          1
## 34 0.999999999          1
## 35 1.000000000          1
## 36 1.000000000          1
## 37 0.999999999          1
## 38 0.999999997          1
## 39 0.541989846          1
## 40 0.851430890          1
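
For submission, the assigned probabilities and classifications can be exported to a file; the file name below is only a placeholder.

# Export evaluation-set predictions (placeholder file name)
write.csv(evaluation[ , c("pred.prob", "pred.class")],
          "crime-evaluation-predictions.csv", row.names = FALSE)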

Prevalence

n <- sum(training$target)
N <- nrow(training)
m <- sum(evaluation$pred.class)
M <- nrow(evaluation)
binom.test(m, M, n / N)
## 
##  Exact binomial test
## 
## data:  m and M
## number of successes = 22, number of trials = 40, p-value = 0.528
## alternative hypothesis: true probability of success is not equal to 0.4914163
## 95 percent confidence interval:
##  0.3849068 0.7074116
## sample estimates:
## probability of success 
##                   0.55

The prevalence of the positive condition is 49.14% in the training data and 55% in the evaluation-data predictions. Although there is some difference between these figures, the difference is not significant at an \(\alpha = 0.05\) level, as the Binomial test above shows.
