{r setup, include=FALSE, echo=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE)
# Load libraries
library(caret)
library(corrplot)
library(dplyr)
library(ggplot2)
library(knitr)
library(MASS)
library(tidyr)
library(MLmetrics)

Executive Overview

In this homework assignment, you will explore, analyze, and model a data set containing information on crime in various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels.

Below is a short description of the variables of interest in the data set:

| Variable | Description |
|----------|-------------|
| zn | proportion of residential land zoned for large lots (over 25,000 square feet) (predictor variable) |
| indus | proportion of non-retail business acres per suburb (predictor variable) |
| chas | a dummy variable for whether the suburb borders the Charles River (1) or not (0) (predictor variable) |
| nox | nitrogen oxides concentration (parts per 10 million) (predictor variable) |
| rm | average number of rooms per dwelling (predictor variable) |
| age | proportion of owner-occupied units built prior to 1940 (predictor variable) |
| dis | weighted mean of distances to five Boston employment centers (predictor variable) |
| rad | index of accessibility to radial highways (predictor variable) |
| tax | full-value property-tax rate per $10,000 (predictor variable) |
| ptratio | pupil-teacher ratio by town (predictor variable) |
| lstat | lower status of the population (percent) (predictor variable) |
| medv | median value of owner-occupied homes in $1000s (predictor variable) |
| target | whether the crime rate is above the median crime rate (1) or not (0) (response variable) |

Create train and test sets using the caret machine learning package:

Only use the train data frame until the very end of the process, when we use test to evaluate how effective the model is!

df <- read.csv('crime-training-data_modified.csv', stringsAsFactors=FALSE)
set.seed(1804)
#80% train, 20% test split
train_ix <- createDataPartition(df$target, p=0.8, list=FALSE) 
train <- df[train_ix, ]
test <- df[-train_ix, ]
rm(df)

Data Exploration

Below are descriptive statistics for the variables. There are no NA values.

(train.summary <- data.frame(unclass(summary(train)), row.names = NULL))

There do not seem to be any outliers or missing data, so we will proceed directly to examining the variables. First, density plots of each variable for each target class:

train %>%
  gather(-target, key='variable', value='value') %>%
  ggplot(aes(x=value, group=target, color=target)) +
    facet_wrap(~ variable, scales='free') +
    geom_density()

Most variables have distinct shapes for each target class. chas and zn are quite skewed and do not appear terribly informative. indus and tax each have two peaks for target = 1, suggesting two separate processes are at work there.

To look at one predictor more closely, we plot nox against the (jittered) response, with both a loess smooth and a linear fit (red):

ggplot(train, aes(x=jitter(nox), y=jitter(as.numeric(target)))) + 
  geom_point() + 
  geom_smooth() + 
  geom_smooth(method='lm', color='red')

It is to be expected that many of these variables will be correlated with each other:

{r, fig.width=10}
corrplot(cor(train), type='upper', method='number', order='hclust')

Obviously, the concentration of industry is strongly and positively correlated with nitrogen oxide concentration (\(\rho = 0.78\)). Pupil-teacher ratio is negatively correlated with median property values (\(\rho = -0.5\)) and positively correlated with property taxes (\(\rho = 0.49\)). What these and other variables are really getting at is economic class: each measures a different phenomenon, but all can be conceived of as operationalizing the same underlying construct. This suggests PCA may be useful on this dataset.
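As a quick check of this intuition, we can look at how much of the predictors' total variance a few components would capture. A sketch, via the eigenvalues of the correlation matrix (this anticipates the PCA model below):

# Cumulative share of predictor variance explained by the leading components
eig <- eigen(cor(train[, names(train) != 'target']))$values
round(cumsum(eig) / sum(eig), 2)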

Checking for interactions

Given the high correlation between the variables, it may be the case that there are numerous interactions that can improve our modeling. In this section, we attempt to determine if this is the case. We will group numeric variables by membership in quartile, and examine line plots.

# Convert each value to its percentile rank (share of observations at or below it)
calc_percentile <- function(x){
  trunc(rank(x)) / length(x)
}
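A sketch of how this check might look, using calc_percentile to form approximate quartile bins; nox and age are chosen arbitrarily for illustration, and non-parallel lines would hint at an interaction:

train %>%
  mutate(nox_q = cut(calc_percentile(nox), breaks=4),     # quartile bins for nox
         age_q = cut(calc_percentile(age), breaks=4)) %>% # quartile bins for age
  group_by(nox_q, age_q) %>%
  summarise(mean_target = mean(target)) %>%
  ggplot(aes(x=age_q, y=mean_target, color=nox_q, group=nox_q)) +
    geom_line()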

Data Preparation

There do not appear to be any missing values or obvious outliers, so no imputation or trimming is required.

Modeling

A function to calculate McFadden's pseudo-\(R^2\), \(R^2_{\text{McF}} = 1 - \log \hat{L}(M) / \log \hat{L}(M_0)\), for logistic models; for a binary glm this equals one minus the ratio of the residual deviance to the null deviance:

calc_r2 <- function(model) {
  1 - model$deviance / model$null.deviance
}

\(M_0\): Dummy model

The baseline model simply predicts the overall proportion of the positive class for every observation; since the two classes are nearly balanced, rounding those predictions assigns everything to a single class. If we have trouble improving on this model, we know we are doing something wrong.

This dummy model has an accuracy of about 0.50, a sensitivity of 1, and a specificity of 0. Since it has no predictive power, its null and residual deviances are identical, so its pseudo-\(R^2\) is exactly 0.

m_0 <- glm(target ~ 1, train, family=binomial())
pred_0 <- factor(round(predict(m_0, train, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_0, reference=factor(train$target, levels=c('0', '1')))
m_0 #null deviance and residual deviance are the same --> R2 is zero

\(M_1\): Full model

The next simplest model uses all available data, without transformations or interactions or polynomials:

m_1 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, train, family=binomial())
pred_1 <- factor(round(predict(m_1, train, type='response')), levels=c('0', '1'))
calc_r2(m_1)
confusionMatrix(data=pred_1, reference=factor(train$target, levels=c('0', '1')))

\(M_2\): Stepwise variable selection with interactions

See https://www.theanalysisfactor.com/interpreting-interactions-in-regression/ for background on interpreting interactions in regression.

Given the correlations above, interactions between variables are likely. We can search for useful interactions automatically using stepwise selection:

Running stepAIC here produces the warning `glm.fit: fitted probabilities numerically 0 or 1 occurred`. An explanation is given at https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression: if some variable (or combination of terms) perfectly separates the zeroes and ones in the target variable, R reports "perfect or quasi perfect separation"; we still get a model, but the coefficient estimates are inflated. We therefore suppress the warning.

{r warning=FALSE}
# https://stat.ethz.ch/R-manual/R-patched/library/MASS/html/stepAIC.html
m_2 <- stepAIC(m_1, trace=0,
               scope=list(upper = ~ zn * indus * chas * nox * rm * age * dis * rad * tax * ptratio * lstat * medv,
                          lower = ~ 1))
pred_2 <- factor(round(predict(m_2, train, type='response')), levels=c('0', '1'))
calc_r2(m_2)
confusionMatrix(data=pred_2, reference=factor(train$target, levels=c('0', '1')))

summary(m_2)
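As an aside, the separation likely arises from a combination of interaction terms rather than any single predictor. A quick sketch to confirm that no single variable separates the classes on its own:

sapply(train[, names(train) != 'target'], function(x) {
  r0 <- range(x[train$target == 0])
  r1 <- range(x[train$target == 1])
  r0[2] < r1[1] || r1[2] < r0[1]  # TRUE if that variable alone perfectly separates the classes
})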

However, this model is probably overfit. By the common heuristic of roughly 15 events per candidate predictor, we have enough data for:

min(table(train$target)) / 15  # size of the smaller class, divided by 15

i.e., about 12 variables.

\(M_3\): Adjusting for multiple significance tests

To correct for this overfitting, we will use the p.adjust function (which applies the Holm correction by default) to revise our p-values, and then keep those terms that remain significant at \(p = 0.05\) for the next model:

m_2_p <- summary(m_2)$coefficients[,4]  # raw p-values from the stepwise model
sort(p.adjust(m_2_p))

Using the top terms, and including the main effect of each variable that appears in a retained interaction:

m_3 <- glm(target ~ age*rm + rad + age*medv, train, family=binomial())
pred_3 <- factor(round(predict(m_3, train, type='response')), levels=c('0', '1'))
calc_r2(m_3)
confusionMatrix(data=pred_3, reference=factor(train$target, levels=c('0', '1')))

The pseudo-\(R^2\) is naturally much lower than that of the overfit \(M_2\); presumably, however, this model will fit the hold-out sample better. Sensitivity, specificity, and positive/negative predictive value all remain fairly strong, and, as expected and required, all terms are highly significant.

summary(m_3)
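If we wanted evidence for that presumption without touching the test set, k-fold cross-validation within the training data is one option. A minimal sketch using caret (not part of the original workflow; the factor conversion is needed because caret treats a numeric response as regression):

cv_ctrl <- trainControl(method='cv', number=10)
train_cv <- train
train_cv$target <- factor(train_cv$target, levels=c('0', '1'))
m_3_cv <- caret::train(target ~ age*rm + rad + age*medv, data=train_cv,
                       method='glm', family=binomial(), trControl=cv_ctrl)
m_3_cv$results$Accuracy  # mean cross-validated accuracy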

\(M_4\): Previous model + a few more predictors

We noted above that we have enough data for up to 12 predictors, so we will include the first 12 significant terms from the p-value adjustment:

{r warning=FALSE}
m_4 <- glm(target ~ age*rm + rad + age*medv + indus*tax + dis*tax + nox*age + zn, train, family=binomial())
pred_4 <- factor(round(predict(m_4, train, type='response')), levels=c('0', '1'))
calc_r2(m_4)
confusionMatrix(data=pred_4, reference=factor(train$target, levels=c('0', '1')))

Despite all the added terms, the confusion-matrix metrics are not much higher, though the pseudo-\(R^2\) did take a nice bump. Nonetheless, it is possible that this model will not fit the hold-out sample as well as \(M_3\).

summary(m_4)

\(M_5\): PCA

Finally, we fit a logistic model on principal components of the centered and scaled predictors:

pca <- prcomp(train[,1:12], retx=TRUE, center=TRUE, scale=TRUE)
summary(pca)

The first five principal components account for about 87 percent of the total variance, so we will use those for modeling:

pca_df <- as.data.frame(cbind(train$target, pca$x[,1:5]))
colnames(pca_df) <- c('target', 'PC1', 'PC2' ,'PC3', 'PC4', 'PC5')
m_5 <- glm(target ~ ., pca_df, family=binomial())
pred_5 <- factor(round(predict(m_5, pca_df, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_5, reference=factor(train$target, levels=c('0', '1')))
calc_r2(m_5)

This model's confusion-matrix metrics are similar to those of some models above, though its pseudo-\(R^2\) value is a bit low.

The results of this PCA exercise suggest there are three separate 'clusters' of phenomena that affect the crime level, at least at a statistically significant level. All three are negatively related to the response.

summary(m_5)
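To see which original variables drive each retained component, we can inspect the loadings (a quick supplementary check):

round(pca$rotation[, 1:5], 2)  # loadings of the five retained components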

Summary of Pseudo-\(R^2\) and Accuracy of Different Models on Training Data

| Model | pseudo-\(R^2\) | Accuracy | Description |
|-------|---------------|----------|-------------|
| m_0 | 0 | 0.5013 | Dummy model |
| m_1 | 0.7216759 | 0.9115 | Full Model |
| m_2 | 0.9022719 | 0.9759 | Stepwise variable selection with interactions |
| m_3 | 0.609916 | 0.8686 | Adjusting m_2 for multiple significance tests |
| m_4 | 0.7624825 | 0.9276 | m_3 plus a few more predictors |
| m_5 | 0.5806149 | 0.8365 | PCA |
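As an aside, a table like the one above can be assembled programmatically from the objects already in memory. A sketch, using kable() from knitr (loaded in the setup chunk):

models <- list(m_0=m_0, m_1=m_1, m_2=m_2, m_3=m_3, m_4=m_4, m_5=m_5)
preds <- list(pred_0, pred_1, pred_2, pred_3, pred_4, pred_5)
truth <- factor(train$target, levels=c('0', '1'))
data.frame(pseudo_R2 = sapply(models, calc_r2),
           Accuracy = sapply(preds, function(p) mean(p == truth))) %>%
  kable()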

Evaluating the Models on the Test Set

# Don't run until the very end
# confusionMatrix(data=predict(model, test), reference=test$target)
# Evaluate on F1 score
# For PCA prediction:
# pred_xx <- factor(round(predict(m_5, as.data.frame(predict(pca, newdata=test)), type='response')), levels=c('0', '1'))
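For reference, the F1 score reported below is the harmonic mean of precision and recall, computed here with '0' as the positive class:

\[ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]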

Model m_1 on the test set:

pred_test_1 <- factor(round(predict(m_1, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_1, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_1, y_true = test$target, positive = "0")

Model m_2 on the test set:

pred_test_2 <- factor(round(predict(m_2, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_2, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_2, y_true = test$target, positive = "0")

Model m_3 on the test set:

pred_test_3 <- factor(round(predict(m_3, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_3, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_3, y_true = test$target, positive = "0")

Model m_4 on the test set:

pred_test_4 <- factor(round(predict(m_4, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_4, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_4, y_true = test$target, positive = "0")

Model m_5 on the test set:

# Project the test predictors through the PCA fit on the training data;
# re-fitting prcomp() on the test set could produce different (even sign-flipped) components.
pca_test_df <- as.data.frame(predict(pca, newdata=test[,1:12])[, 1:5])
pred_test_5 <- factor(round(predict(m_5, pca_test_df, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_5, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_5, y_true = test$target, positive = "0")

Summary of Accuracy and F1 Score on Test Set

| Model | Accuracy | F1 Score | Description |
|-------|----------|----------|-------------|
| m_1 | 0.9032 | 0.9126214 | Full Model |
| m_2 | 0.9462 | 0.950495 | Stepwise variable selection with interactions |
| m_3 | 0.828 | 0.84 | Adjusting m_2 for multiple significance tests |
| m_4 | 0.8925 | 0.9056604 | m_3 plus a few more predictors |
| m_5 | 0.2258 | 0.1818182 | PCA |

Analysis of Final Model