{r setup, include=FALSE, echo=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE)
# Load libraries
library(caret)
library(corrplot)
library(dplyr)
library(ggplot2)
library(knitr)
library(MASS)
library(tidyr)
library(MLmetrics)

Executive Overview

In this homework assignment, you will explore, analyze, and model a data set containing information on crime in various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels.

Below is a short description of the variables of interest in the data set:

| Variable | Description |
|----------|-------------|
| zn | proportion of residential land zoned for large lots (over 25,000 square feet) (predictor variable) |
| indus | proportion of non-retail business acres per suburb (predictor variable) |
| chas | a dummy variable for whether the suburb borders the Charles River (1) or not (0) (predictor variable) |
| nox | nitrogen oxides concentration (parts per 10 million) (predictor variable) |
| rm | average number of rooms per dwelling (predictor variable) |
| age | proportion of owner-occupied units built prior to 1940 (predictor variable) |
| dis | weighted mean of distances to five Boston employment centers (predictor variable) |
| rad | index of accessibility to radial highways (predictor variable) |
| tax | full-value property-tax rate per $10,000 (predictor variable) |
| ptratio | pupil-teacher ratio by town (predictor variable) |
| lstat | lower status of the population (percent) (predictor variable) |
| medv | median value of owner-occupied homes in $1000s (predictor variable) |
| target | whether the crime rate is above the median crime rate (1) or not (0) (response variable) |

Create train and test sets using the caret machine learning package:

Only use the train data frame until the very end of the process, when we use test to evaluate how effective the model is!

df <- read.csv('crime-training-data_modified.csv', stringsAsFactors=FALSE)
set.seed(1804)
#80% train, 20% test split
train_ix <- createDataPartition(df$target, p=0.8, list=FALSE) 
train <- df[train_ix, ]
test <- df[-train_ix, ]
rm(df)

Data Exploration

Below are descriptive statistics for the variables. There are no NA values.

(train.summary <- data.frame(unclass(summary(train)), row.names = NULL))

There do not seem to be any outliers or missing data, so we will proceed directly to examining the variables. First, density plots of each variable for each target class:

train %>%
  gather(-target, key='variable', value='value') %>%
  ggplot(aes(x=value, group=target, color=target)) +
    facet_wrap(~ variable, scales='free') +
    geom_density()

Most variables have distinct shapes for each target class. chas and zn are quite skewed and do not appear terribly informative. indus and tax each have two peaks for target = 1, suggesting two separate processes are at work there.

To look at one predictor more closely, we plot nox against the (jittered) response, with both a loess smooth and a linear fit (red):

ggplot(train, aes(x=jitter(nox), y=jitter(as.numeric(target)))) + 
  geom_point() + 
  geom_smooth() + 
  geom_smooth(method='lm', color='red')

It is to be expected that many of these variables will be correlated with each other:

{r, fig.width=10}
corrplot(cor(train), type='upper', method='number', order='hclust')

Obviously, the concentration of industry is strongly and positively correlated with nitrogen oxide concentration (\(\rho = 0.78\)). Pupil-teacher ratio is negatively correlated with median property values (\(\rho = -0.5\)) and positively correlated with property taxes (\(\rho = 0.49\)). What these and other variables are really getting at is economic class: each measures a different phenomenon, but all can be conceived of as operationalizing the same underlying construct. This suggests PCA may be useful on this dataset.
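As a quick check of this intuition, we can look at how much of the predictors' total variance a few components would capture. A sketch, via the eigenvalues of the correlation matrix (this anticipates the PCA model below):

# Cumulative share of predictor variance explained by the leading components
eig <- eigen(cor(train[, names(train) != 'target']))$values
round(cumsum(eig) / sum(eig), 2)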

Checking for interactions

Given the high correlation between the variables, it may be the case that there are numerous interactions that can improve our modeling. In this section, we attempt to determine if this is the case. We will group numeric variables by membership in quartile, and examine line plots.

# Convert each value to its percentile rank (share of observations at or below it)
calc_percentile <- function(x){
  trunc(rank(x)) / length(x)
}
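A sketch of how this check might look, using calc_percentile to form approximate quartile bins; nox and age are chosen arbitrarily for illustration, and non-parallel lines would hint at an interaction:

train %>%
  mutate(nox_q = cut(calc_percentile(nox), breaks=4),     # quartile bins for nox
         age_q = cut(calc_percentile(age), breaks=4)) %>% # quartile bins for age
  group_by(nox_q, age_q) %>%
  summarise(mean_target = mean(target)) %>%
  ggplot(aes(x=age_q, y=mean_target, color=nox_q, group=nox_q)) +
    geom_line()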

Data Preparation

There do not appear to be any missing values or obvious outliers, so no imputation or trimming is required.

Modeling

A function to calculate McFadden's pseudo-\(R^2\), \(R^2_{\text{McF}} = 1 - \log \hat{L}(M) / \log \hat{L}(M_0)\), for logistic models; for a binary glm this equals one minus the ratio of the residual deviance to the null deviance:

calc_r2 <- function(model) {
  1 - model$deviance / model$null.deviance
}

\(M_0\): Dummy model

The baseline model simply predicts the overall proportion of the positive class for every observation; since the two classes are nearly balanced, rounding those predictions assigns everything to a single class. If we have trouble improving on this model, we know we are doing something wrong.

This dummy model has an accuracy of about 0.50, a sensitivity of 1, and a specificity of 0. Since it has no predictive power, its null and residual deviances are identical, so its pseudo-\(R^2\) is exactly 0.

m_0 <- glm(target ~ 1, train, family=binomial())
pred_0 <- factor(round(predict(m_0, train, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_0, reference=factor(train$target, levels=c('0', '1')))
m_0 #null deviance and residual deviance are the same --> R2 is zero

\(M_1\): Full model

The next simplest model uses all available data, without transformations or interactions or polynomials:

m_1 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, train, family=binomial())
pred_1 <- factor(round(predict(m_1, train, type='response')), levels=c('0', '1'))
calc_r2(m_1)
confusionMatrix(data=pred_1, reference=factor(train$target, levels=c('0', '1')))

\(M_2\): Stepwise variable selection with interactions

See https://www.theanalysisfactor.com/interpreting-interactions-in-regression/ for background on interpreting interactions in regression.

Given the correlations above, interactions between variables are likely. We can search for useful interactions automatically using stepwise selection:

Running stepAIC here produces the warning `glm.fit: fitted probabilities numerically 0 or 1 occurred`. An explanation is given at https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression: if some variable (or combination of terms) perfectly separates the zeroes and ones in the target variable, R reports "perfect or quasi perfect separation"; we still get a model, but the coefficient estimates are inflated. We therefore suppress the warning.

{r warning=FALSE}
# https://stat.ethz.ch/R-manual/R-patched/library/MASS/html/stepAIC.html
m_2 <- stepAIC(m_1, trace=0,
               scope=list(upper = ~ zn * indus * chas * nox * rm * age * dis * rad * tax * ptratio * lstat * medv,
                          lower = ~ 1))
pred_2 <- factor(round(predict(m_2, train, type='response')), levels=c('0', '1'))
calc_r2(m_2)
confusionMatrix(data=pred_2, reference=factor(train$target, levels=c('0', '1')))

summary(m_2)
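As an aside, the separation likely arises from a combination of interaction terms rather than any single predictor. A quick sketch to confirm that no single variable separates the classes on its own:

sapply(train[, names(train) != 'target'], function(x) {
  r0 <- range(x[train$target == 0])
  r1 <- range(x[train$target == 1])
  r0[2] < r1[1] || r1[2] < r0[1]  # TRUE if that variable alone perfectly separates the classes
})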

However, this model is probably overfit. By the common heuristic of roughly 15 events per candidate predictor, we have enough data for:

min(table(train$target)) / 15  # size of the smaller class, divided by 15

i.e., about 12 variables.

\(M_3\): Adjusting for multiple significance tests

To correct for this overfitting, we will use the p.adjust function (which applies the Holm correction by default) to revise our p-values, and then keep those terms that remain significant at \(p = 0.05\) for the next model:

m_2_p <- summary(m_2)$coefficients[,4]  # raw p-values from the stepwise model
sort(p.adjust(m_2_p))

Using the top terms, and including the main effect of each variable that appears in a retained interaction:

m_3 <- glm(target ~ age*rm + rad + age*medv, train, family=binomial())
pred_3 <- factor(round(predict(m_3, train, type='response')), levels=c('0', '1'))
calc_r2(m_3)
confusionMatrix(data=pred_3, reference=factor(train$target, levels=c('0', '1')))

The pseudo-\(R^2\) is naturally much lower than that of the overfit \(M_2\); presumably, however, this model will fit the hold-out sample better. Sensitivity, specificity, and positive/negative predictive value all remain fairly strong, and, as expected and required, all terms are highly significant.

summary(m_3)
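If we wanted evidence for that presumption without touching the test set, k-fold cross-validation within the training data is one option. A minimal sketch using caret (not part of the original workflow; the factor conversion is needed because caret treats a numeric response as regression):

cv_ctrl <- trainControl(method='cv', number=10)
train_cv <- train
train_cv$target <- factor(train_cv$target, levels=c('0', '1'))
m_3_cv <- caret::train(target ~ age*rm + rad + age*medv, data=train_cv,
                       method='glm', family=binomial(), trControl=cv_ctrl)
m_3_cv$results$Accuracy  # mean cross-validated accuracy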

\(M_4\): Previous model + a few more predictors

We noted above that we have enough data for up to 12 predictors, so we will include the first 12 significant terms from the p-value adjustment:

{r warning=FALSE}
m_4 <- glm(target ~ age*rm + rad + age*medv + indus*tax + dis*tax + nox*age + zn, train, family=binomial())
pred_4 <- factor(round(predict(m_4, train, type='response')), levels=c('0', '1'))
calc_r2(m_4)
confusionMatrix(data=pred_4, reference=factor(train$target, levels=c('0', '1')))

Despite all the added terms, the confusion-matrix metrics are not much higher, though the pseudo-\(R^2\) did take a nice bump. Nonetheless, it is possible that this model will not fit the hold-out sample as well as \(M_3\).

summary(m_4)

\(M_5\): PCA

Finally, we fit a logistic model on principal components of the centered and scaled predictors:

pca <- prcomp(train[,1:12], retx=TRUE, center=TRUE, scale=TRUE)
summary(pca)

The first five principal components account for about 87 percent of the total variance, so we will use those for modeling:

pca_df <- as.data.frame(cbind(train$target, pca$x[,1:5]))
colnames(pca_df) <- c('target', 'PC1', 'PC2' ,'PC3', 'PC4', 'PC5')
m_5 <- glm(target ~ ., pca_df, family=binomial())
pred_5 <- factor(round(predict(m_5, pca_df, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_5, reference=factor(train$target, levels=c('0', '1')))
calc_r2(m_5)

This model's confusion-matrix metrics are similar to those of some models above, though its pseudo-\(R^2\) value is a bit low.

The results of this PCA exercise suggest there are three separate 'clusters' of phenomena that affect the crime level, at least at a statistically significant level. All three are negatively related to the response.

summary(m_5)
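To see which original variables drive each retained component, we can inspect the loadings (a quick supplementary check):

round(pca$rotation[, 1:5], 2)  # loadings of the five retained components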

Summary of Pseudo-\(R^2\) and Accuracy of Different Models on Training Data

| Model | pseudo-\(R^2\) | Accuracy | Description |
|-------|---------------|----------|-------------|
| m_0 | 0 | 0.5013 | Dummy model |
| m_1 | 0.7216759 | 0.9115 | Full Model |
| m_2 | 0.9022719 | 0.9759 | Stepwise variable selection with interactions |
| m_3 | 0.609916 | 0.8686 | Adjusting m_2 for multiple significance tests |
| m_4 | 0.7624825 | 0.9276 | m_3 plus a few more predictors |
| m_5 | 0.5806149 | 0.8365 | PCA |
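As an aside, a table like the one above can be assembled programmatically from the objects already in memory. A sketch, using kable() from knitr (loaded in the setup chunk):

models <- list(m_0=m_0, m_1=m_1, m_2=m_2, m_3=m_3, m_4=m_4, m_5=m_5)
preds <- list(pred_0, pred_1, pred_2, pred_3, pred_4, pred_5)
truth <- factor(train$target, levels=c('0', '1'))
data.frame(pseudo_R2 = sapply(models, calc_r2),
           Accuracy = sapply(preds, function(p) mean(p == truth))) %>%
  kable()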

Evaluating the Models on the Test Set

# Don't run until the very end
# confusionMatrix(data=predict(model, test), reference=test$target)
# Evaluate on F1 score
# For PCA prediction:
# pred_xx <- factor(round(predict(m_5, as.data.frame(predict(pca, newdata=test)), type='response')), levels=c('0', '1'))
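For reference, the F1 score reported below is the harmonic mean of precision and recall, computed here with '0' as the positive class:

\[ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]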

Model m_1 on the test set:

pred_test_1 <- factor(round(predict(m_1, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_1, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_1, y_true = test$target, positive = "0")

Model m_2 on the test set:

pred_test_2 <- factor(round(predict(m_2, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_2, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_2, y_true = test$target, positive = "0")

Model m_3 on the test set:

pred_test_3 <- factor(round(predict(m_3, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_3, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_3, y_true = test$target, positive = "0")

Model m_4 on the test set:

pred_test_4 <- factor(round(predict(m_4, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_4, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_4, y_true = test$target, positive = "0")

Model m_5 on the test set:

# Project the test predictors through the PCA fit on the training data;
# re-fitting prcomp() on the test set could produce different (even sign-flipped) components.
pca_test_df <- as.data.frame(predict(pca, newdata=test[,1:12])[, 1:5])
pred_test_5 <- factor(round(predict(m_5, pca_test_df, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_5, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_5, y_true = test$target, positive = "0")

Summary of Accuracy and F1 Score on Test Set

| Model | Accuracy | F1 Score | Description |
|-------|----------|----------|-------------|
| m_1 | 0.9032 | 0.9126214 | Full Model |
| m_2 | 0.9462 | 0.950495 | Stepwise variable selection with interactions |
| m_3 | 0.828 | 0.84 | Adjusting m_2 for multiple significance tests |
| m_4 | 0.8925 | 0.9056604 | m_3 plus a few more predictors |
| m_5 | 0.2258 | 0.1818182 | PCA |

Analysis of Final Model