{r setup, include=FALSE, echo=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE)
# Load libraries
library(caret)
library(corrplot)
library(dplyr)
library(ggplot2)
library(knitr)
library(MASS)
library(tidyr)
library(MLmetrics)
In this homework assignment, you will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels.
Below is a short description of the variables of interest in the data set:
Variable | Description |
---|---|
zn | proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable) |
indus | proportion of non-retail business acres per suburb (predictor variable) |
chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable) |
nox | nitrogen oxides concentration (parts per 10 million) (predictor variable) |
rm | average number of rooms per dwelling (predictor variable) |
age | proportion of owner-occupied units built prior to 1940 (predictor variable) |
dis | weighted mean of distances to five Boston employment centers (predictor variable) |
rad | index of accessibility to radial highways (predictor variable) |
tax | full-value property-tax rate per $10,000 (predictor variable) |
ptratio | pupil-teacher ratio by town (predictor variable) |
lstat | lower status of the population (percent) (predictor variable) |
medv | median value of owner-occupied homes in $1000s (predictor variable) |
target | whether the crime rate is above the median crime rate (1) or not (0) (response variable) |
Create train and test sets using the caret machine learning package. Only use the train data frame until the very end of the process, when we use test to evaluate how effective the model is!
df <- read.csv('crime-training-data_modified.csv', stringsAsFactors=FALSE)
set.seed(1804)
#80% train, 20% test split
train_ix <- createDataPartition(df$target, p=0.8, list=FALSE)
train <- df[train_ix, ]
test <- df[-train_ix, ]
rm(df)
Below are descriptive statistics of the variables. There are no NA values.
(train.summary <- data.frame(unclass(summary(train)), row.names = NULL))
There don’t seem to be any outliers or missing data, so we will proceed directly to examining the variables. First, density plots of each variable for each target class:
train %>%
gather(-target, key='variable', value='value') %>%
ggplot(aes(x=value, group=target, color=target)) +
facet_wrap(~ variable, scales='free') +
geom_density()
Most variables have distinct shapes for each target class. chas and zn are quite skewed, and do not appear terribly informative. indus and tax have two peaks for target = 1, indicating there are two separate processes at work there.
# Example: nox against target, comparing a flexible smoother with a linear fit (red)
ggplot(train, aes(x=jitter(nox), y=jitter(as.numeric(target)))) +
  geom_point() +
  geom_smooth() +
  geom_smooth(method='lm', color='red')
It is to be expected that many of these variables will be correlated with each other:
{r, fig.width=10}
corrplot(cor(train), type='upper', method='number', order='hclust')
Obviously, the concentration of industry is strongly and positively correlated with nitrogen oxide concentration (\(\rho = 0.78\)). Pupil-teacher ratio is negatively correlated with median property values (\(\rho = -0.5\)), and positively correlated with property taxes (\(\rho = 0.49\)). What these and other variables are really getting at is economic class. Each measures a different phenomenon, but they can all be conceived of as operationalizing the same underlying thing. This suggests PCA may be useful on this dataset.
Given the high correlation between the variables, it may be the case that there are numerous interactions that can improve our modeling. In this section, we attempt to determine whether this is the case. We will group the numeric variables by quartile membership and examine line plots, as sketched after the helper function below.
# Convert a numeric vector to percentile ranks in (0, 1]
calc_percentile <- function(x){
  trunc(rank(x)) / length(x)
}
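A minimal sketch of those quartile line plots, using the calc_percentile helper above (binning into four groups and dropping the chas dummy are illustrative choices, not part of the original analysis):

# Bin each numeric predictor into quartiles and plot the share of high-crime
# neighborhoods (target = 1) within each bin
train %>%
  gather(-target, -chas, key='variable', value='value') %>%
  group_by(variable) %>%
  mutate(quartile = ceiling(calc_percentile(value) * 4)) %>%
  group_by(variable, quartile) %>%
  summarise(p_high_crime = mean(target)) %>%
  ggplot(aes(x=quartile, y=p_high_crime)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ variable)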
Function to calculate McFadden’s pseudo-\(R^2\) for logistic models:
calc_r2 <- function(model) {
1 - model$deviance / model$null.deviance
}
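For a binary response, the residual deviance is \(-2\) times the model's maximized log-likelihood, so this quantity is McFadden's \(R^2 = 1 - \ln\hat{L}(M_{model})/\ln\hat{L}(M_{null})\).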
We start with a baseline model that just predicts the overall class proportion, which is nearly balanced between the two classes. If we have trouble improving on this model, we know we are doing something wrong.
This dummy model has an accuracy of about 0.50, sensitivity of 1, and specificity of 0. Since it has zero predictive power, we know that it has a pseudo-\(R^2\) of 0.
m_0 <- glm(target ~ 1, train, family=binomial())
pred_0 <- factor(round(predict(m_0, train, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_0, reference=factor(train$target, levels=c('0', '1')))
m_0 #null deviance and residual deviance are the same --> R2 is zero
The next simplest model uses all available data, without transformations or interactions or polynomials:
m_1 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, train, family=binomial())
pred_1 <- factor(round(predict(m_1, train, type='response')), levels=c('0', '1'))
calc_r2(m_1)
confusionMatrix(data=pred_1, reference=factor(train$target, levels=c('0', '1')))
For background on interpreting interaction terms in regression models, see https://www.theanalysisfactor.com/interpreting-interactions-in-regression/.
We suspect that variable interactions are likely to matter. We can automatically search over interactions using stepwise selection:
Stepwise selection produces the warning "glm.fit: fitted probabilities numerically 0 or 1 occurred". An explanation is given at https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression: if a variable (or combination of variables) perfectly separates the zeroes and ones in the target variable, R reports "perfect or quasi-perfect separation". We still get the model, but the coefficient estimates are inflated. We are suppressing the warning.
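One quick heuristic for spotting which individual predictors come close to separating the two classes is to compare their within-class ranges (a rough check only; it will not catch separation that arises purely from interaction terms):

# Compare each predictor's range within target = 0 vs. target = 1;
# ranges that barely overlap flag candidates for (quasi-)separation
train %>%
  gather(-target, key='variable', value='value') %>%
  group_by(variable, target) %>%
  summarise(lo = min(value), hi = max(value)) %>%
  arrange(variable, target)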
{r warning=FALSE}
# https://stat.ethz.ch/R-manual/R-patched/library/MASS/html/stepAIC.html
m_2 <- stepAIC(m_1, trace=0,
               scope=list(upper = ~ zn * indus * chas * nox * rm * age * dis * rad * tax * ptratio * lstat * medv,
                          lower = ~ 1))
pred_2 <- factor(round(predict(m_2, train, type='response')), levels=c('0', '1'))
calc_r2(m_2)
confusionMatrix(data=pred_2, reference=factor(train$target, levels=c('0', '1')))
summary(m_2)
However, this model is probably overfit. By the common heuristic of roughly 15 observations of the rarer class per predictor, we have enough data for:
min(table(train$target)) / 15
i.e., 12 variables.
To correct for this overfitting, we will use the p.adjust function to revise our p-values, and then use those that remain significant at \(p < 0.05\) for the next model:
m_2_p <- summary(m_2)$coefficients[,4]
sort(p.adjust(m_2_p))
Using the top terms (including the main effect of any variable that appears in a retained interaction):
m_3 <- glm(target ~ age*rm + rad + age*medv, train, family=binomial())
pred_3 <- factor(round(predict(m_3, train, type='response')), levels=c('0', '1'))
calc_r2(m_3)
confusionMatrix(data=pred_3, reference=factor(train$target, levels=c('0', '1')))
The pseudo-\(R^2\) is naturally much lower than that of the overfit \(M_2\). Presumably, however, it will fit the hold-out sample better. We do see that sensitivity, specificity, and positive/negative predictive value are actually still pretty strong. As expected and required, all terms are highly significant.
summary(m_3)
We noted above that we have enough data for up to 12 variables, so we will include the first 12 significant terms from the adjusted p-values:
{r warning=FALSE}
m_4 <- glm(target ~ age*rm + rad + age*medv + indus*tax + dis*tax + nox*age + zn, train, family=binomial())
pred_4 <- factor(round(predict(m_4, train, type='response')), levels=c('0', '1'))
calc_r2(m_4)
confusionMatrix(data=pred_4, reference=factor(train$target, levels=c('0', '1')))
Despite adding all these variables, we see that the confusion matrix metrics are not that much higher. Pseudo-\(R^2\) did take a nice bump, though. Nonetheless, it is possible that this model will not fit the hold-out sample as well as \(M_3\).
summary(m_4)
# As suggested by the correlation structure, run PCA on the 12 standardized predictors
pca <- prcomp(train[,1:12], retx=TRUE, center=TRUE, scale=TRUE)
summary(pca)
The first five principal components account for about 87 percent of the variance, so we will use those for modeling:
pca_df <- as.data.frame(cbind(train$target, pca$x[,1:5]))
colnames(pca_df) <- c('target', 'PC1', 'PC2' ,'PC3', 'PC4', 'PC5')
m_5 <- glm(target ~ ., pca_df, family=binomial())
pred_5 <- factor(round(predict(m_5, pca_df, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_5, reference=factor(train$target, levels=c('0', '1')))
calc_r2(m_5)
This model has confusion matrix metrics similar to some of the models above, though its pseudo-\(R^2\) value is a bit low.
The results of this exercise with PCA seem to suggest there are three separate 'clusters' of phenomena that affect crime level, at least at a statistically significant level. All three are negatively related to the target.
summary(m_5)
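To see which original variables drive these components, we can take a quick look at the PCA loadings (the rotation matrix returned by prcomp); this is purely descriptive and not part of the model itself:

# Loadings of the twelve predictors on the first five principal components
round(pca$rotation[, 1:5], 2)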
Model | pseudo-\(R^2\) | Accuracy | Description |
---|---|---|---|
m_0 | 0 | 0.5013 | Dummy model |
m_1 | 0.7216759 | 0.9115 | Full Model |
m_2 | 0.9022719 | 0.9759 | Stepwise variable selection with interactions |
m_3 | 0.609916 | 0.8686 | Adjusting m_2 for multiple significance tests |
m_4 | 0.7624825 | 0.9276 | m_3 plus a few more predictors |
m_5 | 0.5806149 | 0.8365 | PCA |
# Don't run until the very end
# confusionMatrix(data=predict(model, test), reference=test$target)
# Evaluate on F1 score
# For PCA prediction:
# pred_xx <- factor(round(predict(m_5, as.data.frame(predict(pca, newdata=test)), type='response')), levels=c('0', '1'))
Model m_1 on the test set:
pred_test_1 <- factor(round(predict(m_1, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_1, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_1, y_true = test$target, positive = "0")
Model m_2 on the test set:
pred_test_2 <- factor(round(predict(m_2, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_2, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_2, y_true = test$target, positive = "0")
Model m_3 on the test set:
pred_test_3 <- factor(round(predict(m_3, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_3, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_3, y_true = test$target, positive = "0")
Model m_4 on the test set:
pred_test_4 <- factor(round(predict(m_4, test, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_4, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_4, y_true = test$target, positive = "0")
Model m_5 on the test set:
# Note: this refits PCA on the test set rather than projecting the test data onto
# the training-set components (see the alternative sketch after this chunk)
pca_test <- prcomp(test[,1:12], retx=TRUE, center=TRUE, scale=TRUE)
pca_test_df <- as.data.frame(cbind(test$target, pca_test$x[,1:5]))
colnames(pca_test_df) <- c('target', 'PC1', 'PC2' ,'PC3', 'PC4', 'PC5')
pred_test_5 <- factor(round(predict(m_5, pca_test_df, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_5, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_5, y_true = test$target, positive = "0")
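Note that the chunk above refits PCA on the test set, so the resulting components are not guaranteed to match the training-set components in sign or ordering; this likely contributes to the poor test metrics for m_5 below. A sketch of the alternative suggested in the commented-out code earlier, projecting the test data onto the training-set PCA (pca_test_proj and pred_test_5b are names introduced here for illustration):

# Project the test predictors onto the principal components fit on the training set
pca_test_proj <- as.data.frame(predict(pca, newdata=test[,1:12])[, 1:5])
pred_test_5b <- factor(round(predict(m_5, pca_test_proj, type='response')), levels=c('0', '1'))
confusionMatrix(data=pred_test_5b, reference=factor(test$target, levels=c('0', '1')))
F1_Score(y_pred = pred_test_5b, y_true = test$target, positive = "0")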
Model | Accuracy | F1 Score | Description |
---|---|---|---|
m_1 | 0.9032 | 0.9126214 | Full Model |
m_2 | 0.9462 | 0.950495 | Stepwise variable selection with interactions |
m_3 | 0.828 | 0.84 | Adjusting m_2 for multiple significance tests |
m_4 | 0.8925 | 0.9056604 | m_3 plus a few more predictors |
m_5 | 0.2258 | 0.1818182 | PCA |