Overview

In this homework assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether or not the crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)
indus: proportion of non-retail business acres per suburb (predictor variable)
chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable)
nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)
rm: average number of rooms per dwelling (predictor variable)
age: proportion of owner-occupied units built prior to 1940 (predictor variable)
dis: weighted mean of distances to five Boston employment centers (predictor variable)
rad: index of accessibility to radial highways (predictor variable)
tax: full-value property-tax rate per $10,000 (predictor variable)
ptratio: pupil-teacher ratio by town (predictor variable)
black: 1000(Bk - 0.63)2 where Bk is the proportion of blacks by town (predictor variable)
lstat: lower status of the population (percent) (predictor variable)
medv: median value of owner-occupied homes in $1000s (predictor variable)
target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)

Deliverables:

A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away from technical details.
Assigned prediction (probabilities, classifications) for the evaluation data set. Use 0.5 threshold.
Include your R statistical programming code in an Appendix.

1. DATA EXPLORATION

Describe the size and the variables in the crime training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job.

Data acquisition

all_train <- read.csv("https://raw.githubusercontent.com/miachen410/DATA621/master/HW%233/crime-training-data_modified.csv")
eval <- read.csv("https://raw.githubusercontent.com/miachen410/DATA621/master/HW%233/crime-evaluation-data_modified.csv")

Data structure

There are 466 observations and 13 variables in the training dataset.

str(all_train)

## 'data.frame':    466 obs. of  13 variables:
##  $ zn     : num  0 0 0 30 0 0 0 0 0 80 ...
##  $ indus  : num  19.58 19.58 18.1 4.93 2.46 ...
##  $ chas   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
##  $ rm     : num  7.93 5.4 6.49 6.39 7.16 ...
##  $ age    : num  96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
##  $ dis    : num  2.05 1.32 1.98 7.04 2.7 ...
##  $ rad    : int  5 5 24 6 3 5 24 24 5 1 ...
##  $ tax    : int  403 403 666 300 193 384 666 666 224 315 ...
##  $ ptratio: num  14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
##  $ lstat  : num  3.7 26.82 18.85 5.19 4.82 ...
##  $ medv   : num  50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
##  $ target : int  1 1 1 0 0 0 1 1 0 0 ...

Just to make the data easier on the eyes, we convert the 1s in chas to “Y”, if the neighborhood borders Charles River, and 0s to “N”, if not.

# all_train$chas[all_train$chas == 1] <- "Y"
# all_train$chas[all_train$chas == 0] <- "N"

We also convert the 1s in target to “Above”, if the crime rate is above the median, and 0s to “Below”, if it is below the median.

# all_train$target[all_train$target == 1] <- "Above"
# all_train$target[all_train$target == 0] <- "Below"

Since variables chas and target are categorical, we are going to change their class from integer to factor:

all_train$chas <- as.factor(all_train$chas)
all_train$target <- as.factor(all_train$target)

Let’s look at the data structure again:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

glimpse(all_train)

## Observations: 466
## Variables: 13
## $ zn      <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100,…
## $ indus   <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.1…
## $ chas    <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, …
## $ rm      <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, …
## $ age     <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1,…
## $ dis     <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, …
## $ rad     <int> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5…
## $ tax     <int> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 3…
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16…
## $ lstat   <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68…
## $ medv    <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9…
## $ target  <fct> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1,…

Summary Statistics

Looking at the target variable, we see 237 observations are below the median crime rate and 229 are above the median crime rate, thus we have roughly the same number of at risk and not-at-risk neighborhoods in our training data set.

summary(all_train)

##        zn             indus        chas         nox        
##  Min.   :  0.00   Min.   : 0.460   0:433   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1: 33   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690           Median :0.5380  
##  Mean   : 11.58   Mean   :11.105           Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100           3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740           Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv       target 
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00   0:237  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02   1:229  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20          
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59          
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00          
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00

Missing values

How many rows of data have NA values? 0 rows, thus there are no missing values in the dataset.

nrow(all_train[is.na(all_train),])

## [1] 0

Visulization of the data set

Let’s first look at the density plots of the numerical variables to view their shapes and distributions:

library(reshape)

## 
## Attaching package: 'reshape'

## The following object is masked from 'package:dplyr':
## 
##     rename

library(ggplot2)
datasub = melt(all_train)

## Using chas, target as id variables

ggplot(datasub, aes(x = value)) + 
    geom_density(fill = "blue") + 
    facet_wrap(~variable, scales = 'free')

For categorial variable chas, we can look at a confusion matrix table to make sure that we have enough observations for all levels:

xtabs(~ target + chas, data=all_train)

##       chas
## target   0   1
##      0 225  12
##      1 208  21

Then we will look at the boxplots of the numerical variables in relationship to target variable:

ggplot(datasub, aes(x = target, y = value)) + 
    geom_boxplot() + 
    facet_wrap(~variable, scales = 'free')

2. DATA PREPARATION

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this.

Outlier Imputation

From the boxplots above, we can see there are many outliers so we are going to fix them by replacing with medians.

train_clean_pre <- all_train 
# %>% mutate(
#   zn = ifelse(zn > 80, median(zn), zn),
#   indus = ifelse(indus > 20, median(indus), indus),
#   rm = ifelse(rm > 7.5 | rm < 5, median(rm), rm),
#   dis = ifelse(dis > 7.5, median(dis), dis),
#   tax = ifelse(tax >= 700, median(tax), tax),
#   ptratio = ifelse(ptratio < 15, median(ptratio), ptratio),
#   lstat = ifelse(lstat > 23, median(lstat), lstat),
#   medv = ifelse(medv > 35 | medv < 10, median(medv), medv)
# )
set.seed(121)
split <- caret::createDataPartition(train_clean_pre$target, p=0.85, list=FALSE)
train_clean <- train_clean_pre[split, ]
validation <- train_clean_pre[ -split, ]

Let’s look at the boxplots again after the outliers being imputed with median.

ggplot(melt(train_clean), aes(x = target, y = value)) + 
    geom_boxplot() + 
    facet_wrap(~variable, scales = 'free')

## Using chas, target as id variables

3. BUILD MODELS

Using the training data, build at least three different binary logistic regression models, using different variables (or the same variables with different transformations). You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.

Be sure to explain how you can make inferences from the model, as well as discuss other relevant model output. Discuss the coefficients in the models, do they make sense? Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

We first create a full model by including all the variables:

fullMod <- glm(target ~ ., data = train_clean, family = 'binomial')
summary(fullMod)

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = train_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9657  -0.1865  -0.0020   0.0027   3.3571  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -39.933271   6.795913  -5.876 4.20e-09 ***
## zn           -0.057803   0.034353  -1.683  0.09245 .  
## indus        -0.051408   0.050393  -1.020  0.30766    
## chas1         0.703716   0.782689   0.899  0.36860    
## nox          48.096206   8.244039   5.834 5.41e-09 ***
## rm           -0.702882   0.782068  -0.899  0.36879    
## age           0.028333   0.014006   2.023  0.04308 *  
## dis           0.697716   0.230131   3.032  0.00243 ** 
## rad           0.706268   0.176744   3.996 6.44e-05 ***
## tax          -0.006777   0.003232  -2.097  0.03601 *  
## ptratio       0.414753   0.135580   3.059  0.00222 ** 
## lstat         0.074600   0.056935   1.310  0.19011    
## medv          0.191484   0.073820   2.594  0.00949 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 550.24  on 396  degrees of freedom
## Residual deviance: 175.85  on 384  degrees of freedom
## AIC: 201.85
## 
## Number of Fisher Scoring iterations: 9

Model 1 - P-values Selection

From the full model, we select the variables that have small p-values:

target ~ nox + age + rad + tax + ptratio + lstat

logMod1 <- glm(target ~ nox + age + rad + tax + ptratio + lstat, 
               data = train_clean, 
               family = 'binomial')
summary(logMod1)

## 
## Call:
## glm(formula = target ~ nox + age + rad + tax + ptratio + lstat, 
##     family = "binomial", data = train_clean)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.00876  -0.25415  -0.01591   0.00496   2.69483  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -24.049881   3.853469  -6.241 4.35e-10 ***
## nox          34.024441   5.592815   6.084 1.18e-09 ***
## age           0.015474   0.010432   1.483 0.137982    
## rad           0.767007   0.149549   5.129 2.92e-07 ***
## tax          -0.010374   0.002828  -3.668 0.000244 ***
## ptratio       0.209831   0.093751   2.238 0.025210 *  
## lstat         0.008948   0.038337   0.233 0.815440    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 550.24  on 396  degrees of freedom
## Residual deviance: 196.37  on 390  degrees of freedom
## AIC: 210.37
## 
## Number of Fisher Scoring iterations: 8

Model 2 - Backward Selection

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

logMod2 <- fullMod %>% stepAIC(direction = "backward", trace = FALSE)
summary(logMod2)

## 
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio + 
##     lstat + medv, family = "binomial", data = train_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9810  -0.2345  -0.0020   0.0038   3.2862  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -38.661039   6.361731  -6.077 1.22e-09 ***
## zn           -0.064092   0.032646  -1.963 0.049617 *  
## nox          42.870904   6.948143   6.170 6.82e-10 ***
## age           0.022268   0.011677   1.907 0.056531 .  
## dis           0.621377   0.214394   2.898 0.003752 ** 
## rad           0.753365   0.161465   4.666 3.07e-06 ***
## tax          -0.008576   0.002893  -2.964 0.003037 ** 
## ptratio       0.344730   0.116941   2.948 0.003199 ** 
## lstat         0.096831   0.050645   1.912 0.055883 .  
## medv          0.139702   0.042094   3.319 0.000904 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 550.24  on 396  degrees of freedom
## Residual deviance: 178.21  on 387  degrees of freedom
## AIC: 198.21
## 
## Number of Fisher Scoring iterations: 9

Model 3 - Forward Selection

# Create an empty model with no variables
emptyMod <- glm(target ~ 1, data = train_clean, family = 'binomial')
logMod3 <- emptyMod %>% 
  stepAIC(direction = "forward",
          scope = ~ zn + indus + chas + nox + rm + age + dis 
                    + rad + tax + ptratio + lstat + medv, 
          trace = FALSE)
summary(logMod3)

## 
## Call:
## glm(formula = target ~ nox + rad + tax + ptratio + medv + lstat + 
##     dis + zn + age, family = "binomial", data = train_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9810  -0.2345  -0.0020   0.0038   3.2862  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -38.661039   6.361731  -6.077 1.22e-09 ***
## nox          42.870904   6.948143   6.170 6.82e-10 ***
## rad           0.753365   0.161465   4.666 3.07e-06 ***
## tax          -0.008576   0.002893  -2.964 0.003037 ** 
## ptratio       0.344730   0.116941   2.948 0.003199 ** 
## medv          0.139702   0.042094   3.319 0.000904 ***
## lstat         0.096831   0.050645   1.912 0.055883 .  
## dis           0.621377   0.214394   2.898 0.003752 ** 
## zn           -0.064092   0.032646  -1.963 0.049617 *  
## age           0.022268   0.011677   1.907 0.056531 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 550.24  on 396  degrees of freedom
## Residual deviance: 178.21  on 387  degrees of freedom
## AIC: 198.21
## 
## Number of Fisher Scoring iterations: 9

4. SELECT MODELS

Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models.

For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.

formula(logMod1) # Model 1 formula

## target ~ nox + age + rad + tax + ptratio + lstat

formula(logMod2) # Model 2 formula

## target ~ zn + nox + age + dis + rad + tax + ptratio + lstat + 
##     medv

formula(logMod3) # Model 3 formula

## target ~ nox + rad + tax + ptratio + medv + lstat + dis + zn + 
##     age

preds1 =predict(logMod1, newdata = validation)
preds2 =predict(logMod2, newdata = validation)
preds3 =predict(logMod3, newdata = validation)
preds1[preds1 >= 0.5] <- 1
preds1[preds1 < 0.5] <- 0
preds1 = as.factor(preds1)
preds2[preds2 >= 0.5] <- 1
preds2[preds2 < 0.5] <- 0
preds2 = as.factor(preds2)
preds3[preds3 >= 0.5] <- 1
preds3[preds3 < 0.5] <- 0
preds3 = as.factor(preds3)

library(caret)

## Loading required package: lattice

m1cM <- confusionMatrix(preds1, validation$target, mode = "everything")
m2cM <- confusionMatrix(preds2, validation$target, mode = "everything")
m3cM <- confusionMatrix(preds3, validation$target, mode = "everything")

fourfoldplot(m1cM$table, color = c("#B22222", "#2E8B57"), main="Model 1")

fourfoldplot(m2cM$table, color = c("#B22222", "#2E8B57"), main="Model 2")

fourfoldplot(m3cM$table, color = c("#B22222", "#2E8B57"), main="Model 3")

## Model2 and model3 has better accuracy and lower error rate

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

temp <- data.frame(m1cM$overall, 
                   m2cM$overall, 
                   m3cM$overall) %>%
  t() %>%
  data.frame() %>%
  dplyr::select(Accuracy) %>%
  mutate(Classification_Error_Rate = 1-Accuracy)
Summ_Stat <-data.frame(m1cM$byClass, 
                   m2cM$byClass, 
                   m3cM$byClass) %>%
  t() %>%
  data.frame() %>%
  cbind(temp) %>%
  mutate(Model = c("Model 1", "Model 2", "Model 3")) %>%
  dplyr::select(Model, Accuracy, Classification_Error_Rate, Precision, Sensitivity, Specificity, F1) %>%
  mutate_if(is.numeric, round,3) %>%
  kable('html', escape = F) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),full_width = F)
Summ_Stat

Model	Accuracy	Classification_Error_Rate	Precision	Sensitivity	Specificity	F1
Model 1	0.928	0.072	0.917	0.943	0.912	0.930
Model 2	0.913	0.087	0.892	0.943	0.882	0.917
Model 3	0.913	0.087	0.892	0.943	0.882	0.917

Model 2,3 has better AUC

getROC <- function(model) {
    name <- deparse(substitute(model))
    pred.prob1 <- predict(model, newdata = validation)
    p1 <- data.frame(pred = validation$target, prob = pred.prob1)
    p1 <- p1[order(p1$prob),]
    rocobj <- pROC::roc(p1$pred, p1$prob)
    plot(rocobj, asp=NA, legacy.axes = TRUE, print.auc=TRUE,
         xlab="Specificity", main = name)
}
par(mfrow=c(3,3))
getROC(logMod1)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

getROC(logMod2)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

getROC(logMod3)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Model 2 and 3 has lower AIC. Based on comparison result, we eventually choose Model 2/3 (same model)

Make Prediction

eval$chas <- as.factor(eval$chas)
prediction = predict(logMod2, newdata = eval)
prediction[prediction >= 0.5] <- 1
prediction[prediction < 0.5] <- 0
prediction = as.factor(prediction)
prediction

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  0  0  1  1  0  0  0  0  0  0  1  1  1  1  0  0  0  1  0  0  0  0  0  0  0 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 
##  1  0  1  1  1  1  1  1  1  1  1  1  1  1  0 
## Levels: 0 1

DATA621 - HW#3

Mia Chen, Wei Zhou

4/12/2020