Introduction

This assignment explores, analyzes, and models a data set containing crime and demographic information for the neighborhoods of a city. Each record has a target variable indicating whether the crime rate is above (1) or below (0) the city median crime rate. We build a set of binary logistic regression models using only the provided data, and we generate and evaluate predictions on a test data set.

The paper is organized as follows. In the first section below, we conduct exploratory data analysis of the variables. In the second section, we prepare the data transformations needed to build useful secondary variables. In the third section, we construct three different binary logistic regression models. In the fourth section, we select a model and use classification metrics to assess the choice. The appendix contains the full code.

1 Exploratory Data Analysis

The exploratory data analysis proceeds as follows.
We first consider summary statistics of the data file and look for missing or exceptional values. Then we systematically evaluate each variable in relation to the target variable (high vs. low crime rate), assessing its relevance to the regression models.

Lastly, we will make observations that allow us to explore data transformations in the next section.

1.1 Characteristics of the Files

We load the data from a flat file and display its summary statistics below.

The training data set has 13 columns and 466 observations. The testing data set has 12 columns and 40 observations; the omitted column is, of course, target.

The table below shows, in the column named missing, that the observations have no missing data values. This simplifies our data wrangling efforts immensely.

Training Data
skim_type skim_variable missing mean stdev median max
numeric zn 0 11.6 23.4 0.0 100.0
numeric indus 0 11.1 6.8 9.7 27.7
numeric chas 0 0.1 0.3 0.0 1.0
numeric nox 0 0.6 0.1 0.5 0.9
numeric rm 0 6.3 0.7 6.2 8.8
numeric age 0 68.4 28.3 77.2 100.0
numeric dis 0 3.8 2.1 3.2 12.1
numeric rad 0 9.5 8.7 5.0 24.0
numeric tax 0 409.5 167.9 334.5 711.0
numeric ptratio 0 18.4 2.2 18.9 22.0
numeric lstat 0 12.6 7.1 11.4 38.0
numeric medv 0 22.6 9.2 21.2 50.0
numeric target 0 0.5 0.5 0.0 1.0

1.2 Characteristics of the Variables

Next, let’s consider our expectations of how each variable should impact the target.

Expected Impact of Increase in Variable on Crime Rate
Variable Impact Comments
zn Lower Proportion of a town’s land zoned for lots greater than 25,000 square feet. Large lots are expected to exclude cheaper, smaller houses and to have wealthier owners.
indus Higher Proportion of non-retail business acres per suburb, associated with industry, noise, heavy traffic and unpleasant visual effects.
chas Lower Charles River dummy variable: = 1 if the suburb borders the river, = 0 otherwise. chas proxies the amenities of a riverside location; a wealthier suburb has less crime.
nox Higher Concentration of nitrogen oxides in the atmosphere. nox is pollution, so an increase in nox might make the neighborhood less appealing and more crime-ridden.
rm Lower Average number of rooms in a house, a proxy for wealth. An increase in rooms could be associated with lower crime.
age Higher Proportion of owner-occupied units built before 1940. Unit age is related to structure quality; an older unit should be associated with less affluence.
dis Lower Weighted distance to five employment centers. Closer proximity to an urban center should be associated with more crime, if crime concentrates in urban centers.
rad Lower Index of access to radial highways, measuring proximity to urban centers and ease of access. Easier access may also allow criminals to travel in.
tax Lower Higher taxes should provide more resources to the suburb but could also reduce house values due to the higher tax burden. Both directions are plausible, but we argue higher taxes may fund more policing.
ptratio Higher A low pupil-teacher ratio per school district implies greater wealth and is a proxy for school quality. Thus an increase in the ratio should imply a higher crime rate.
lstat Higher A higher proportion of lower-status residents (less educated, less skilled workers) could imply a higher crime rate because of a greater level of poverty.
medv Lower A higher median property value might imply the neighborhood has a lower crime rate.

The above table draws on, but departs in some of its conclusions from, Table IV of Harrison and Rubinfeld (1978), the paper for which the data set was originally assembled.

After stating our prior expectations, we can use correlations, grouped box-plots and histograms to evaluate the relevance of each variable with respect to the crime rate target.

1.3 Correlations

Correlation matrix of all variables (figure)

The above matrix suggests that a variable with a low absolute correlation to target is likely to be less useful in a predictive model. However, variables closely correlated with other predictors may still be helpful, either as proxies or as candidates for removal.

The dark red and dark blue cells of the correlation matrix hold the most interest for us because they represent the strongest positive and negative correlations. Based on the strength of their correlations with target, we conclude that lstat, tax, rad, dis, age, nox, and indus are the most useful variables.
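
As a cross-check on the heat map, here is a minimal sketch that ranks the predictors by the absolute value of their correlation with target:

# All columns are numeric, so cor() applies directly
cors <- cor(training)[, "target"]
sort(abs(cors[names(cors) != "target"]), decreasing = TRUE)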

It is interesting, for example, that dis is negatively correlated with target, suggesting that distant suburbs are safer than urban areas. Likewise, nox is positively correlated with target, which suggests that pollution is an indicator of a higher crime rate.

However, looking at non-target correlations, we observe some interesting relationships as well:

  • tax and rad are highly correlated, suggesting that suburbs with good highway access also carry high tax rates.

  • rm and medv are highly correlated, suggesting a common underlying wealth effect.

  • nox and indus are highly correlated, suggesting that they may serve as proxies for each other.

  • nox and dis are strongly negatively correlated: greater distance from the urban center means less air pollution.

However, these observations cannot stand alone; they require more granular analysis before we can decide on variable selection or preference in model building. So we next consider the conditional distributions.

1.4 Conditional Distributions

Conditioning on the values of target, we examine the boxplots and histograms of the other variables. In order to view all of the fields simultaneously in one boxplot figure, we pivot the data to a long, skinny format: the column names of the training data are converted to values of a new column, except for target.
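
A minimal sketch of that reshaping step (the full code appears in the appendix):

# Every column except target becomes a (field, val) pair,
# and target is relabeled as a readable High/Low variable.
df.m <- pivot_longer(training, -target, names_to = "field", values_to = "val")
df2  <- mutate(df.m, hicrime = ifelse(target == 1, "High", "Low"))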

Box Plot of each variable, colored by target (figure)

In the above grouped box plots, we show the distribution of each variable conditional on the value of target. (In the chart the grouping variable is named hicrime, a readable High/Low relabeling of target that ggplot can use as a discrete fill.)

Some further conclusions can be drawn from the boxplots. The following variables are clearly useful:

  • dis and nox: their conditional distributions are clearly separated, without very large tails.

  • tax: interesting, though complicated by the large tail.

  • ptratio: shows some differentiation in the expected direction.

The following variable is not useful:

  • chas: all values are bunched together.

The following variables could be useful:

  • age: shows a large left tail for High crime observations. Despite the clear difference in medians, the overlap between the High crime and Low crime boxplots renders age less useful.

  • zn, indus, lstat: the medians of the conditional distributions differ in the expected directions.

Next, we examine the histograms.

Histogram of raw data variables (figure)

2 Data Preparation

2.1 Starting with nox

To prepare the data set and apply transformations that may improve model accuracy, we begin with a top-down approach. We choose the most promising variable, the one with the highest explanatory power for discriminating high and low crime areas, and manually assign thresholds on that variable to define classes.

We then solve a smaller problem on the subset of observations where that first variable is inconclusive. By examining the conditional distributions of the other variables on that smaller subset, we can hope to find further clues to help partition the high and low crime observations.

The above diagram shows that nox is able to distinguish low crime suburbs when the variable is less than 0.48 and high crime suburbs when it is greater than 0.60. This leaves a smaller subset of points between 0.48 and 0.60 where further statistical analysis is required.
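
A quick tabulation, a sketch using the cut points read off the diagram, shows how the nox bands split the training observations:

# Band nox into Low (< 0.48), Mid, and Hi (> 0.60), then cross-tabulate with target
training %>%
  mutate(Idnox = cut(nox, c(-Inf, 0.48, 0.60, Inf), labels = c("Low", "Mid", "Hi"))) %>%
  count(Idnox, target)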

We look at that subpopulation in the following pairs-plot.

2.2 Trial and Error

Through trial and error, we determined that the following sequence of classification steps effectively identifies high and low crime groupings.

If nox < 0.48, assign to Low Crime. If nox > 0.60, assign to High Crime. In the intermediate region, we consider other variables:

  • If rad > 6, assign to High Crime.

  • If rad <= 6, use \(Tax2=(tax-300)^2\) as an artificial variable. If \(Tax2 > 5000\), assign to Low Crime.

  • Otherwise, if ptratio > 18, assign to High Crime; else assign to Low Crime.

The above thresholds are obtained from three simple observations. In the region of intermediate nox, the next best variable for discriminating high and low crime is rad. Surprisingly, there is a cluster of high crime suburbs with a tax rate near 300 and no low crime suburbs in the same vicinity; thus Tax2 is large when the tax rate is far from 300, and when Tax2 is large we assign the suburb to the Low Crime category. Lastly, ptratio discriminates by crime for the remaining subset of suburbs left unassigned by the Tax2 rule.
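
For clarity, here is the hand-built rule written as a plain R function; this is a sketch that mirrors the text above and is separate from the fitted models in the next section.

classify_suburb <- function(nox, rad, tax, ptratio) {
  if (nox < 0.48) return("Low Crime")
  if (nox > 0.60) return("High Crime")
  # Intermediate nox region: fall back on rad, Tax2, and ptratio
  if (rad > 6) return("High Crime")
  Tax2 <- (tax - 300)^2
  if (Tax2 > 5000) return("Low Crime")
  if (ptratio > 18) "High Crime" else "Low Crime"
}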

3 Model Building

3.1 A Complete Model

We consider a complete model keeping all of the original variables except chas, which the exploratory analysis identified as uninformative.

## 
## Call:
## glm(formula = target ~ zn + indus + nox + rm + age + dis + rad + 
##     tax + ptratio + lstat + medv, family = binomial(), data = training4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8538  -0.1411  -0.0014   0.0026   3.4667  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -39.433599   6.470184  -6.095 1.10e-09 ***
## zn           -0.072337   0.034688  -2.085  0.03704 *  
## indus        -0.052045   0.045781  -1.137  0.25561    
## nox          47.332410   7.706465   6.142 8.15e-10 ***
## rm           -0.611244   0.721987  -0.847  0.39721    
## age           0.034866   0.013752   2.535  0.01123 *  
## dis           0.716434   0.228385   3.137  0.00171 ** 
## rad           0.716867   0.160848   4.457 8.32e-06 ***
## tax          -0.006894   0.002895  -2.381  0.01726 *  
## ptratio       0.377862   0.123881   3.050  0.00229 ** 
## lstat         0.053818   0.053060   1.014  0.31045    
## medv          0.183465   0.068417   2.682  0.00733 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 193.52  on 454  degrees of freedom
## AIC: 217.52
## 
## Number of Fisher Scoring iterations: 9

3.2 A Model with Selected Variables

This model uses the newly created variables described in the previous section. We observe that all of the coefficients, with the exception of the intercept, are significant.

## 
## Call:
## glm(formula = target ~ nox + rad + Tax2 + Idnox + Idrad + Idptratio, 
##     family = binomial(), data = training4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9228  -0.0198  -0.0020   0.1438   4.1770  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.321e+00  5.976e+00  -0.723 0.469662    
## nox           2.085e+01  8.341e+00   2.499 0.012444 *  
## rad           5.530e-01  2.224e-01   2.486 0.012913 *  
## Tax2         -9.673e-05  3.244e-05  -2.982 0.002865 ** 
## IdnoxLow     -1.033e+01  2.602e+00  -3.972 7.13e-05 ***
## IdnoxMid     -4.278e+00  1.283e+00  -3.333 0.000858 ***
## IdradLow     -4.827e+00  1.974e+00  -2.445 0.014487 *  
## IdptratioLow -2.413e+00  6.424e-01  -3.757 0.000172 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 158.98  on 458  degrees of freedom
## AIC: 174.98
## 
## Number of Fisher Scoring iterations: 9

3.3 A Univariate Model

Finally, as a simple baseline, we fit a model on nox alone, the single most discriminating variable.

## 
## Call:
## glm(formula = target ~ nox, family = binomial(), data = training4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2456  -0.3759  -0.1675   0.3707   2.5576  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -15.892      1.449  -10.97   <2e-16 ***
## nox           29.375      2.707   10.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 292.01  on 464  degrees of freedom
## AIC: 296.01
## 
## Number of Fisher Scoring iterations: 6

4 Model Selection and Evaluation

We prefer the second model, built with the newly created variables. The fact that Idnox and Tax2 are significant and help to discriminate the high and low crime areas weighs in its favor. The first model, while larger, does not produce a sufficiently smaller deviance to justify its extra parameters (AIC 217.5 versus 175.0 for the second model).
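
To quantify the comparison, here is a minimal sketch of in-sample classification metrics for the selected model, using a 0.5 probability cutoff; it assumes the objects mod2 and training4 from the appendix are in scope.

probs <- predict(mod2, training4, type = "response")
pred_class <- ifelse(probs > 0.5, 1, 0)
# Confusion matrix of predicted vs. actual class, and overall accuracy
(conf_matrix <- table(predicted = pred_class, actual = training4$target))
sum(diag(conf_matrix)) / sum(conf_matrix)

The values below are the selected model's predictions for the evaluation data on the link (log-odds) scale; positive values correspond to a High Crime classification.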

##            1            2            3            4            5 
## -11.33750564  -0.00238215  -0.00238215  -0.00238215  -0.30035961 
##            6            7            8            9           10 
##  -0.81162127  -0.81162127  -8.73437206  -8.73437206 -11.56818327 
##           11           12           13           14           15 
##  -1.42053054  -1.42053054   4.25744866   2.79023041   2.79023041 
##           16           17           18           19           20 
##  -1.07035870   3.91479698   3.97734074  -1.88557511 -13.02539674 
##           21           22           23           24           25 
## -11.23965481  -1.53223866   0.12589732  -0.39913762  -0.39913762 
##           26           27           28           29           30 
##  -0.48282677 -10.65362265  10.96309871   9.14932979   3.89185482 
##           31           32           33           34           35 
##  11.42175292  11.42175292  11.42175292  11.42175292  11.42175292 
##           36           37           38           39           40 
##  10.85885911  10.85885911   9.64967983   1.28723661  -0.99750277
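
If 0/1 class labels are required for the evaluation file, a minimal sketch converting these log-odds, using the base R inverse logit plogis:

# Convert log-odds to probabilities, then threshold at 0.5
eval_probs <- plogis(predictions)
eval_class <- as.integer(eval_probs > 0.5)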

5 Appendix

In this section, we provide all the code used in the assignment.

library(tufte)
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(skimr)
library(GGally)
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)


# Code Block: importing-csv
# Sets the working folder and imports the CSV data files

working_dir = "/Volumes/DATA/dat/ang/datascience/621_DATA_MINING/HW3"
setwd(working_dir)

training_file = "crime-training-data_modified.csv"

# The training data set
training = read.csv(training_file)

# The test data set
testing_file = "crime-evaluation-data_modified.csv"

testing = read.csv(testing_file)

#
# Use the skim library to display summary statistics in a rotated orientation.
#
skim(training) %>% focus(n_missing, numeric.mean, numeric.sd, numeric.p50 , numeric.p100) %>% 
  rename(missing = n_missing, mean = numeric.mean, 
         stdev=numeric.sd, median=numeric.p50, max=numeric.p100) %>%
  kable(caption="Training Data", digits=1) %>%
  kable_styling(latex_options = c("HOLD_position"))


variables  = c("zn", "indus", "chas", "nox", 
               "rm", "age", "dis", "rad", 
               "tax", "ptratio","lstat", "medv")

crime_impact = c("Lower", "Higher", "Lower", "Higher",
                 "Lower", "Higher" , "Lower", "Lower" ,
                 "Lower", "Higher" , "Higher", "Lower" )

comments = c("Proportion of a town's land zoned for lots greater than 25,000 square feet.  Large lots are expected to exclude cheaper, smaller houses and to have wealthier owners.",
             "Proportion of non-retail business acres per suburb, associated with industry, noise, heavy traffic and unpleasant visual effects.", 
             "Charles River dummy variable: = 1 if the suburb borders the river, = 0 otherwise.  chas proxies the amenities of a riverside location; a wealthier suburb has less crime.",
             "Concentration of nitrogen oxides in the atmosphere.  nox is pollution, so an increase in nox might make the neighborhood less appealing and more crime-ridden.",
             "Average number of rooms in a house, a proxy for wealth.  An increase in rooms could be associated with lower crime.",
             "Proportion of owner-occupied units built before 1940.  Unit age is related to structure quality; an older unit should be associated with less affluence.",
             "Weighted distance to five employment centers.  Closer proximity to an urban center should be associated with more crime, if crime concentrates in urban centers.",
             "Index of access to radial highways, measuring proximity to urban centers and ease of access.  Easier access may also allow criminals to travel in.",
             "Higher taxes should provide more resources to the suburb but could also reduce house values due to the higher tax burden.  Both directions are plausible, but we argue higher taxes may fund more policing.",
             "A low pupil-teacher ratio per school district implies greater wealth and is a proxy for school quality.  Thus an increase in the ratio should imply a higher crime rate.",
             "A higher proportion of lower-status residents (less educated, less skilled workers) could imply a higher crime rate because of a greater level of poverty.",
             "A higher median property value might imply the neighborhood has a lower crime rate.")

expectations = data.frame(Variable = variables, Impact = crime_impact, Comments = comments )

expectations %>% kable(caption="Expected Impact of Increase in Variable on Crime Rate", booktabs=TRUE) %>%
  kable_styling(latex_options = c("HOLD_position"), full_width = T, bootstrap_options = c("bordered", "striped")) %>%
  column_spec(1, width=".8in") %>%
  column_spec(2, width=".8in")



ggcorr(training, nbreaks=5 , label=TRUE)

#
# In order to view all of the fields simultaneously in boxplot, 
# we pivot the data to a longer skinny format.
# The column names of the training data are converted to values 
# of new columns except for column **target**.
# The following transformation pivots the data to allow us to do boxplots.
df.m <- pivot_longer(training, -target, names_to ="field", values_to = "val")

df.m %>% mutate( hicrime = ifelse(target==1, "High", "Low") ) -> df2

# We also add the readable high/low crime label to the training data for later plots:
# -----------------------------------------------------------------------------
training %>% mutate(hicrime = ifelse(target==1, "High", "Low")) -> training2


ggplot(data=df2, aes(x=field, y = val)) + 
  geom_boxplot( aes(fill=hicrime)) + 
  facet_wrap( ~field, ncol=3, scales = "free")
ggplot(data=df2, aes(x=val, fill=hicrime)) + geom_histogram()  + facet_wrap( ~field, scales="free")


# Pairs plot of the most promising predictors, colored by crime level
training2 %>% select(nox, dis, rad, tax, indus, ptratio, hicrime) %>% 
  ggpairs(., mapping=ggplot2::aes(color=hicrime) )
                      
library(gridExtra)


training2 %>% ggplot(aes(x=nox, y=indus, color=hicrime)) + geom_point() + scale_color_manual(values=c("High"="red", "Low"="blue")) -> p1

training2 %>% ggplot(aes(x=nox, fill=hicrime)) + geom_histogram() ->p2

grid.arrange(p1, p2, nrow=1)

training2 %>% mutate( Idnox = ifelse(nox > .6, "Hi", ifelse( nox < .48, "Low", "Mid"))) -> training3

training3 %>% filter( Idnox=="Mid") %>% select( dis, rad, tax, indus, ptratio, hicrime) %>% 
  ggpairs(., mapping=ggplot2::aes(color=hicrime) )
training3 %>% mutate( Idrad = ifelse(rad > 6, "High", "Low"),
                      Tax2 = (tax - 300)^2 ,
                      Idptratio = ifelse(ptratio > 18, "High", "Low")
                      ) -> training4


# for mmps (marginal model plots)
library(alr3)

lm1 = glm( target ~ zn + indus + nox + rm + age + dis + rad + tax + ptratio + lstat + medv, 
           family=binomial(), 
           data=training4 )

summary(lm1)

mmps(lm1)
mod2 = glm( target ~ nox + rad + Tax2 + Idnox + Idrad + Idptratio, family=binomial(), data=training4 )

summary(mod2)


mod3 = glm( target ~ nox , family = binomial(), data=training4)

summary(mod3)


testing  %>% mutate( Idnox = ifelse(nox > .6, "Hi", ifelse( nox < .48, "Low", "Mid"))) -> testing3
testing3 %>% mutate( Idrad = ifelse(rad > 6, "High", "Low"),
                      Tax2 = (tax - 300)^2 ,
                      Idptratio = ifelse(ptratio > 18, "High", "Low")
                      ) -> testing4

# Predictions from predict() are on the link (log-odds) scale by default
(predictions = predict(mod2, testing4))
# write.csv produces a genuine CSV file, matching the file extension
write.csv(predictions, "evaluation_predictions.csv")