This assignment explores, analyzes and models a data set containing information on crime and demographic details of the neighborhoods of a city. Each record show a target variable indicating whether the crime rate is above(1) or below(0) the city median crime rate. We build a set of binary logistic regression models using only the provided data. We also generate and evaluate the models on a test data set.
The paper is organized as follows. In the first section below, we conduct exploratory data analysis of the variables. In the second section, we prepare data transformations needed to build useful secondary variables. In the third section, we construct three different binary logistic regression models. In the fourth section, we select a model and use classification metrics to assess the optimal choice. The appendix contains the full description of the code.
The exploratory data analysis proceeds as follows.
We first consider summary statistics of the data file and look for missing or exceptional values. Then we evaluate the individual variables systematically in relation to the target variable (high vs. low crime rate). Each variable will be explored in evaluated for relevance to the regression models.
Lastly, we will make observations that allow us to explore data transformations in the next section.
We load the data from a flat file and display its summary statistics below.
The training data set has 13 columns and 466 observations. The testing data has 12 columns on 40 observations. The omitted variable is obviously target.
The table below shows that the observations have no missing data values in the column named missing. This simplifies our data wrangling efforts immensely.
| skim_type | skim_variable | missing | mean | stdev | median | max |
|---|---|---|---|---|---|---|
| numeric | zn | 0 | 11.6 | 23.4 | 0.0 | 100.0 |
| numeric | indus | 0 | 11.1 | 6.8 | 9.7 | 27.7 |
| numeric | chas | 0 | 0.1 | 0.3 | 0.0 | 1.0 |
| numeric | nox | 0 | 0.6 | 0.1 | 0.5 | 0.9 |
| numeric | rm | 0 | 6.3 | 0.7 | 6.2 | 8.8 |
| numeric | age | 0 | 68.4 | 28.3 | 77.2 | 100.0 |
| numeric | dis | 0 | 3.8 | 2.1 | 3.2 | 12.1 |
| numeric | rad | 0 | 9.5 | 8.7 | 5.0 | 24.0 |
| numeric | tax | 0 | 409.5 | 167.9 | 334.5 | 711.0 |
| numeric | ptratio | 0 | 18.4 | 2.2 | 18.9 | 22.0 |
| numeric | lstat | 0 | 12.6 | 7.1 | 11.4 | 38.0 |
| numeric | medv | 0 | 22.6 | 9.2 | 21.2 | 50.0 |
| numeric | target | 0 | 0.5 | 0.5 | 0.0 | 1.0 |
Next, let’s considering our expectations of how each variable should impact the target.
| Variable | Impact | Comments |
|---|---|---|
| zn | Lower | Proportion of a town’s land zoned for lots greater than 25000 square feet. Large lots are expected to exclude cheaper smaller houses and have wealthier owners. |
| indus | Higher | Proportion of non-retail business acres per suburb associated with industry, noise, heavy draft and unpleasant visual effects. |
| chas | Lower | Charles River dummy variable = 1 if suburb borders the River, and =0 otherwise. chas means the amenities of a riverside location. A wealthier suburb has less crime. |
| nox | Higher | nox is the concentration of nitrogen oxide in the atmosphere. nox is pollution so an increase in nox might make the neighborhood less appealing and more crime ridden |
| rm | Lower | average number of rooms in a house is a proxy of wealth. An increase in rooms could be associated with lower crime. |
| age | Higher | the proportion of owner-occuplied units built before 1940. Unit age is related to structure quality. An older unit should be associated with less affluence |
| dis | Lower | weighted distance to five employment centers. Closer proximity to an urban center should be associated with more crime if crime is associated with urban centers. |
| rad | Lower | index of access to radial highways measures proximity to urban centers and ease of access. Easier access may allow criminals to travel in as well. |
| tax | Lower | Higher taxes should provide more resources to the suburb but could also reduce house values due to the higher tax burden. Both directions are plausible but I argue higher taxes may provide more policing |
| ptratio | Higher | A low Pupil-teacher ratio per school district implies greater wealth and is a proxy for school quality. Thus an increase in the ratio should imply higher crime rate. |
| lstat | Higher | A higher proportion of low status (less educated and less skilled workers) residents could imply a higher crime rate because of a greater level of poverty. |
| medv | Lower | A higher median property value might imply the neighborhood has a lower crime rate. |
The above table uses but departs in some of its conclusions from the table IV of the paper by Harrison and Rubinfeld (1978) from which the original study was conducted.
After stating our prior expectations, we can use correlations, grouped box-plots and histograms to evaluate the relevance of each variable with respect to the crime rate target.
Correlations
The above matrix suggests that low absolute value of the correlation of a variable to target is less useful in a predictive model. However, variables closely correlated to other predictors may still be helpful either as proxies or as candidates for removal.
The dark red and dark blue cells of the correlation matrix hold the most interest for us because they represent the highest and lowest correlations. Based on their strong positive correlation to target, we conclude that lstat, tax, rad, dis, age, nox, indus are the most useful variables.
It is interesting for example that dis is negatively correlated to target suggesting that distant suburbs are safer than urban areas. Likewise, nox is positively correlated to target which suggests that pollution is an indicator of a higher crime rate.
However, looking at non-target correlations, we observe some interesting relationships too.
tax and rad are highly correlated. This suggests that proximity to roads is highly taxed. rm and medv are highly correlated. This suggests a common underlying wealth effect.
nox and indus are highly correlated suggesting that they may serve as a proxy for each other. nox and dis are strongly negative correlated. Further distance from the urban center means less air pollution.
However, these observations cannot stand alone. They require more granular analysis to make a decision on variable selection or preference in model building. So we consider their conditional distributions.
Conditioning on the values of target, we examine the boxplots and histograms of the other variables. In order to view all of the fields simultaneously in boxplot, we pivot the data to a longer skinny format. The column names of the training data are converted to values of new columns except for column target.
Box Plot of each variable, colored by target
In the above grouped box plots, we show the distribution of each variable conditional on the value of target. (In the chart, we name it as hicrime for technical reasons of how ggplot works.)
Some further conclusions can be drawn from the boxplots. The following variables are clearly useful:
dis and nox are clearly useful. Their conditional distributions are clearly different without very large tails.
tax is interesting and complicated by the large tail.
ptratio shows some differentiation in the expected direction.
The following variables are not useful:
The following variables could be useful:
age shows a large left tail for High crime observations. Despite the clear difference in medians, the overlap in values between the High crime and Low crime boxplots make render age less useful.
zn, indus, lstat. The medians of the conditional distributions differ as expected.
Next, we examine the histograms.
Histogram of raw data variables
To prepare the dataset and apply transformations that may improve model accuracy, we begin with a top-down approach. We choose the most promising variable with the highest explanatory power to discriminate hi and low crime areas. Then we manually assign thresholds using that variable to define classes.
We then solve a smaller problem on that subset observations where that first variable is inconclusive. By examining the conditional distribution of the other variables on that smaller subset, we can hope to find other clues to help partition the high and low crime observation.
The above diagram shows that nox is able to distinguish low crime suburbs when the variable is less than .48 and high crime areas when the variable is greater than 0.6. This leaves a smaller subset of points between 0.48 and 0.60 where further statistical analysis is required.
We look at that subpopulation in the following pairs-plot.
Through trial and error, we determine the following sequence of classification steps seem to effectively identify high and low crime groupings.
If nox < 0.48, then assign to Low Crime. If nox > 0.60, then assign to High Crime. In the intermediate region, we consider other variables:
If rad > 6 then assign to High Crime. If rad <= 6 then use \(Tax2=(Tax-300)^2\) as an artificial variable. If \(Tax2 > 5000\) then we assign to Low Crime. Otherwise if ptratio > 18 then assign to High Crime Else assign to Low Crime
By the above thresholds are obtained from 3 simple observations: In the region of intermediate nox level, the next best variable for discriminating high and low crime is rad. Surprisingly, there is a cluster of high crime suburbs with a tax rate of 300 with no low crime suburbs in the same vicinity. Thus, the variable Tax2 is large when the tax rate is far from 300. When Tax2 is large, then we assign the suburb to a Low Crime category. Lastly, ptratio seems to discriminate by Crime for those remaining subset of suburbs unassigned by Tax2 value.
We consider a complete model keeping all the variables except those identified from the original model.
library(alr3)
lm1 = glm( target ~ zn + indus + nox + rm + age + dis + rad + tax + ptratio + lstat + medv,
family=binomial(),
data=training4 )
summary(lm1)##
## Call:
## glm(formula = target ~ zn + indus + nox + rm + age + dis + rad +
## tax + ptratio + lstat + medv, family = binomial(), data = training4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8538 -0.1411 -0.0014 0.0026 3.4667
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.433599 6.470184 -6.095 1.10e-09 ***
## zn -0.072337 0.034688 -2.085 0.03704 *
## indus -0.052045 0.045781 -1.137 0.25561
## nox 47.332410 7.706465 6.142 8.15e-10 ***
## rm -0.611244 0.721987 -0.847 0.39721
## age 0.034866 0.013752 2.535 0.01123 *
## dis 0.716434 0.228385 3.137 0.00171 **
## rad 0.716867 0.160848 4.457 8.32e-06 ***
## tax -0.006894 0.002895 -2.381 0.01726 *
## ptratio 0.377862 0.123881 3.050 0.00229 **
## lstat 0.053818 0.053060 1.014 0.31045
## medv 0.183465 0.068417 2.682 0.00733 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 193.52 on 454 degrees of freedom
## AIC: 217.52
##
## Number of Fisher Scoring iterations: 9
This model uses the newly created variables described in the previous section. We observe that all of the variables with the exception of IdptratioLow are significant.
mod2 = glm( target ~ nox + rad + Tax2 + Idnox + Idrad + Idptratio, family=binomial(), data=training4 )
summary(mod2)##
## Call:
## glm(formula = target ~ nox + rad + Tax2 + Idnox + Idrad + Idptratio,
## family = binomial(), data = training4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9228 -0.0198 -0.0020 0.1438 4.1770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.321e+00 5.976e+00 -0.723 0.469662
## nox 2.085e+01 8.341e+00 2.499 0.012444 *
## rad 5.530e-01 2.224e-01 2.486 0.012913 *
## Tax2 -9.673e-05 3.244e-05 -2.982 0.002865 **
## IdnoxLow -1.033e+01 2.602e+00 -3.972 7.13e-05 ***
## IdnoxMid -4.278e+00 1.283e+00 -3.333 0.000858 ***
## IdradLow -4.827e+00 1.974e+00 -2.445 0.014487 *
## IdptratioLow -2.413e+00 6.424e-01 -3.757 0.000172 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 158.98 on 458 degrees of freedom
## AIC: 174.98
##
## Number of Fisher Scoring iterations: 9
##
## Call:
## glm(formula = target ~ nox, family = binomial(), data = training4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2456 -0.3759 -0.1675 0.3707 2.5576
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.892 1.449 -10.97 <2e-16 ***
## nox 29.375 2.707 10.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 292.01 on 464 degrees of freedom
## AIC: 296.01
##
## Number of Fisher Scoring iterations: 6
We prefer to use the 2nd model with newly created variables. The fact that Idnox and Tax2 are significant and help to discriminate the high and low crime areas is significant. The first model, while larger, does not produce a sufficient small deviance compared to the 2nd model.
testing %>% mutate( Idnox = ifelse(nox > .6, "Hi", ifelse( nox < .48, "Low", "Mid"))) -> testing3
testing3 %>% mutate( Idrad = ifelse(rad > 6, "High", "Low"),
Tax2 = (tax - 300)^2 ,
Idptratio = ifelse(ptratio > 18, "High", "Low")
) -> testing4
(predictions = predict(mod2, testing4))## 1 2 3 4 5
## -11.33750564 -0.00238215 -0.00238215 -0.00238215 -0.30035961
## 6 7 8 9 10
## -0.81162127 -0.81162127 -8.73437206 -8.73437206 -11.56818327
## 11 12 13 14 15
## -1.42053054 -1.42053054 4.25744866 2.79023041 2.79023041
## 16 17 18 19 20
## -1.07035870 3.91479698 3.97734074 -1.88557511 -13.02539674
## 21 22 23 24 25
## -11.23965481 -1.53223866 0.12589732 -0.39913762 -0.39913762
## 26 27 28 29 30
## -0.48282677 -10.65362265 10.96309871 9.14932979 3.89185482
## 31 32 33 34 35
## 11.42175292 11.42175292 11.42175292 11.42175292 11.42175292
## 36 37 38 39 40
## 10.85885911 10.85885911 9.64967983 1.28723661 -0.99750277
In this section, we provide all the code used in the assignment.
library(tufte)
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(skimr)
library(GGally)
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)
# Code Block: importing-csv
# Sets the working folder and imports the code into CSV
working_dir = "/Volumes/DATA/dat/ang/datascience/621_DATA_MINING/HW3"
setwd(working_dir)
training_file = "crime-training-data_modified.csv"
# The training data set
training = read.csv(training_file)
# The test data set
testing_file = "crime-evaluation-data_modified.csv"
testing = read.csv(testing_file)
#
# Use the skim library to display summary statistics in a rotated orientation.
#
skim(training) %>% focus(n_missing, numeric.mean, numeric.sd, numeric.p50 , numeric.p100) %>%
rename(missing = n_missing, mean = numeric.mean,
stdev=numeric.sd, median=numeric.p50, max=numeric.p100) %>%
kable(caption="Training Data", digits=1) %>%
kable_styling(latex_options = c("HOLD_position"))
variables = c("zn", "indus", "chas", "nox",
"rm", "age", "dis", "rad",
"tax", "ptratio","lstat", "medv")
crime_impact = c("Lower", "Higher", "Lower", "Higher",
"Lower", "Higher" , "Lower", "Lower" ,
"Lower", "Higher" , "Higher", "Lower" )
comments = c("Proportion of a town's land zoned for lots greater than 25000 square feet. Large lots are expected to exclude cheaper smaller houses and have wealthier owners.",
"Proportion of non-retail business acres per suburb associated with industry, noise, heavy draft and unpleasant visual effects.",
"Charles River dummy variable = 1 if suburb borders the River, and =0 otherwise. chas means the amenities of a riverside location. A wealthier suburb has less crime.",
"nox is the concentration of nitrogen oxide in the atmosphere. nox is pollution so an increase in nox might make the neighborhood less appealing and more crime ridden",
"average number of rooms in a house is a proxy of wealth. An increase in rooms could be associated with lower crime.",
"the proportion of owner-occuplied units built before 1940. Unit age is related to structure quality. An older unit should be associated with less affluence",
"weighted distance to five employment centers. Closer proximity to an urban center should be associated with more crime if crime is associated with urban centers.",
"index of access to radial highways measures proximity to urban centers and ease of access. Easier access may allow criminals to travel in as well.",
"Higher taxes should provide more resources to the suburb but could also reduce house values due to the higher tax burden. Both directions are plausible but I argue higher taxes may provide more policing",
"A low Pupil-teacher ratio per school district implies greater wealth and is a proxy for school quality. Thus an increase in the ratio should imply higher crime rate.",
"A higher proportion of low status (less educated and less skilled workers) residents could imply a higher crime rate because of a greater level of poverty.",
"A higher median property value might imply the neighborhood has a lower crime rate.")
expectations = data.frame(Variable = variables, Impact = crime_impact, Comments = comments )
expectations %>% kable(caption="Expected Impact of Increase in Variable on Crime Rate", booktabs=TRUE) %>%
kable_styling(latex_options = c("HOLD_position"), full_width = T, bootstrap_options = c("bordered", "striped")) %>%
column_spec(1, width=".8in") %>%
column_spec(2, width=".8in")
ggcorr(training, nbreaks=5 , label=TRUE)
#
# In order to view all of the fields simultaneously in boxplot,
# we pivot the data to a longer skinny format.
# The column names of the training data are converted to values
# of new columns except for column **target**.
# The following transformation pivots the data to allow us to do boxplots.
df.m <- pivot_longer(training, -target, names_to ="field", values_to = "val")
df.m %>% mutate( hicrime = ifelse(target==1, "High", "Low") ) -> df2
# We also apply the same analysis to the test data:
# -----------------------------------------------------------------------------
training %>% mutate(hicrime = ifelse(target==1, "High", "Low")) -> training2
ggplot(data=df2, aes(x=field, y = val)) +
geom_boxplot( aes(fill=hicrime)) +
facet_wrap( ~field, ncol=3, scales = "free")
ggplot(data=df2, aes(x=val, fill=hicrime)) + geom_histogram() + facet_wrap( ~field, scales="free")
# Show the ggpairs functionality
training2 %>% select(nox, dis, rad, tax, indus, ptratio, hicrime) %>%
ggpairs(., mapping=ggplot2::aes(color=hicrime) )
library(gridExtra)
training2 %>% ggplot(aes(x=nox, y=indus, color=hicrime)) + geom_point() + scale_color_manual(breaks=c("Hi", "Lo"), values=c("red","blue")) -> p1
training2 %>% ggplot(aes(x=nox, fill=hicrime)) + geom_histogram() ->p2
grid.arrange(p1, p2, nrow=1)
training2 %>% mutate( Idnox = ifelse(nox > .6, "Hi", ifelse( nox < .48, "Low", "Mid"))) -> training3
training3 %>% filter( Idnox=="Mid") %>% select( dis, rad, tax, indus, ptratio, hicrime) %>%
ggpairs(., mapping=ggplot2::aes(color=hicrime) )
training3 %>% mutate( Idrad = ifelse(rad > 6, "High", "Low"),
Tax2 = (tax - 300)^2 ,
Idptratio = ifelse(ptratio > 18, "High", "Low")
) -> training4
library(alr3)
lm1 = glm( target ~ zn + indus + nox + rm + age + dis + rad + tax + ptratio + lstat + medv,
family=binomial(),
data=training4 )
summary(lm1)
mmps(lm1)
mod2 = glm( target ~ nox + rad + Tax2 + Idnox + Idrad + Idptratio, family=binomial(), data=training4 )
summary(mod2)
mod3 = glm( target ~ nox , family = binomial(), data=training4)
summary(mod3)
testing %>% mutate( Idnox = ifelse(nox > .6, "Hi", ifelse( nox < .48, "Low", "Mid"))) -> testing3
testing3 %>% mutate( Idrad = ifelse(rad > 6, "High", "Low"),
Tax2 = (tax - 300)^2 ,
Idptratio = ifelse(ptratio > 18, "High", "Low")
) -> testing4
(predictions = predict(mod2, testing4))
write(predictions, "evaluation_predictions.csv")