Introduction

The following is a description of the task at hand as given by Marcus Ellis.

In this homework assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

  1. zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)
  2. indus: proportion of non-retail business acres per suburb (predictor variable)
  3. chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable)
  4. nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)
  5. rm: average number of rooms per dwelling (predictor variable)
  6. age: proportion of owner-occupied units built prior to 1940 (predictor variable)
  7. dis: weighted mean of distances to five Boston employment centers (predictor variable)
  8. rad: index of accessibility to radial highways (predictor variable)
  9. tax: full-value property-tax rate per $10,000 (predictor variable)
  10. ptratio: pupil-teacher ratio by town (predictor variable)
  11. black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (predictor variable)
  12. lstat: lower status of the population (percent) (predictor variable)
  13. medv: median value of owner-occupied homes in $1000s (predictor variable)
  14. target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)

Libraries

For this analysis, we will use tidyverse libraries including ggplot2, dplyr, and tidyr.

We will also use the caret and pROC packages.
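
The setup chunk looks something like this (assuming the packages are already installed):

library(ggplot2) # plotting
library(dplyr)   # data manipulation
library(tidyr)   # reshaping
library(caret)   # confusionMatrix and related model statistics
library(pROC)    # ROC curves and AUC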

Data exploration

Basic data exploration

First, we will read in the data, available at my Github here:

https://raw.githubusercontent.com/heathergeiger/Data621_hw3/master/crime-training-data.csv

Once the data is read in, we check the number of non-NA values for each numeric variable.

For the binary variables, we run the table function to check the number of 0's vs. 1's.
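
A sketch of this step (the exact code used is in the R script linked in the code appendix):

# Read in the training data directly from Github.
crime <- read.csv("https://raw.githubusercontent.com/heathergeiger/Data621_hw3/master/crime-training-data.csv")

print("Total number of records in data:")
print(nrow(crime))

# Count non-NA values for each numeric (non-binary) variable.
numeric_vars <- setdiff(colnames(crime), c("chas", "target"))
non_na_counts <- data.frame(Variable = numeric_vars,
                            n = sapply(crime[, numeric_vars], function(x) sum(!is.na(x))))
print("Number of non-NA values per variable in data:")
print(non_na_counts[order(non_na_counts$Variable), ])

# For the binary variables, tabulate 0 vs. 1.
print("Frequency of variable chas 0 vs. 1:")
print(table(crime$chas))
print("Frequency of variable target 0 vs. 1:")
print(table(crime$target))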

## [1] "Total number of records in data:"
## [1] 466
## [1] "Number of non-NA values per variable in data:"
##    Variable   n
## 1       age 466
## 2     black 466
## 3       dis 466
## 4     indus 466
## 5     lstat 466
## 6      medv 466
## 7       nox 466
## 8   ptratio 466
## 9       rad 466
## 10       rm 466
## 11      tax 466
## 12       zn 466
## [1] "Frequency of variable chas 0 vs. 1:"
## 
##   0   1 
## 433  33
## [1] "Frequency of variable target 0 vs. 1:"
## 
##   0   1 
## 237 229

We find that there are no NAs in the numeric or factor variables, so we do not need to worry about filling in any missing values.

Next, let’s check the number of unique values per variable.
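
Something along these lines:

# Count unique values per variable and sort from fewest to most.
uniques <- data.frame(Variable = colnames(crime),
                      Num.uniques = sapply(crime, function(x) length(unique(x))))
uniques[order(uniques$Num.uniques), ]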

##    Variable Num.uniques
## 3      chas           2
## 12   target           2
## 10      rad           9
## 14       zn          26
## 9   ptratio          46
## 13      tax          63
## 5     indus          73
## 8       nox          79
## 7      medv         218
## 2     black         331
## 1       age         333
## 4       dis         380
## 11       rm         419
## 6     lstat         424

We find that several variables appear to have many fewer unique values than the number of records.

Deeper exploration of repeated values within variables

We saw in particular that variables rad, zn, ptratio, tax, indus, and nox had a lot of repeated values.

Let’s check how many unique combinations we get if we look at combinations of these variables.
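
A sketch using dplyr's distinct():

print("Number of unique combinations of rad/tax/indus/ptratio/zn:")
print(nrow(distinct(crime, rad, tax, indus, ptratio, zn)))
print("Number of unique combinations of rad/tax/indus/ptratio/zn/nox:")
print(nrow(distinct(crime, rad, tax, indus, ptratio, zn, nox)))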

## [1] "Number of unique combinations of rad/tax/indus/ptratio/zn:"
## [1] 75
## [1] "Number of unique combinations of rad/tax/indus/ptratio/zn/nox:"
## [1] 95

Now, let's look in more detail at why there are more unique combinations when we include nox.

##    Rad.tax.indus.ptratio.zn Num.separate.values.nox
## 18       24/666/18.1/20.2/0                      18
## 55         5/264/3.97/13/20                       2
## 66       5/403/19.58/14.7/0                       2
## 75         8/307/6.2/17.4/0                       2

For nox, many different values are associated with the single combination where rad = 24. We also find two distinct nox values each for three other combinations of rad/tax/indus/ptratio/zn.

Hypothesis to explain multi-neighborhood groups

My hypothesis is that neighborhoods that share a common value for rad, tax, indus, ptratio, and zn are all part of the same municipality.

I will proceed based on this hypothesis for the remainder of this analysis.

Proportion of high vs. low-crime neighborhoods per municipality

First, let’s print details on the number of neighborhoods per municipality.

## [1] "Number of neighborhoods per municipality:"
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  14  16  19  28 121 
##  15  14  11   9   5   4   4   1   3   1   2   1   1   1   1   1   1

Now check the number of high vs. low-crime neighborhoods per municipality.

## [1] "Table for proportions more frequent neighborhood type:"
## 
## 0.6667    0.7 0.7143 0.7778    0.8 0.8182 0.9091 0.9286      1 
##      1      1      1      1      1      1      1      1     67

We find that most municipalities either have all high-crime or all low-crime neighborhoods rather than a mix.

We find that the number of neighborhoods in a municipality is strongly associated with being majority high-crime.

In fact, we find that in the four municipalities with at least 15 neighborhoods, all neighborhoods are high-crime.

Maybe we can use the municipality information to guide our model. To do this, though, we would want to be sure at least some of the same municipalities are in the evaluation data. Let’s check.

Read in the evaluation data from here:

https://raw.githubusercontent.com/heathergeiger/Data621_hw3/master/crime-evaluation-data.csv

## [1] "Number of records in evaluation:"
## [1] 40
## [1] "Number of records matching 121-neighborhood municipality from training:"
## [1] 11
## [1] "Number of records matching 28-neighborhood municipality from training:"
## [1] 2
## [1] "Number of records matching 19-neighborhood municipality from training:"
## [1] 3
## [1] "Number of records matching 16-neighborhood municipality from training:"
## [1] 2

It seems the evaluation data contains data from some of the same municipalities as the training data.

Summary of how to use municipality data based on this exploration

It seems the main use of the municipality data would be to match up each neighborhood to whether it is part of a municipality with many neighborhoods vs. only a few. A reasonable cutoff is to consider a municipality "large" if it has 15 or more neighborhoods in the training data. We will use variables rad, tax, indus, ptratio, and zn to determine which municipality each neighborhood belongs to, then create a binary variable "big.city" indicating whether or not the neighborhood is in a municipality with 15+ neighborhoods.

After this, we will discard all municipality-level variables and focus on neighborhood-level variables for the rest of the model.

Exploration of remaining variables

We still have a number of other variables to look at, which we are assuming were measured at the neighborhood rather than municipality level. To review, the variables left after removing municipality-level ones are:

  1. chas
  2. rm
  3. age
  4. dis
  5. black
  6. lstat
  7. medv

Variable “chas” is a factor variable, which we already saw earlier was 1 for 33/466 records.

Before we even look at the values of variable "black", we see from its definition that it is a transformed variable with an unusual relationship to the original quantity. Since the original variable is a proportion, it must be between 0 and 1. Plugging 1 into the equation 1000 * (x - 0.63)^2 gives 136.9, and we also get 136.9 when x (the proportion of black residents) is 0.26. When "black" is > 136.9 and <= 396.9, the two roots of the equation are an x below 0.26 and an x above 1; since a proportion cannot exceed 1, we can infer that the actual proportion is < 0.26. When "black" is <= 136.9, both roots fall between 0.26 and 1, so we know the actual proportion is >= 0.26, but its exact value is uncertain, with more uncertainty (a wider gap between the two roots) as the value gets higher.

Therefore, as long as the range of this variable in our data runs from 0 to 396.9, it seems reasonable to convert it to a binary indicating whether the proportion of black residents is < 0.26 or >= 0.26.

Let’s just check this.
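
A one-liner does the trick:

# Observed range of the untransformed "black" variable.
range(crime$black)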

## [1]   0.32 396.90

Good to go!

Now, let’s look at the distribution of rm, age, dis, lstat, and medv.

We find that dis, lstat, and medv are somewhat right-skewed, while age is left-skewed.

Initial transformations before further exploration

Before we can fully explore the correlation between variables, we really need to create our “big city” binary variable and convert “black” to binary. Let’s do that now.

Note that here we make black a value of 1 if the proportion of black residents is 26% or higher (i.e. if the original "black" value is at most 136.9).
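
A sketch of these two transformations (object and column names are illustrative; the exact code is in the appendix):

# Identify municipalities by their shared rad/tax/indus/ptratio/zn combination,
# then flag neighborhoods whose municipality has 15+ neighborhoods in training.
crime <- crime %>%
    group_by(rad, tax, indus, ptratio, zn) %>%
    mutate(big.city = ifelse(n() >= 15, 1, 0)) %>%
    ungroup()

# Convert "black" to binary: values <= 136.9 can only occur when the
# proportion of black residents is at least 0.26.
crime <- crime %>%
    mutate(black = ifelse(black <= 136.9, 1, 0)) %>%
    # Drop the municipality-level variables now that big.city captures that information.
    select(-rad, -tax, -indus, -ptratio, -zn, -nox)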

Correlations between predictor variables

To review, our new sets of predictor variables are:

  1. chas (binary)
  2. rm (numeric)
  3. age (numeric)
  4. dis (numeric)
  5. black (binary)
  6. lstat (numeric)
  7. medv (numeric)
  8. big.city (binary)

Let’s look at the correlations between all of these.

First, look at the co-occurrence of the three pairs of binary variables.

## [1] "Variables chas vs. black:"
##                        black
## chas                    Black.residents.26percent.plus
##   Borders.Charles                                    1
##   Doesnt.border.Charles                             32
##                        black
## chas                    Black.residents.under.26percent
##   Borders.Charles                                    32
##   Doesnt.border.Charles                             401
## [1] "Variables chas vs. big.city:"
##                        big_city
## chas                    Big.city Small.city
##   Borders.Charles             19         14
##   Doesnt.border.Charles      165        268
## [1] "Variables black vs. big.city:"
##                                  big_city
## black                             Big.city Small.city
##   Black.residents.26percent.plus        32          1
##   Black.residents.under.26percent      152        281

We find that nearly all neighborhoods with 26%+ black residents are part of a big city. And we already found that a neighborhood being part of a big city is a perfect predictor of a higher-than-median crime rate in this dataset. So the black variable does not really add much additional information and should probably be removed.

What about the association of bordering the Charles River and being part of a big city versus numeric variables?

We find that bordering the Charles River appears to be negatively associated with distance to employment centers, where neighborhoods that border the Charles are closer to jobs. It is positively associated with median home value.

We see noticeable differences between neighborhoods in small vs. big municipalities for pretty much all of the numeric variables.

Now, let's make scatterplots showing the correlations between pairs of numeric variables. Since we see such dramatic differences between big and small cities, let's plot them separately.
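
One way to sketch these plots, using base R's pairs():

# Scatterplot matrices of the numeric variables, separately for small and big municipalities.
pairs(subset(crime, big.city == 0)[, c("rm", "age", "dis", "lstat", "medv")],
      main = "Small municipalities")
pairs(subset(crime, big.city == 1)[, c("rm", "age", "dis", "lstat", "medv")],
      main = "Big municipalities")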

It seems that the relationship between lstat and medv is non-linear.

Perhaps log-transformation could be used to convert their relationship to linear?

Let’s use log2-transformation, as it is a bit easier to understand for the range of our data.

Add a pseudo-count of 1 to lstat so that values at or near zero do not produce undefined or extreme values after the transform.
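
A plotting sketch for this comparison:

# lstat (log2, with pseudo-count) vs. medv (log2), colored by big vs. small municipality.
ggplot(crime, aes(x = log2(lstat + 1), y = log2(medv), color = factor(big.city))) +
    geom_point(alpha = 0.5) +
    labs(color = "big.city")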

And the correlation of both as log2-transformed:

It seems like log-transforming these variables (especially lstat) may be a good idea before including them in the model.

Correlation of predictor variables with target

Let’s check if there appears to be an association between each predictor variable with the target by making some more boxplots.

Since 100% of neighborhoods in big cities have target = 1, use only small-city neighborhoods to create the boxplot.

For variable chas, make a simple contingency table.
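
A sketch of this table (the labels in the output below were prettified in the actual code; here the raw 0/1 codes are used):

# Contingency table of chas vs. target among small-city neighborhoods only.
small_cities <- subset(crime, big.city == 0)
print("As raw numbers:")
print(table(small_cities$chas, small_cities$target))
print("As proportions:")
print(round(prop.table(table(small_cities$chas, small_cities$target), margin = 1), 4))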

## [1] "As raw numbers:"
##                        target_small_cities
## chas_small_cities       Above.median.crime Below.median.crime
##   Borders.Charles                        2                 12
##   Doesnt.border.Charles                 43                225
## [1] "As proportions:"
##                        target_small_cities
## chas_small_cities       Above.median.crime Below.median.crime
##   Borders.Charles                   0.1429             0.8571
##   Doesnt.border.Charles             0.1604             0.8396

Bordering the Charles River does not appear to be associated with a neighborhood being high-crime within small municipalities. The proportion of high-crime neighborhoods in small municipalities that do border the Charles River is nearly identical to the proportion in small municipalities that do not.

Variables age, dis, and lstat appear correlated with whether or not a neighborhood within a small municipality has a high crime rate. We also see a correlation after log-transforming lstat.

Based on just these boxplots, rm and medv (whether log-transformed or not) do not appear very correlated with our target variable.

Let’s also try looking at a contingency table of target vs. a few different levels of these variables.

##                       
## rm_small_cities          0   1
##   large (>=6.5, < 7.5)  73   9
##   med (>= 5.5, < 6.5)  148  25
##   small (<5.5)           4   6
##   XL (>= 7.5)           12   5
##                       
## rm_small_cities             0      1
##   large (>=6.5, < 7.5) 0.8902 0.1098
##   med (>= 5.5, < 6.5)  0.8555 0.1445
##   small (<5.5)         0.4000 0.6000
##   XL (>= 7.5)          0.7059 0.2941
##                         
## medv_small_cities          0   1
##   high (>= 30)            49  12
##   low (< 15)               6   2
##   low-med (>= 15, < 20)   44  17
##   med (>= 20, < 25)      106  14
##   med-high (>= 25, < 30)  32   0
##                         
## medv_small_cities             0      1
##   high (>= 30)           0.8033 0.1967
##   low (< 15)             0.7500 0.2500
##   low-med (>= 15, < 20)  0.7213 0.2787
##   med (>= 20, < 25)      0.8833 0.1167
##   med-high (>= 25, < 30) 1.0000 0.0000

While the cutoffs we created for grouping were somewhat arbitrary, it appears that there may be something here.

Mainly, it appears that having a very low average number of rooms per dwelling is positively correlated with having a higher crime rate. However, once a neighborhood has even a modest average number of rooms, there does not seem to be much additional correlation as the number of rooms increases further.

The idea is similar for median home value. Having a very low median home value appears positively correlated with crime, but the relationship does not look linear once you get beyond a certain value.

Let's use density plots to get a better sense of where it might be good to place cutoffs, if we decide to convert these variables to binary.
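
Sketches of these density plots:

# Densities of rm and medv among small-city neighborhoods, split by target.
ggplot(subset(crime, big.city == 0), aes(x = rm, fill = factor(target))) +
    geom_density(alpha = 0.5) +
    labs(fill = "target")
ggplot(subset(crime, big.city == 0), aes(x = medv, fill = factor(target))) +
    geom_density(alpha = 0.5) +
    labs(fill = "target")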

It actually looks like 5.5 was a pretty good cutoff for rm.

For medv, 20 looks like a reasonable cutoff.

One last check - what will the correlation between rm and medv be if we convert to binary using these cutoffs?

##                      medv_binary
## rm_binary             Higher.value.home Lower.value.home
##   Larger.average.home               211               61
##   Small.average.home                  2                8

We find that the majority of neighborhoods with a small average number of rooms also have fairly low home values. This makes sense.

I think we now have a pretty good understanding of the data, and are ready to move on to the transformation step.

Data transformation

Summary of transformations already completed

To summarize, we have already run the following transformations.

  1. Create variable “big.city” that says whether a neighborhood is part of a large municipality or not.
  2. Remove municipality-level variables other than “big.city”.
  3. Convert variable “black” to binary.

Additional transformations to run

Before modeling, we should run the following additional transformations:

  1. Remove variable “black”.
  2. Remove variable “chas”, since it appears to be not at all correlated with the target in small cities.
  3. Convert variables “rm” and “medv” to binary based on cutoffs of 5.5 and 20 respectively. Switch names to small.rm (where 1 = rm < 5.5) and low.medv (where 1 = medv < 20).

Just to make things simpler later on (the model will work the same either way), let's also convert all binary variables except "target" to numeric, i.e. store them as the actual numbers 0 and 1 rather than as factors.
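
A sketch of these additional transformations (note that ifelse() already returns numeric 0/1, which also takes care of the binary-to-numeric conversion):

crime <- crime %>%
    select(-black, -chas) %>%                      # drop black and chas
    mutate(small.rm = ifelse(rm < 5.5, 1, 0),      # 1 = fewer than 5.5 rooms on average
           low.medv = ifelse(medv < 20, 1, 0)) %>% # 1 = median home value under $20,000
    select(-rm, -medv)                             # keep only the binary versions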

Modeling

From here on out, we will approach this analysis using two main approaches:

  1. Include variable big.city in the model, so that all neighborhoods in big cities will automatically be assigned essentially a 100% probability of high crime. The coefficients for the remaining variables will then be driven almost entirely by the data where big.city = 0.
  2. Exclude variable big.city from the model, instead modeling only with variables that do not require municipality information. This will result in a model that is less accurate on this particular dataset, but may generalize better to other datasets.

Within both approaches, we may use methods like forward or backward selection to choose which other predictor variables to include.

Approach 1 - Include big.city as a covariate.

Let’s start by simply modeling all variables including big.city.
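
In code, something like:

# Logistic regression on all remaining predictors, including big.city.
model_all <- glm(target ~ ., family = "binomial", data = crime)
summary(model_all)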

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8465  -0.3440  -0.0697   0.0000   3.3262  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -0.98101    1.27100  -0.772  0.44021   
## age            0.02166    0.01175   1.844  0.06521 . 
## dis           -0.52263    0.17706  -2.952  0.00316 **
## lstat         -0.02084    0.04877  -0.427  0.66923   
## big.city      21.22864 1233.69514   0.017  0.98627   
## small.rm       1.69383    0.92622   1.829  0.06744 . 
## low.medv       0.07922    0.48186   0.164  0.86941   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 188.00  on 459  degrees of freedom
## AIC: 202
## 
## Number of Fisher Scoring iterations: 19

Because big.city = 1 perfectly predicts target = 1, we have quasi-complete separation: the coefficient estimate for big.city is huge and its standard error even larger, so the Wald p-value is technically not significant despite target always equaling 1 when this variable equals 1.

We also find that lstat and low.medv have very large p-values. This makes some sense based on what we saw during data exploration. While lstat did seem somewhat correlated with target, the separation we saw on the boxplots was not huge. As for medv, the increase in the proportion of high-crime neighborhoods between high and low home-value groups was not especially strong either.

Let's just double-check whether the p-value of lstat changes substantially if we log-transform it. We will also exclude low.medv to minimize collinearity.

## 
## Call:
## glm(formula = target ~ age + dis + small.rm + big.city + lstat, 
##     family = "binomial", data = crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8419  -0.3442  -0.0685   0.0000   3.3314  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -1.02506    1.24062  -0.826  0.40866   
## age            0.02203    0.01150   1.916  0.05535 . 
## dis           -0.52132    0.17680  -2.949  0.00319 **
## small.rm       1.68754    0.92412   1.826  0.06783 . 
## big.city      21.23544 1233.91401   0.017  0.98627   
## lstat         -0.01749    0.04418  -0.396  0.69216   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 188.02  on 460  degrees of freedom
## AIC: 200.02
## 
## Number of Fisher Scoring iterations: 19
## 
## Call:
## glm(formula = target ~ age + dis + small.rm + big.city + log2(lstat + 
##     1), family = "binomial", data = crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8388  -0.3444  -0.0680   0.0000   3.3212  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       -0.73296    1.55738  -0.471  0.63790   
## age                0.02180    0.01138   1.916  0.05540 . 
## dis               -0.52483    0.17662  -2.972  0.00296 **
## small.rm           1.60839    0.83412   1.928  0.05382 . 
## big.city          21.22372 1232.84053   0.017  0.98626   
## log2(lstat + 1)   -0.13146    0.36188  -0.363  0.71641   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 188.05  on 460  degrees of freedom
## AIC: 200.05
## 
## Number of Fisher Scoring iterations: 19

Either way, lstat is not significant. Let's now show a model excluding both lstat and low.medv.

## 
## Call:
## glm(formula = target ~ age + dis + small.rm + big.city, family = "binomial", 
##     data = crime)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8271  -0.3446  -0.0682   0.0000   3.3392  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -1.08311    1.21246  -0.893  0.37169   
## age            0.02025    0.01042   1.943  0.05197 . 
## dis           -0.52300    0.17451  -2.997  0.00273 **
## small.rm       1.47061    0.73787   1.993  0.04626 * 
## big.city      21.20209 1233.11296   0.017  0.98628   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 188.18  on 461  degrees of freedom
## AIC: 198.18
## 
## Number of Fisher Scoring iterations: 19

Age is now just shy of significant (p = 0.052), while dis and small.rm are significant at an alpha of 0.05.

What if we convert age to a binary using a cutoff of 80% or more of owner-occupied homes being built prior to 1940?
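
A sketch of this recoding and refit (the exact cutoff implementation here is an assumption):

# Recode age as binary: 1 if at least 80% of owner-occupied units were built before 1940.
crime_binary_age <- crime %>%
    mutate(age = ifelse(age >= 80, 1, 0))

summary(glm(target ~ age + dis + small.rm + big.city,
            family = "binomial", data = crime_binary_age))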

## 
## Call:
## glm(formula = target ~ age + dis + small.rm + big.city, family = "binomial", 
##     data = crime_binary_age)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.81535  -0.35747  -0.07583   0.00004   3.14470  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.09557    0.69832  -0.137 0.891143    
## age            1.07649    0.44139   2.439 0.014734 *  
## dis           -0.54362    0.15882  -3.423 0.000619 ***
## small.rm       1.40729    0.72813   1.933 0.053266 .  
## big.city      21.27326 1216.50730   0.017 0.986048    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 186.01  on 461  degrees of freedom
## AIC: 196.01
## 
## Number of Fisher Scoring iterations: 19

In this case age becomes significant, but now small.rm is not.

What if we exclude small.rm?

## 
## Call:
## glm(formula = target ~ age + dis + big.city, family = "binomial", 
##     data = crime_binary_age)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.26706  -0.36733  -0.07605   0.00004   3.13821  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.00276    0.69896   0.004 0.996849    
## age            1.12721    0.44225   2.549 0.010809 *  
## dis           -0.55235    0.15875  -3.479 0.000502 ***
## big.city      21.27145 1235.04591   0.017 0.986259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 189.79  on 462  degrees of freedom
## AIC: 197.79
## 
## Number of Fisher Scoring iterations: 19

This seems like a solid model.

What do the coefficients mean? Let’s calculate p for when age = 0 or age = 1 and for a few different values of dis, with big.city = 0.
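
A sketch of this calculation using the inverse logit (plogis); the model object name model1 is assumed here for the age + dis + big.city fit:

# logit(p) = b0 + b1*age + b2*dis + b3*big.city, so invert by hand over a grid of values.
b <- coef(model1)
grid <- expand.grid(big.city = 0, age = c(0, 1), dis = c(1:6, 9, 13))
grid$p <- round(plogis(b["(Intercept)"] + b["age"] * grid$age +
                       b["dis"] * grid$dis + b["big.city"] * grid$big.city), 4)
grid[order(grid$age, grid$dis), ]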

##    big.city age dis      p
## 1         0   0   1 0.3660
## 2         0   0   2 0.2494
## 3         0   0   3 0.1605
## 4         0   0   4 0.0992
## 5         0   0   5 0.0596
## 6         0   0   6 0.0352
## 7         0   0   9 0.0069
## 8         0   0  13 0.0008
## 9         0   1   1 0.6405
## 10        0   1   2 0.5063
## 11        0   1   3 0.3712
## 12        0   1   4 0.2536
## 13        0   1   5 0.1636
## 14        0   1   6 0.1012
## 15        0   1   9 0.0210
## 16        0   1  13 0.0024

Keeping dis constant, there is a higher probability of target=1 when age=1 as compared to age=0.

Keeping age constant, there is a lower probability of target=1 as dis increases.

Because the regression is linear in logit(p), the change in p itself is not linear. We see that when dis = 1, the difference in p between age = 0 and age = 1 is much larger than when dis is higher.

Approach 2 - Exclude big.city from the model.

Let’s start with a simple model including all neighborhoods, this time without big.city in the model.

Start with all variables again. The only change this time is that we use age as a binary right off the bat.

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime_binary_age[, 
##     setdiff(colnames(crime), "big.city")])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5282  -0.5566  -0.0792   0.5952   3.2015  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.66941    0.55427   3.012  0.00260 ** 
## age          0.91819    0.31163   2.946  0.00321 ** 
## dis         -0.77243    0.10966  -7.044 1.87e-12 ***
## lstat        0.02590    0.03063   0.845  0.39788    
## small.rm     0.61891    0.64926   0.953  0.34046    
## low.medv     0.58672    0.35576   1.649  0.09911 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 378.49  on 460  degrees of freedom
## AIC: 390.49
## 
## Number of Fisher Scoring iterations: 5

This time we get age and dis as significant right away.

The p-value for low.medv is fairly small but not significant.

What about if we include age, dis, and medv, but exclude lstat and small.rm?

## 
## Call:
## glm(formula = target ~ age + dis + low.medv, family = "binomial", 
##     data = crime_binary_age)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2530  -0.5559  -0.0818   0.5973   3.2012  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.0053     0.4583   4.376 1.21e-05 ***
## age           0.9739     0.3090   3.152  0.00162 ** 
## dis          -0.7998     0.1085  -7.368 1.73e-13 ***
## low.medv      0.8111     0.2883   2.813  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 381.21  on 462  degrees of freedom
## AIC: 389.21
## 
## Number of Fisher Scoring iterations: 5

Now low.medv is also significant.

Let’s compare to a model with just age and dis.

## 
## Call:
## glm(formula = target ~ age + dis, family = "binomial", data = crime_binary_age)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1235  -0.5632  -0.0588   0.5689   3.1657  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.1843     0.4519   4.833 1.34e-06 ***
## age           1.3061     0.2866   4.557 5.19e-06 ***
## dis          -0.8071     0.1086  -7.434 1.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 389.03  on 463  degrees of freedom
## AIC: 395.03
## 
## Number of Fisher Scoring iterations: 5

Age and dis are still significant when considered on their own.

Summary of three most promising models

In summary, the three most promising models are:

  1. big.city + age + dis
  2. age + dis + low.medv
  3. age + dis

In model #1, the coefficient of big.city is extremely large, so the predicted probability is effectively 100% whenever big.city = 1. When big.city = 0, this coefficient drops out, and the coefficients of age and dis become the important ones.

In the first model, we saw that the predicted probability of an area being high-crime drops dramatically with distance from employment centers. It seems fairly intuitive that more suburban areas would be less likely to be high-crime, so this makes sense. Similarly, most suburbs have newer housing stock while more urban areas have more older homes, so it also makes sense that a higher proportion of old homes is associated with a higher probability of being high-crime.

The direction of the coefficients for age and dis in models 2 and 3 are the same as for model 1.

In model #2, variable low.medv has a positive coefficient. It seems very reasonable that lower home values would be associated with a higher probability of being a high crime area.

Model selection

Comparing model statistics

For each of the three models, run confusionMatrix from the caret package.

Use p = 0.5 as the cutoff, where prediction = 1 when p > 0.5.
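
A sketch of this step, assuming model1 holds the fitted big.city + age + dis model:

# Classify using p > 0.5 and compare against the observed target.
model1_prob <- predict(model1, type = "response")
model1_class <- factor(ifelse(model1_prob > 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(model1_class,
                factor(crime_binary_age$target, levels = c(0, 1)),
                positive = "1")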

## [1] "Model #1 (incl. big.city, age, dis):"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 230  34
##          1   7 195
##                                           
##                Accuracy : 0.912           
##                  95% CI : (0.8825, 0.9361)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8236          
##  Mcnemar's Test P-Value : 4.896e-05       
##                                           
##             Sensitivity : 0.8515          
##             Specificity : 0.9705          
##          Pos Pred Value : 0.9653          
##          Neg Pred Value : 0.8712          
##              Prevalence : 0.4914          
##          Detection Rate : 0.4185          
##    Detection Prevalence : 0.4335          
##       Balanced Accuracy : 0.9110          
##                                           
##        'Positive' Class : 1               
## 
## [1] "Model #2 (incl. age, dis, and low.medv, not big.city):"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 192  39
##          1  45 190
##                                           
##                Accuracy : 0.8197          
##                  95% CI : (0.7818, 0.8536)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6395          
##  Mcnemar's Test P-Value : 0.5854          
##                                           
##             Sensitivity : 0.8297          
##             Specificity : 0.8101          
##          Pos Pred Value : 0.8085          
##          Neg Pred Value : 0.8312          
##              Prevalence : 0.4914          
##          Detection Rate : 0.4077          
##    Detection Prevalence : 0.5043          
##       Balanced Accuracy : 0.8199          
##                                           
##        'Positive' Class : 1               
## 
## [1] "Model #3 (incl. age and dis, not big.city):"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 192  43
##          1  45 186
##                                           
##                Accuracy : 0.8112          
##                  95% CI : (0.7726, 0.8457)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6223          
##  Mcnemar's Test P-Value : 0.9151          
##                                           
##             Sensitivity : 0.8122          
##             Specificity : 0.8101          
##          Pos Pred Value : 0.8052          
##          Neg Pred Value : 0.8170          
##              Prevalence : 0.4914          
##          Detection Rate : 0.3991          
##    Detection Prevalence : 0.4957          
##       Balanced Accuracy : 0.8112          
##                                           
##        'Positive' Class : 1               
## 

We get the following statistics for our models.

  1. Accuracy = 0.912, Classification error rate = 0.088, Sensitivity = 0.8515, Specificity = 0.9705
  2. Accuracy = 0.8197, Classification error rate = 0.1803, Sensitivity = 0.8297, Specificity = 0.8101
  3. Accuracy = 0.8112, Classification error rate = 0.1888, Sensitivity = 0.8122, Specificity = 0.8101

We can also see the confusion matrix for each model.

These summary stats are very similar for model #2 vs. model #3. Let’s look at the correlation between the predictions of the two models.

We find some noticeable differences in the exact probabilities when both models predict a fairly low p, e.g. 0.2 or 0.3 in one model vs. 0.1 in the other.

Using p=0.5 as a cutoff, though, these models mostly make the same predictions.

Let's proceed with looking only at models 1 and 3. This will also be convenient because it lets us compare two models that keep everything constant except whether or not big.city is included.

Let’s get precision and F1 score for these models.
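
These can be computed from the caret confusion matrix; a sketch, assuming cm1 holds the confusionMatrix result for model 1 (rows of cm1$table are predictions, columns are the reference):

tp <- cm1$table["1", "1"]                        # true positives
fp <- cm1$table["1", "0"]                        # false positives
precision <- tp / (tp + fp)
sensitivity <- unname(cm1$byClass["Sensitivity"])
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
print("Precision model1:"); print(precision)
print("F1 model1:"); print(f1)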

## [1] "Precision model1:"
## [1] 0.9653465
## [1] "Precision model3:"
## [1] 0.8051948
## [1] "F1 model1:"
## [1] 0.9048724
## [1] "F1 model3:"
## [1] 0.8086957

Summary of model1 vs. model3 including precision and F1:

  1. Model 1: Accuracy = 0.912, Classification error rate = 0.088, Sensitivity = 0.8515, Specificity = 0.9705, Precision = 0.9653, F1 = 0.9049
  2. Model 3: Accuracy = 0.8112, Classification error rate = 0.1888, Sensitivity = 0.8122, Specificity = 0.8101, Precision = 0.8052, F1 = 0.8087

We expected model 3 to perform somewhat worse: removing a predictor that so perfectly identifies many true positives naturally hurts the summary statistics.

Looking at the confusion matrices, we saw that the number of false negatives was actually pretty similar between the two models.

Where model3 really performs worse is in the number of false positives.

Let’s look at the correlation of probabilities between these two models.
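
A quick sketch of this comparison:

# Predicted probabilities from model 1 vs. model 3 for each neighborhood.
plot(model1_predictions_numeric, model3_predictions_numeric,
     xlab = "Model 1 predicted probability", ylab = "Model 3 predicted probability")
abline(0, 1, lty = 2)  # reference line where the two models agree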

What likely happened here is that some small-city neighborhoods have characteristics similar to big-city neighborhoods. Without big.city available, the model had to account for the many high-crime big-city neighborhoods using the remaining variables, which inflated the magnitude of their coefficients and produces more false positives among small-city neighborhoods.

Let’s make some ROC curves.

## 
## Call:
## roc.default(response = crime$target, predictor = model1_predictions_numeric,     plot = TRUE, main = "Model1")
## 
## Data: model1_predictions_numeric in 237 controls (crime$target 0) < 229 cases (crime$target 1).
## Area under the curve: 0.968

## 
## Call:
## roc.default(response = crime$target, predictor = model3_predictions_numeric,     plot = TRUE, main = "Model3")
## 
## Data: model3_predictions_numeric in 237 controls (crime$target 0) < 229 cases (crime$target 1).
## Area under the curve: 0.8953

The AUC for model1 is excellent (0.968). For model3 it is substantially worse, as we might expect (0.8953).

Let’s replot ROC for model3, adding some information on thresholds. Maybe we will want to use a threshold other than 0.5.

## 
## Call:
## roc.default(response = crime$target, predictor = model3_predictions_numeric,     plot = TRUE, main = "Model3", print.thres = seq(from = 0.2,         to = 0.8, by = 0.1))
## 
## Data: model3_predictions_numeric in 237 controls (crime$target 0) < 229 cases (crime$target 1).
## Area under the curve: 0.8953

Let’s print confusion matrix for a few different higher thresholds for model3.
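
A sketch for one of these thresholds (whether the cutoff is strict is an assumption):

# Re-classify the model 3 probabilities at a threshold of 0.6 instead of 0.5.
model3_class_0.6 <- factor(ifelse(model3_predictions_numeric > 0.6, 1, 0), levels = c(0, 1))
print("Threshold = 0.6:")
confusionMatrix(model3_class_0.6,
                factor(crime_binary_age$target, levels = c(0, 1)),
                positive = "1")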

## [1] "Threshold = 0.6:"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 202  57
##          1  35 172
##                                           
##                Accuracy : 0.8026          
##                  95% CI : (0.7635, 0.8378)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6044          
##  Mcnemar's Test P-Value : 0.02857         
##                                           
##             Sensitivity : 0.7511          
##             Specificity : 0.8523          
##          Pos Pred Value : 0.8309          
##          Neg Pred Value : 0.7799          
##              Prevalence : 0.4914          
##          Detection Rate : 0.3691          
##    Detection Prevalence : 0.4442          
##       Balanced Accuracy : 0.8017          
##                                           
##        'Positive' Class : 1               
## 
## [1] "Threshold = 0.7:"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 205  70
##          1  32 159
##                                           
##                Accuracy : 0.7811          
##                  95% CI : (0.7408, 0.8178)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5609          
##  Mcnemar's Test P-Value : 0.0002487       
##                                           
##             Sensitivity : 0.6943          
##             Specificity : 0.8650          
##          Pos Pred Value : 0.8325          
##          Neg Pred Value : 0.7455          
##              Prevalence : 0.4914          
##          Detection Rate : 0.3412          
##    Detection Prevalence : 0.4099          
##       Balanced Accuracy : 0.7797          
##                                           
##        'Positive' Class : 1               
## 
## [1] "Threshold = 0.8:"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 215  87
##          1  22 142
##                                          
##                Accuracy : 0.7661         
##                  95% CI : (0.725, 0.8038)
##     No Information Rate : 0.5086         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5298         
##  Mcnemar's Test P-Value : 8.783e-10      
##                                          
##             Sensitivity : 0.6201         
##             Specificity : 0.9072         
##          Pos Pred Value : 0.8659         
##          Neg Pred Value : 0.7119         
##              Prevalence : 0.4914         
##          Detection Rate : 0.3047         
##    Detection Prevalence : 0.3519         
##       Balanced Accuracy : 0.7636         
##                                          
##        'Positive' Class : 1              
## 

We would have to raise the threshold to around 0.8 before getting fewer than 30 false positives, and by that point the number of false negatives has grown substantially, so this does not seem like a good idea.

Summary of chosen model

Despite the considerably worse performance of the model excluding big.city, I think we should proceed with the model that excludes this variable.

Excluding this variable will make the model much more generalizable. For example, what if we wanted to apply a similar model to a dataset containing only neighborhoods within a large municipality, where the target variable was calculated only within municipality limits? E.g. neighborhoods of NYC, looking at the median crime rate within only the five boroughs.

Age of housing stock and distance to jobs are very general variables that we may find outside of just this data set.

Application of this model to evaluation data

Let’s apply our model to the evaluation data.
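
A sketch of this step; the model object name (model3 for the age + dis fit), the age recoding, and the exact output columns are assumptions on my part:

# Read the evaluation data and apply the same age binarization as in training.
evaluation <- read.csv("https://raw.githubusercontent.com/heathergeiger/Data621_hw3/master/crime-evaluation-data.csv")
evaluation_transformed <- evaluation %>%
    mutate(age = ifelse(age >= 80, 1, 0))

# Predicted probabilities and classifications from the chosen model.
eval_prob <- predict(model3, newdata = evaluation_transformed, type = "response")
eval_class <- ifelse(eval_prob > 0.5, 1, 0)

write.csv(data.frame(evaluation, probability = round(eval_prob, 4), prediction = eval_class),
          "geiger_heather_final_predictions_evaluation.csv", row.names = FALSE)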

This output is available at my Github here:

https://raw.githubusercontent.com/heathergeiger/Data621_hw3/master/geiger_heather_final_predictions_evaluation.csv

Code appendix

Code used to create this output is available here:

https://github.com/heathergeiger/Data621_hw3/blob/master/geiger_heather_data621_homework3_cleaned.R