The following is a description of the task at hand as given by Marcus Ellis.
In this homework assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
For this analysis, we will use tidyverse libraries including ggplot2, dplyr, and tidyr.
We will also use the caret and pROC packages.
First, we will read in the data, available at my Github here:
https://raw.githubusercontent.com/heathergeiger/Data621_hw3/master/crime-training-data.csv
Once the data is read in, we check the number of non-NA values for each numeric variable.
For the binary variables, we run the table function to check the number of 0's vs. 1's.
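A minimal sketch of what this step might look like in R (the data frame name "crime" is an assumption, consistent with the model calls later in this report):

```r
# Load the packages mentioned above and read in the training data.
library(tidyverse) # ggplot2, dplyr, tidyr
library(caret)
library(pROC)

crime <- read.csv("https://raw.githubusercontent.com/heathergeiger/Data621_hw3/master/crime-training-data.csv")

print("Total number of records in data:")
print(nrow(crime))

# Count non-NA values for each numeric (non-binary) variable.
numeric_vars <- setdiff(colnames(crime), c("chas", "target"))
non_na_counts <- data.frame(Variable = numeric_vars,
                            n = sapply(crime[, numeric_vars], function(x) sum(!is.na(x))))
print("Number of non-NA values per variable in data:")
print(non_na_counts[order(non_na_counts$Variable), ])

# For the binary variables, tabulate 0 vs. 1.
print("Frequency of variable chas 0 vs. 1:")
print(table(crime$chas))
print("Frequency of variable target 0 vs. 1:")
print(table(crime$target))
```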
## [1] "Total number of records in data:"
## [1] 466
## [1] "Number of non-NA values per variable in data:"
## Variable n
## 1 age 466
## 2 black 466
## 3 dis 466
## 4 indus 466
## 5 lstat 466
## 6 medv 466
## 7 nox 466
## 8 ptratio 466
## 9 rad 466
## 10 rm 466
## 11 tax 466
## 12 zn 466
## [1] "Frequency of variable chas 0 vs. 1:"
##
## 0 1
## 433 33
## [1] "Frequency of variable target 0 vs. 1:"
##
## 0 1
## 237 229
We find that there are no NAs in the numeric or factor variables, so we do not need to worry about filling in any missing values.
Next, let’s check the number of unique values per variable.
## Variable Num.uniques
## 3 chas 2
## 12 target 2
## 10 rad 9
## 14 zn 26
## 9 ptratio 46
## 13 tax 63
## 5 indus 73
## 8 nox 79
## 7 medv 218
## 2 black 331
## 1 age 333
## 4 dis 380
## 11 rm 419
## 6 lstat 424
We find that several variables appear to have many fewer unique values than the number of records.
We saw in particular that variables rad, zn, ptratio, tax, indus, and nox had a lot of repeated values.
Let’s check how many unique combinations we get if we look at combinations of these variables.
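A sketch of how these combination counts might be computed, pasting the values together with "/" as in the combination labels shown below:

```r
# Count unique combinations of the candidate municipality-level variables.
combo_5 <- paste(crime$rad, crime$tax, crime$indus, crime$ptratio, crime$zn, sep = "/")
combo_6 <- paste(combo_5, crime$nox, sep = "/")

print("Number of unique combinations of rad/tax/indus/ptratio/zn:")
print(length(unique(combo_5)))
print("Number of unique combinations of rad/tax/indus/ptratio/zn/nox:")
print(length(unique(combo_6)))
```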
## [1] "Number of unique combinations of rad/tax/indus/ptratio/zn:"
## [1] 75
## [1] "Number of unique combinations of rad/tax/indus/ptratio/zn/nox:"
## [1] 95
Now, let's look in more detail at why there are more unique combinations when we include nox.
## Rad.tax.indus.ptratio.zn Num.separate.values.nox
## 18 24/666/18.1/20.2/0 18
## 55 5/264/3.97/13/20 2
## 66 5/403/19.58/14.7/0 2
## 75 8/307/6.2/17.4/0 2
For the combination with rad = 24, there are 18 distinct values of nox. We also find two distinct values of nox for each of three other combinations of rad/tax/indus/ptratio/zn.
My hypothesis is that neighborhoods that share a common value for rad, tax, indus, ptratio, and zn are all part of the same municipality.
I will proceed based on this hypothesis for the remainder of this analysis.
First, let’s print details on the number of neighborhoods per municipality.
## [1] "Number of neighborhoods per municipality:"
##
## 1 2 3 4 5 6 7 8 9 10 11 12 14 16 19 28 121
## 15 14 11 9 5 4 4 1 3 1 2 1 1 1 1 1 1
Now check the number of high vs. low-crime neighborhoods per municipality.
## [1] "Table for proportions more frequent neighborhood type:"
##
## 0.6667 0.7 0.7143 0.7778 0.8 0.8182 0.9091 0.9286 1
## 1 1 1 1 1 1 1 1 67
We find that most municipalities either have all high-crime or all low-crime neighborhoods rather than a mix.
We find that the number of neighborhoods in a municipality is strongly associated with being majority high-crime.
In fact, we find that in the four municipalities with at least 15 neighborhoods, all neighborhoods are high-crime.
Maybe we can use the municipality information to guide our model. To do this, though, we would want to be sure at least some of the same municipalities are in the evaluation data. Let’s check.
Read in the evaluation data from here:
https://raw.githubusercontent.com/heathergeiger/Data621_hw3/master/crime-evaluation-data.csv
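A sketch of reading in the evaluation data and checking for matching municipalities; "crime_eval" and the approach of matching pasted rad/tax/indus/ptratio/zn strings are assumptions:

```r
# Read the evaluation data and compare its municipality combinations to training.
crime_eval <- read.csv("https://raw.githubusercontent.com/heathergeiger/Data621_hw3/master/crime-evaluation-data.csv")

print("Number of records in evaluation:")
print(nrow(crime_eval))

train_combos <- paste(crime$rad, crime$tax, crime$indus, crime$ptratio, crime$zn, sep = "/")
eval_combos <- paste(crime_eval$rad, crime_eval$tax, crime_eval$indus, crime_eval$ptratio, crime_eval$zn, sep = "/")

# For example, the combination for the 121-neighborhood municipality in training:
biggest_combo <- names(sort(table(train_combos), decreasing = TRUE))[1]
print("Number of records matching 121-neighborhood municipality from training:")
print(sum(eval_combos == biggest_combo))
```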
## [1] "Number of records in evaluation:"
## [1] 40
## [1] "Number of records matching 121-neighborhood municipality from training:"
## [1] 11
## [1] "Number of records matching 28-neighborhood municipality from training:"
## [1] 2
## [1] "Number of records matching 19-neighborhood municipality from training:"
## [1] 3
## [1] "Number of records matching 16-neighborhood municipality from training:"
## [1] 2
It seems the evaluation data contains data from some of the same municipalities as the training data.
It seems the main use of the municipality data would be to flag whether each neighborhood is part of a municipality with many neighborhoods or only a few. A reasonable cutoff is to consider a municipality "large" if it has 15 or more neighborhoods in the training data. We will use the variables rad, tax, indus, ptratio, and zn to determine which municipality each neighborhood belongs to, then create a binary variable "big.city" indicating whether the neighborhood is in a municipality with 15+ neighborhoods.
After this, discard all municipality-level variables and focus on neighborhood-level variables to fill in the rest of the model.
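A sketch of how the "big.city" flag might be derived (variable and object names are assumptions):

```r
# Define a municipality ID from the shared combination of rad/tax/indus/ptratio/zn,
# then flag neighborhoods belonging to municipalities with 15+ neighborhoods.
municipality <- paste(crime$rad, crime$tax, crime$indus, crime$ptratio, crime$zn, sep = "/")
municipality_sizes <- table(municipality)
big_municipalities <- names(municipality_sizes)[municipality_sizes >= 15]

crime$big.city <- ifelse(municipality %in% big_municipalities, 1, 0)

# The evaluation data would be flagged the same way, using big_municipalities
# learned from the training data.
```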
We still have a number of other variables to look at, which we are assuming were measured at the neighborhood rather than municipality level. To review, the variables left after removing municipality-level ones are:
Variable “chas” is a factor variable, which we already saw earlier was 1 for 33/466 records.
Before we even look at the variable "black", we see from its definition that it is a transformed variable with an unusual relationship to the original variable. Since the original variable is a proportion, it must be between 0 and 1. If we plug x = 1 into the equation 1000 * (x - 0.63)^2, we get 136.9. We also get 136.9 when x (the proportion of black residents) is 0.26. When "black" is > 136.9 and <= 396.9, the two roots of the equation lie at 0 <= x < 0.26 and x > 1, so we can reasonably infer that the actual proportion is < 0.26. When "black" is <= 136.9, the actual proportion is uncertain (the two roots straddle 0.63), with more uncertainty as the value gets larger.
Therefore, as long as the range of this variable in our data is within 0 to 396.9, it seems reasonable to convert it to a binary: proportion of black residents < 0.26 vs. >= 0.26.
Let’s just check this.
## [1] 0.32 396.90
Good to go!
Now, let’s look at the distribution of rm, age, dis, lstat, and medv.
We find that dis, lstat, and medv are somewhat right-skewed, while age is left-skewed.
Before we can fully explore the correlation between variables, we really need to create our “big city” binary variable and convert “black” to binary. Let’s do that now.
Note that here we will make black equal to 1 if the proportion of black residents is 26% or higher.
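A sketch of this conversion, using the factor labels that appear in the tables below (whether the report stores it as a labeled factor or as 0/1 is an assumption):

```r
# black <= 136.9 corresponds to a proportion of black residents >= 0.26,
# since 1000 * (x - 0.63)^2 <= 136.9 only when 0.26 <= x <= 1.
crime$black <- factor(ifelse(crime$black <= 136.9,
                             "Black.residents.26percent.plus",
                             "Black.residents.under.26percent"))
```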
To review, our new sets of predictor variables are:
Let’s look at the correlations between all of these.
First, look at the co-occurrence of the three pairs of binary variables.
## [1] "Variables chas vs. black:"
## black
## chas Black.residents.26percent.plus
## Borders.Charles 1
## Doesnt.border.Charles 32
## black
## chas Black.residents.under.26percent
## Borders.Charles 32
## Doesnt.border.Charles 401
## [1] "Variables chas vs. big.city:"
## big_city
## chas Big.city Small.city
## Borders.Charles 19 14
## Doesnt.border.Charles 165 268
## [1] "Variables black vs. big.city:"
## big_city
## black Big.city Small.city
## Black.residents.26percent.plus 32 1
## Black.residents.under.26percent 152 281
We find that nearly all neighborhoods with 26%+ black residents are part of a big city. And we already found that a neighborhood being part of a big city is a 100% predictor for having higher than median crime rate in this dataset. So this variable does not really add much additional information and should probably be removed.
What about the association of bordering the Charles River and being part of a big city versus numeric variables?
We find that bordering the Charles River appears to be negatively associated with distance to employment centers, where neighborhoods that border the Charles are closer to jobs. It is positively associated with median home value.
We see noticeable differences between neighborhoods in small vs. big municipalities for pretty much all of the numeric variables.
Now, let's make scatterplots showing the correlations between pairs of numeric variables. Since we see such dramatic differences between big and small cities, let's plot them separately.
It seems that the relationship between lstat and medv is non-linear.
Perhaps a log-transformation could make their relationship approximately linear?
Let’s use log2-transformation, as it is a bit easier to understand for the range of our data.
Add a pseudo-count of 1 to lstat so that any zero values do not produce problematic results when log-transformed.
And the correlation of both as log2-transformed:
It seems like log-transformation of these variables (at least of lstat especially) may be a good idea before including in the model.
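A quick sketch of comparing the correlation before and after the log2-transformation (the exact comparison plotted above may differ):

```r
# Correlation of lstat and medv, raw vs. log2-transformed
# (pseudo-count of 1 added to lstat only, as described above).
cor_raw <- cor(crime$lstat, crime$medv)
cor_log <- cor(log2(crime$lstat + 1), log2(crime$medv))
print(round(c(raw = cor_raw, log2 = cor_log), 3))
```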
Let's check whether there appears to be an association between each predictor variable and the target by making some more boxplots.
Since 100% of neighborhoods in big cities have target = 1, use only small-city neighborhoods to create the boxplot.
For variable chas, make a simple contingency table.
## [1] "As raw numbers:"
## target_small_cities
## chas_small_cities Above.median.crime Below.median.crime
## Borders.Charles 2 12
## Doesnt.border.Charles 43 225
## [1] "As proportions:"
## target_small_cities
## chas_small_cities Above.median.crime Below.median.crime
## Borders.Charles 0.1429 0.8571
## Doesnt.border.Charles 0.1604 0.8396
Bordering the Charles River does not appear to be associated with a neighborhood being high-crime within small municipalities. The proportion of high-crime neighborhoods in small municipalities that do border the Charles River is nearly identical to the proportion in small municipalities that do not.
Variables age, dis, and lstat appear correlated with whether or not a neighborhood within a small municipality has a high crime rate. We also see a correlation after log-transforming lstat.
Based on just these boxplots, rm and medv (whether log-transformed or not) do not appear very correlated with our target variable.
Let’s also try looking at a contingency table of target vs. a few different levels of these variables.
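A sketch of how the binned contingency tables below might be built for rm (medv would be handled analogously); the cut points mirror the group labels:

```r
# Bin rm for small-city neighborhoods and cross-tabulate against target.
small <- crime[crime$big.city == 0, ]

rm_groups <- cut(small$rm,
                 breaks = c(-Inf, 5.5, 6.5, 7.5, Inf),
                 labels = c("small (<5.5)", "med (>= 5.5, < 6.5)",
                            "large (>=6.5, < 7.5)", "XL (>= 7.5)"),
                 right = FALSE)

print(table(rm_groups, small$target))
print(round(prop.table(table(rm_groups, small$target), margin = 1), 4))
```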
##
## rm_small_cities 0 1
## large (>=6.5, < 7.5) 73 9
## med (>= 5.5, < 6.5) 148 25
## small (<5.5) 4 6
## XL (>= 7.5) 12 5
##
## rm_small_cities 0 1
## large (>=6.5, < 7.5) 0.8902 0.1098
## med (>= 5.5, < 6.5) 0.8555 0.1445
## small (<5.5) 0.4000 0.6000
## XL (>= 7.5) 0.7059 0.2941
##
## medv_small_cities 0 1
## high (>= 30) 49 12
## low (< 15) 6 2
## low-med (>= 15, < 20) 44 17
## med (>= 20, < 25) 106 14
## med-high (>= 25, < 30) 32 0
##
## medv_small_cities 0 1
## high (>= 30) 0.8033 0.1967
## low (< 15) 0.7500 0.2500
## low-med (>= 15, < 20) 0.7213 0.2787
## med (>= 20, < 25) 0.8833 0.1167
## med-high (>= 25, < 30) 1.0000 0.0000
While the cutoffs we created for grouping were somewhat arbitrary, it appears that there may be something here.
Mainly, it appears that having a very low average number of rooms is positively associated with a higher crime rate. However, once a neighborhood has even a modest average number of rooms, increasing it further does not seem to add much. The idea is similar for median home value: a very low median home value is positively associated with crime, but the relationship does not appear to be linear beyond a certain level.
Let’s use density plots to get a better sense of where might be good to place cutoffs, if we decide to convert these variables to binary.
It actually looks like 5.5 was a pretty good cutoff for rm.
For medv, 20 looks like a reasonable cutoff.
One last check - what will the correlation between rm and medv be if we convert to binary using these cutoffs?
## medv_binary
## rm_binary Higher.value.home Lower.value.home
## Larger.average.home 211 61
## Small.average.home 2 8
We find the majority of neighborhoods with smaller average homes also have fairly low home values. This makes sense.
I think we now have a pretty good understanding of the data, and are ready to move on to the transformation step.
To summarize, we have already run the following transformations.
Before modeling, we should run the following additional transformations:
Just to make things simpler later on (the models will work the same either way), let's also convert all binary variables except "target" to numeric (i.e. have them as actual numbers 0 and 1 rather than factors).
From here on out, we will take two main approaches to this analysis:
Within both approaches, we may use methods like forward or backward selection to choose which other predictor variables to include.
Let’s start by simply modeling all variables including big.city.
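This corresponds to the call shown in the output below (the prepared data frame is assumed to be named "crime"):

```r
# Full model: all remaining predictors, including big.city.
model_all <- glm(target ~ ., family = "binomial", data = crime)
summary(model_all)
```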
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8465 -0.3440 -0.0697 0.0000 3.3262
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.98101 1.27100 -0.772 0.44021
## age 0.02166 0.01175 1.844 0.06521 .
## dis -0.52263 0.17706 -2.952 0.00316 **
## lstat -0.02084 0.04877 -0.427 0.66923
## big.city 21.22864 1233.69514 0.017 0.98627
## small.rm 1.69383 0.92622 1.829 0.06744 .
## low.medv 0.07922 0.48186 0.164 0.86941
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 188.00 on 459 degrees of freedom
## AIC: 202
##
## Number of Fisher Scoring iterations: 19
Due to quirks of how logistic regression p-values are calculated, big.city is technically not significant despite target always being equal to 1 when this variable equals 1. The perfect association causes separation: the coefficient estimate and its standard error both blow up, which makes the Wald p-value uninformative.
We also find that lstat and medv have very large p-values. This makes some sense based on what we saw during data exploration. While lstat did seem somewhat correlated with target, the separation we saw in the boxplots was not huge. As for medv, the increase in the proportion of high-crime neighborhoods when comparing low- vs. high-value areas was not especially strong either.
Let's just double check whether the p-value of lstat changes substantially if we log-transform it. We also exclude medv to minimize collinearity.
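A sketch of the two fits, matching the calls in the output below:

```r
# Compare raw lstat vs. log2(lstat + 1), excluding medv in both fits.
model_lstat_raw <- glm(target ~ age + dis + small.rm + big.city + lstat,
                       family = "binomial", data = crime)
model_lstat_log <- glm(target ~ age + dis + small.rm + big.city + log2(lstat + 1),
                       family = "binomial", data = crime)
summary(model_lstat_raw)
summary(model_lstat_log)
```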
##
## Call:
## glm(formula = target ~ age + dis + small.rm + big.city + lstat,
## family = "binomial", data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8419 -0.3442 -0.0685 0.0000 3.3314
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.02506 1.24062 -0.826 0.40866
## age 0.02203 0.01150 1.916 0.05535 .
## dis -0.52132 0.17680 -2.949 0.00319 **
## small.rm 1.68754 0.92412 1.826 0.06783 .
## big.city 21.23544 1233.91401 0.017 0.98627
## lstat -0.01749 0.04418 -0.396 0.69216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 188.02 on 460 degrees of freedom
## AIC: 200.02
##
## Number of Fisher Scoring iterations: 19
##
## Call:
## glm(formula = target ~ age + dis + small.rm + big.city + log2(lstat +
## 1), family = "binomial", data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8388 -0.3444 -0.0680 0.0000 3.3212
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.73296 1.55738 -0.471 0.63790
## age 0.02180 0.01138 1.916 0.05540 .
## dis -0.52483 0.17662 -2.972 0.00296 **
## small.rm 1.60839 0.83412 1.928 0.05382 .
## big.city 21.22372 1232.84053 0.017 0.98626
## log2(lstat + 1) -0.13146 0.36188 -0.363 0.71641
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 188.05 on 460 degrees of freedom
## AIC: 200.05
##
## Number of Fisher Scoring iterations: 19
Either way, lstat is not significant. Let's now show a model excluding both lstat and medv.
##
## Call:
## glm(formula = target ~ age + dis + small.rm + big.city, family = "binomial",
## data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8271 -0.3446 -0.0682 0.0000 3.3392
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.08311 1.21246 -0.893 0.37169
## age 0.02025 0.01042 1.943 0.05197 .
## dis -0.52300 0.17451 -2.997 0.00273 **
## small.rm 1.47061 0.73787 1.993 0.04626 *
## big.city 21.20209 1233.11296 0.017 0.98628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 188.18 on 461 degrees of freedom
## AIC: 198.18
##
## Number of Fisher Scoring iterations: 19
Age is just shy of significant now, while dis and small.rm are significant using an alpha of .05.
What if we convert age to a binary using a cutoff of 80% or more of owner-occupied homes being built prior to 1940?
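A sketch of the recoding and refit (assuming age is on a 0-100 percent scale; the data frame name "crime_binary_age" matches the output below):

```r
# Recode age as 1 when 80% or more of owner-occupied homes were built before 1940.
crime_binary_age <- crime
crime_binary_age$age <- ifelse(crime$age >= 80, 1, 0)

model_binary_age <- glm(target ~ age + dis + small.rm + big.city,
                        family = "binomial", data = crime_binary_age)
summary(model_binary_age)
```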
##
## Call:
## glm(formula = target ~ age + dis + small.rm + big.city, family = "binomial",
## data = crime_binary_age)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.81535 -0.35747 -0.07583 0.00004 3.14470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.09557 0.69832 -0.137 0.891143
## age 1.07649 0.44139 2.439 0.014734 *
## dis -0.54362 0.15882 -3.423 0.000619 ***
## small.rm 1.40729 0.72813 1.933 0.053266 .
## big.city 21.27326 1216.50730 0.017 0.986048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 186.01 on 461 degrees of freedom
## AIC: 196.01
##
## Number of Fisher Scoring iterations: 19
In this case age becomes significant, but now small.rm is not.
What if we exclude small.rm?
##
## Call:
## glm(formula = target ~ age + dis + big.city, family = "binomial",
## data = crime_binary_age)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.26706 -0.36733 -0.07605 0.00004 3.13821
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.00276 0.69896 0.004 0.996849
## age 1.12721 0.44225 2.549 0.010809 *
## dis -0.55235 0.15875 -3.479 0.000502 ***
## big.city 21.27145 1235.04591 0.017 0.986259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 189.79 on 462 degrees of freedom
## AIC: 197.79
##
## Number of Fisher Scoring iterations: 19
This seems like a solid model.
What do the coefficients mean? Let’s calculate p for when age = 0 or age = 1 and for a few different values of dis, with big.city = 0.
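A sketch of how this table of probabilities might be produced, assuming the fit from the previous step is stored as "model1":

```r
# Predicted probabilities over a grid of age and dis values, with big.city = 0.
grid <- expand.grid(big.city = 0, age = c(0, 1), dis = c(1:6, 9, 13))
grid <- grid[order(grid$age, grid$dis), ]
grid$p <- round(predict(model1, newdata = grid, type = "response"), 4)
print(grid)
```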
## big.city age dis p
## 1 0 0 1 0.3660
## 2 0 0 2 0.2494
## 3 0 0 3 0.1605
## 4 0 0 4 0.0992
## 5 0 0 5 0.0596
## 6 0 0 6 0.0352
## 7 0 0 9 0.0069
## 8 0 0 13 0.0008
## 9 0 1 1 0.6405
## 10 0 1 2 0.5063
## 11 0 1 3 0.3712
## 12 0 1 4 0.2536
## 13 0 1 5 0.1636
## 14 0 1 6 0.1012
## 15 0 1 9 0.0210
## 16 0 1 13 0.0024
Keeping dis constant, there is a higher probability of target=1 when age=1 as compared to age=0.
Keeping age constant, there is a lower probability of target=1 as dis increases.
Because the regression is based on logit(p), the change in p itself is not linear. We see that the difference in p between age = 0 and age = 1 is much larger when dis = 1 than when dis is higher.
Let's now model all neighborhoods again, this time without big.city in the model.
Start with all of the remaining variables; the only change this time is that we use age as a binary right off the bat.
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = crime_binary_age[,
## setdiff(colnames(crime), "big.city")])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5282 -0.5566 -0.0792 0.5952 3.2015
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.66941 0.55427 3.012 0.00260 **
## age 0.91819 0.31163 2.946 0.00321 **
## dis -0.77243 0.10966 -7.044 1.87e-12 ***
## lstat 0.02590 0.03063 0.845 0.39788
## small.rm 0.61891 0.64926 0.953 0.34046
## low.medv 0.58672 0.35576 1.649 0.09911 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 378.49 on 460 degrees of freedom
## AIC: 390.49
##
## Number of Fisher Scoring iterations: 5
This time we get age and dis as significant right away.
The p-value for low.medv is fairly small but not significant at alpha = .05.
What about if we include age, dis, and medv, but exclude lstat and small.rm?
##
## Call:
## glm(formula = target ~ age + dis + low.medv, family = "binomial",
## data = crime_binary_age)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2530 -0.5559 -0.0818 0.5973 3.2012
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0053 0.4583 4.376 1.21e-05 ***
## age 0.9739 0.3090 3.152 0.00162 **
## dis -0.7998 0.1085 -7.368 1.73e-13 ***
## low.medv 0.8111 0.2883 2.813 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 381.21 on 462 degrees of freedom
## AIC: 389.21
##
## Number of Fisher Scoring iterations: 5
Now low.medv is also significant.
Let’s compare to a model with just age and dis.
##
## Call:
## glm(formula = target ~ age + dis, family = "binomial", data = crime_binary_age)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1235 -0.5632 -0.0588 0.5689 3.1657
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1843 0.4519 4.833 1.34e-06 ***
## age 1.3061 0.2866 4.557 5.19e-06 ***
## dis -0.8071 0.1086 -7.434 1.06e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 389.03 on 463 degrees of freedom
## AIC: 395.03
##
## Number of Fisher Scoring iterations: 5
Age and dis are still significant when considered on their own.
In summary, the three most promising models are:
In model #1, the coefficient of big.city is extremely large, such that the predicted probability is essentially 100% whenever big.city = 1. When big.city = 0, this term drops out, and the coefficients of age and dis drive the prediction.
In the first model, we saw that the predicted probability of an area being high-crime drops dramatically as you get further from employment centers. It seems fairly intuitive that more suburban areas would be less likely to be high-crime, so this makes sense. Similarly, most suburbs have newer housing stock, while more urban areas have more older homes. So it also makes sense that a higher proportion of older homes is associated with a higher probability of being high-crime, since such neighborhoods tend to be less suburban.
The direction of the coefficients for age and dis in models 2 and 3 are the same as for model 1.
In model #2, variable low.medv has a positive coefficient. It seems very reasonable that lower home values would be associated with a higher probability of being a high crime area.
For each of the three models, run confusionMatrix from the caret package.
Use p = 0.5 as the cutoff, where prediction = 1 when p > 0.5.
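A sketch for model 1; models 2 and 3 would be handled the same way (object names are assumptions):

```r
# Classify using a 0.5 cutoff on the fitted probabilities, then tabulate with caret.
model1_probs <- predict(model1, type = "response")
model1_class <- factor(ifelse(model1_probs > 0.5, 1, 0), levels = c(0, 1))

confusionMatrix(model1_class,
                factor(crime_binary_age$target, levels = c(0, 1)),
                positive = "1")
```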
## [1] "Model #1 (incl. big.city, age, dis):"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 230 34
## 1 7 195
##
## Accuracy : 0.912
## 95% CI : (0.8825, 0.9361)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8236
## Mcnemar's Test P-Value : 4.896e-05
##
## Sensitivity : 0.8515
## Specificity : 0.9705
## Pos Pred Value : 0.9653
## Neg Pred Value : 0.8712
## Prevalence : 0.4914
## Detection Rate : 0.4185
## Detection Prevalence : 0.4335
## Balanced Accuracy : 0.9110
##
## 'Positive' Class : 1
##
## [1] "Model #2 (incl. age, dis, and low.medv, not big.city):"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 192 39
## 1 45 190
##
## Accuracy : 0.8197
## 95% CI : (0.7818, 0.8536)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6395
## Mcnemar's Test P-Value : 0.5854
##
## Sensitivity : 0.8297
## Specificity : 0.8101
## Pos Pred Value : 0.8085
## Neg Pred Value : 0.8312
## Prevalence : 0.4914
## Detection Rate : 0.4077
## Detection Prevalence : 0.5043
## Balanced Accuracy : 0.8199
##
## 'Positive' Class : 1
##
## [1] "Model #3 (incl. age and dis, not big.city):"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 192 43
## 1 45 186
##
## Accuracy : 0.8112
## 95% CI : (0.7726, 0.8457)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6223
## Mcnemar's Test P-Value : 0.9151
##
## Sensitivity : 0.8122
## Specificity : 0.8101
## Pos Pred Value : 0.8052
## Neg Pred Value : 0.8170
## Prevalence : 0.4914
## Detection Rate : 0.3991
## Detection Prevalence : 0.4957
## Balanced Accuracy : 0.8112
##
## 'Positive' Class : 1
##
We get the following statistics for our models.
We can also see the confusion matrix for each model.
These summary stats are very similar for model #2 vs. model #3. Let’s look at the correlation between the predictions of the two models.
We find some sizable differences in the exact numbers when both models predict low p, e.g. 0.2 or 0.3 in one model vs. 0.1 in the other.
Using p=0.5 as a cutoff, though, these models mostly make the same predictions.
Let's proceed with looking only at models 1 and 3. This will also be convenient because it allows us to compare two models that are identical except for whether big.city is included.
Let’s get precision and F1 score for these models.
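These can be computed directly from the confusion matrices above; for example, for model 1 (TP = 195, FP = 7, FN = 34):

```r
# Precision = TP / (TP + FP); F1 = harmonic mean of precision and recall.
tp <- 195; fp <- 7; fn <- 34
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)
print(c(precision = precision, F1 = f1))
```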
## [1] "Precision model1:"
## [1] 0.9653465
## [1] "Precision model3:"
## [1] 0.8051948
## [1] "F1 model1:"
## [1] 0.9048724
## [1] "F1 model3:"
## [1] 0.8086957
Summary of model1 vs. model3 including precision and F1:
We expected model 3 to perform somewhat worse, since removing a predictor variable that so perfectly identifies many true positives naturally hurts the model statistics.
Looking at the confusion matrices, we saw that the number of false negatives was actually pretty similar between the two models.
Where model3 really performs worse is in the number of false positives.
Let’s look at the correlation of probabilities between these two models.
What likely happened here is that some small-city neighborhoods have characteristics similar to big-city ones. Model 3 attributed the high crime rates of the many big-city neighborhoods it saw to those shared characteristics, which inflated the magnitude of their coefficients and therefore the predicted probabilities for similar small-city neighborhoods.
Let’s make some ROC curves.
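The calls mirror those shown in the output below; the predictor vectors are the fitted probabilities from each model:

```r
library(pROC)

model1_predictions_numeric <- predict(model1, type = "response")
model3_predictions_numeric <- predict(model3, type = "response")

roc(response = crime$target, predictor = model1_predictions_numeric,
    plot = TRUE, main = "Model1")
roc(response = crime$target, predictor = model3_predictions_numeric,
    plot = TRUE, main = "Model3")
```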
##
## Call:
## roc.default(response = crime$target, predictor = model1_predictions_numeric, plot = TRUE, main = "Model1")
##
## Data: model1_predictions_numeric in 237 controls (crime$target 0) < 229 cases (crime$target 1).
## Area under the curve: 0.968
##
## Call:
## roc.default(response = crime$target, predictor = model3_predictions_numeric, plot = TRUE, main = "Model3")
##
## Data: model3_predictions_numeric in 237 controls (crime$target 0) < 229 cases (crime$target 1).
## Area under the curve: 0.8953
The AUC for model1 is excellent (0.968). For model3 it is substantially worse, as we might expect (0.8953).
Let’s replot ROC for model3, adding some information on thresholds. Maybe we will want to use a threshold other than 0.5.
##
## Call:
## roc.default(response = crime$target, predictor = model3_predictions_numeric, plot = TRUE, main = "Model3", print.thres = seq(from = 0.2, to = 0.8, by = 0.1))
##
## Data: model3_predictions_numeric in 237 controls (crime$target 0) < 229 cases (crime$target 1).
## Area under the curve: 0.8953
Let's print the confusion matrix for a few higher thresholds for model3.
## [1] "Threshold = 0.6:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 202 57
## 1 35 172
##
## Accuracy : 0.8026
## 95% CI : (0.7635, 0.8378)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6044
## Mcnemar's Test P-Value : 0.02857
##
## Sensitivity : 0.7511
## Specificity : 0.8523
## Pos Pred Value : 0.8309
## Neg Pred Value : 0.7799
## Prevalence : 0.4914
## Detection Rate : 0.3691
## Detection Prevalence : 0.4442
## Balanced Accuracy : 0.8017
##
## 'Positive' Class : 1
##
## [1] "Threshold = 0.7:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 205 70
## 1 32 159
##
## Accuracy : 0.7811
## 95% CI : (0.7408, 0.8178)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5609
## Mcnemar's Test P-Value : 0.0002487
##
## Sensitivity : 0.6943
## Specificity : 0.8650
## Pos Pred Value : 0.8325
## Neg Pred Value : 0.7455
## Prevalence : 0.4914
## Detection Rate : 0.3412
## Detection Prevalence : 0.4099
## Balanced Accuracy : 0.7797
##
## 'Positive' Class : 1
##
## [1] "Threshold = 0.8:"
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 215 87
## 1 22 142
##
## Accuracy : 0.7661
## 95% CI : (0.725, 0.8038)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5298
## Mcnemar's Test P-Value : 8.783e-10
##
## Sensitivity : 0.6201
## Specificity : 0.9072
## Pos Pred Value : 0.8659
## Neg Pred Value : 0.7119
## Prevalence : 0.4914
## Detection Rate : 0.3047
## Detection Prevalence : 0.3519
## Balanced Accuracy : 0.7636
##
## 'Positive' Class : 1
##
We would have to raise the threshold to around 0.8 before getting fewer than 30 false positives, and at that point the number of false negatives climbs substantially, which doesn't seem like a good trade-off.
Despite the considerably worse performance of the model excluding big.city, I think we should proceed with the model that excludes this variable.
Excluding this variable will make the model much more generalizable. For example, what if we wanted to apply a similar model to a dataset containing only neighborhoods within a large municipality, where the target variable was calculated only within municipality limits? E.g. neighborhoods of NYC, looking at the median crime rate within only the five boroughs.
Age of housing stock and distance to jobs are very general variables that we may find outside of just this data set.
Let’s apply our model to the evaluation data.
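A sketch of this final step, under the assumption that the chosen model is "model3" (binary age + dis) and that the evaluation data needs the same age recoding as the training data:

```r
# Read evaluation data, apply the same transformations, and predict.
crime_eval <- read.csv("https://raw.githubusercontent.com/heathergeiger/Data621_hw3/master/crime-evaluation-data.csv")
crime_eval$age <- ifelse(crime_eval$age >= 80, 1, 0)

eval_probs <- predict(model3, newdata = crime_eval, type = "response")
eval_class <- ifelse(eval_probs > 0.5, 1, 0)

eval_output <- data.frame(probability = round(eval_probs, 4),
                          classification = eval_class)
head(eval_output)
```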
This output is available at my Github here:
Code used to create this output is available here:
https://github.com/heathergeiger/Data621_hw3/blob/master/geiger_heather_data621_homework3_cleaned.R