library(corrplot)
library(tidyverse)
In this homework assignment, we are asked to explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).
Our objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. Below is a short description of the variables of interest in the data set (the definitions follow the well-known housing study this data appears to be drawn from):

- zn: proportion of residential land zoned for large lots
- indus: proportion of non-retail business acres per suburb
- chas: dummy variable for whether the suburb borders the river (1) or not (0)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five employment centers
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- lstat: percentage of the population of lower socioeconomic status
- medv: median value of owner-occupied homes, in $1000s
- target: whether the crime rate is above the median (1) or not (0); the response variable
x <- "C:\\Users\\bloom\\Downloads\\crime-training-data_modified.csv"
train_set_full <- read.csv(x)
nrow(train_set_full)
## [1] 466
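Before summarizing, it is worth confirming there are no missing values; a minimal check (every column should come back 0, since the model fits below use all 466 rows):

# Count NAs per column before modeling
colSums(is.na(train_set_full))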
summary(train_set_full)
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
We see above that our categorical variables (chas and target) are treated as numeric. Let's adjust this:
train_set_full$chas <- as.factor(train_set_full$chas)
train_set_full$target <- as.factor(train_set_full$target)
summary(train_set_full)
## zn indus chas nox rm
## Min. : 0.00 Min. : 0.460 0:433 Min. :0.3890 Min. :3.863
## 1st Qu.: 0.00 1st Qu.: 5.145 1: 33 1st Qu.:0.4480 1st Qu.:5.887
## Median : 0.00 Median : 9.690 Median :0.5380 Median :6.210
## Mean : 11.58 Mean :11.105 Mean :0.5543 Mean :6.291
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.6240 3rd Qu.:6.630
## Max. :100.00 Max. :27.740 Max. :0.8710 Max. :8.780
## age dis rad tax
## Min. : 2.90 Min. : 1.130 Min. : 1.00 Min. :187.0
## 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00 1st Qu.:281.0
## Median : 77.15 Median : 3.191 Median : 5.00 Median :334.5
## Mean : 68.37 Mean : 3.796 Mean : 9.53 Mean :409.5
## 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00 3rd Qu.:666.0
## Max. :100.00 Max. :12.127 Max. :24.00 Max. :711.0
## ptratio lstat medv target
## Min. :12.6 Min. : 1.730 Min. : 5.00 0:237
## 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02 1:229
## Median :18.9 Median :11.350 Median :21.20
## Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :22.0 Max. :37.970 Max. :50.00
Let's start by seeing how correlated our numeric/continuous variables are:
train_set_full_0 <- train_set_full %>%
  # keep only the numeric predictors for the correlation matrix
  select(age, dis, indus, lstat, medv, nox, ptratio, rad, rm, tax, zn)
train_set_full_cor <- cor(train_set_full_0)
corrplot.mixed(train_set_full_cor)
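The correlation plot is easier to audit with exact values alongside it, so here is a minimal sketch listing the variable pairs whose absolute correlation exceeds 0.7 (the cutoff is an arbitrary illustrative choice):

# List variable pairs with |r| > 0.7 (upper triangle only, to avoid duplicates)
high_cor <- which(abs(train_set_full_cor) > 0.7 & upper.tri(train_set_full_cor),
                  arr.ind = TRUE)
data.frame(var1 = rownames(train_set_full_cor)[high_cor[, 1]],
           var2 = colnames(train_set_full_cor)[high_cor[, 2]],
           r = round(train_set_full_cor[high_cor], 2))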
Now, we can take a look at our distributions.
train_set_full_0 %>%
  pivot_longer(cols = everything(), names_to = "var", values_to = "val") %>%
  ggplot(aes(val)) + geom_density() + facet_wrap(~var, scales = "free")
We see some skewing above; let's see if we can address this. This new data frame will only be used for illustration purposes, as we will likely supply both the transformed and untransformed variables to our step-wise regression model.
train_set_full1 <- train_set_full_0
train_set_full1$age <- train_set_full1$age^2          # left-skewed: spread the upper tail
train_set_full1$ptratio <- train_set_full1$ptratio^2  # left-skewed: spread the upper tail
train_set_full1$dis <- log(train_set_full1$dis)       # right-skewed: compress the upper tail
train_set_full1$lstat <- log(train_set_full1$lstat)   # right-skewed: compress the upper tail
train_set_full1$nox <- log(train_set_full1$nox)       # right-skewed: compress the upper tail
train_set_full1$zn <- log(train_set_full1$zn + 1)     # mostly zeros: log(x + 1) shift
train_set_full1 %>%
  pivot_longer(cols = everything(), names_to = "var", values_to = "val") %>%
  ggplot(aes(val)) + geom_density() + facet_wrap(~var, scales = "free")
In general, we find that everything is a bit more normal after transformation. Our "zn" and "age" variables are now considerably bimodal, which is promising: a bimodal predictor whose modes line up with the two classes can be quite discriminative.
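To quantify the improvement rather than just eyeballing the densities, here is a minimal sketch comparing sample skewness before and after transformation (the skew helper is defined inline as the third standardized moment, to avoid depending on an extra package such as e1071):

# Sample skewness (third standardized moment); values near 0 are more symmetric
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
round(rbind(before = sapply(train_set_full_0, skew),
            after = sapply(train_set_full1, skew)), 2)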
We will start with a "basic" logistic regression model on the untransformed variables, as a baseline.
Logit_Model1 <- glm(target ~ ., data = train_set_full, family = binomial(link = "logit"))
summary(Logit_Model1)
##
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"),
## data = train_set_full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8464 -0.1445 -0.0017 0.0029 3.4665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
## zn -0.065946 0.034656 -1.903 0.05706 .
## indus -0.064614 0.047622 -1.357 0.17485
## chas1 0.910765 0.755546 1.205 0.22803
## nox 49.122297 7.931706 6.193 5.90e-10 ***
## rm -0.587488 0.722847 -0.813 0.41637
## age 0.034189 0.013814 2.475 0.01333 *
## dis 0.738660 0.230275 3.208 0.00134 **
## rad 0.666366 0.163152 4.084 4.42e-05 ***
## tax -0.006171 0.002955 -2.089 0.03674 *
## ptratio 0.402566 0.126627 3.179 0.00148 **
## lstat 0.045869 0.054049 0.849 0.39608
## medv 0.180824 0.068294 2.648 0.00810 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 192.05 on 453 degrees of freedom
## AIC: 218.05
##
## Number of Fisher Scoring iterations: 9
With a residual deviance of about 192 against a null deviance of about 646, these variables appear fairly predictive.
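A likelihood-ratio test makes that comparison formal. Here is a minimal sketch computing the chi-squared p-value of the full model against the intercept-only null (given the deviance drop, it should be vanishingly small):

# Likelihood-ratio test: full model vs. intercept-only null
with(Logit_Model1, pchisq(null.deviance - deviance, df.null - df.residual,
                          lower.tail = FALSE))

We will now drop the predictors that were not significant above and refit a reduced model (a manual step-wise reduction):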
Logit_Model2 <- glm(target ~ nox + age + dis + rad + tax + ptratio + medv + zn, data = train_set_full, family = binomial(link = "logit"))
summary(Logit_Model2)
##
## Call:
## glm(formula = target ~ nox + age + dis + rad + tax + ptratio +
## medv + zn, family = binomial(link = "logit"), data = train_set_full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8295 -0.1752 -0.0021 0.0032 3.4191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## nox 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## rad 0.725109 0.149788 4.841 1.29e-06 ***
## tax -0.007756 0.002653 -2.924 0.00346 **
## ptratio 0.323628 0.111390 2.905 0.00367 **
## medv 0.110472 0.035445 3.117 0.00183 **
## zn -0.068648 0.032019 -2.144 0.03203 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
Our residual deviance increased slightly to 197, but the AIC dropped to 215, so the reduced model achieves a better fit-complexity trade-off; it is the more parsimonious choice.
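As a quick check, the two fits can be compared directly; a minimal sketch, where the likelihood-ratio test asks whether the four dropped predictors (indus, chas, rm, lstat) were jointly useful:

# Side-by-side AIC of the full and reduced logit models
AIC(Logit_Model1, Logit_Model2)
# Likelihood-ratio test: reduced vs. full model
anova(Logit_Model2, Logit_Model1, test = "Chisq")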
We will now add in our transformed variables and let a step-wise regression (backward elimination via step()) determine whether they add predictive value.
Logit_Model3 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv +
  age^2 +       # note: without I(), ^2 is formula crossing, not arithmetic;
  ptratio^2 +   # these two terms silently collapse into age and ptratio, so
  log(dis) +    # only the log transforms actually enter the model
  log(lstat) +
  log(nox) +
  log(zn + 1), data = train_set_full, family = binomial(link = "logit"))
summary(step(Logit_Model3))
## Start: AIC=202.64
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv + age^2 + ptratio^2 + log(dis) + log(lstat) +
## log(nox) + log(zn + 1)
##
## Df Deviance AIC
## - indus 1 168.64 200.64
## - log(zn + 1) 1 168.72 200.72
## - chas 1 168.82 200.82
## - zn 1 169.81 201.81
## - rm 1 169.90 201.90
## <none> 168.63 202.63
## - log(lstat) 1 172.26 204.26
## - lstat 1 173.04 205.04
## - tax 1 173.21 205.21
## - medv 1 173.53 205.53
## - log(nox) 1 177.89 209.89
## - age 1 178.55 210.55
## - dis 1 180.65 212.65
## - ptratio 1 180.73 212.73
## - nox 1 181.24 213.24
## - log(dis) 1 187.76 219.76
## - rad 1 218.84 250.84
##
## Step: AIC=200.64
## target ~ zn + chas + nox + rm + age + dis + rad + tax + ptratio +
## lstat + medv + log(dis) + log(lstat) + log(nox) + log(zn +
## 1)
##
## Df Deviance AIC
## - log(zn + 1) 1 168.72 198.72
## - chas 1 168.85 198.85
## - zn 1 169.82 199.82
## - rm 1 169.90 199.90
## <none> 168.64 200.64
## - log(lstat) 1 172.29 202.29
## - lstat 1 173.05 203.05
## - medv 1 173.53 203.53
## - tax 1 174.11 204.11
## - log(nox) 1 177.97 207.97
## - age 1 178.55 208.55
## - ptratio 1 180.89 210.89
## - nox 1 181.35 211.35
## - dis 1 182.25 212.25
## - log(dis) 1 190.73 220.73
## - rad 1 225.34 255.34
##
## Step: AIC=198.72
## target ~ zn + chas + nox + rm + age + dis + rad + tax + ptratio +
## lstat + medv + log(dis) + log(lstat) + log(nox)
##
## Df Deviance AIC
## - chas 1 168.91 196.91
## - rm 1 169.94 197.94
## <none> 168.72 198.72
## - log(lstat) 1 172.30 200.30
## - zn 1 172.56 200.56
## - lstat 1 173.18 201.18
## - medv 1 173.81 201.81
## - tax 1 174.14 202.14
## - log(nox) 1 177.97 205.97
## - age 1 178.71 206.71
## - nox 1 181.35 209.35
## - ptratio 1 181.88 209.88
## - dis 1 182.87 210.87
## - log(dis) 1 190.91 218.91
## - rad 1 225.35 253.35
##
## Step: AIC=196.91
## target ~ zn + nox + rm + age + dis + rad + tax + ptratio + lstat +
## medv + log(dis) + log(lstat) + log(nox)
##
## Df Deviance AIC
## - rm 1 170.24 196.24
## <none> 168.91 196.91
## - log(lstat) 1 172.61 198.61
## - zn 1 173.09 199.09
## - lstat 1 173.64 199.64
## - medv 1 174.13 200.13
## - tax 1 174.76 200.76
## - log(nox) 1 178.55 204.55
## - age 1 179.49 205.49
## - ptratio 1 181.89 207.89
## - nox 1 181.99 207.99
## - dis 1 183.61 209.61
## - log(dis) 1 191.61 217.61
## - rad 1 229.19 255.19
##
## Step: AIC=196.24
## target ~ zn + nox + age + dis + rad + tax + ptratio + lstat +
## medv + log(dis) + log(lstat) + log(nox)
##
## Df Deviance AIC
## <none> 170.24 196.24
## - log(lstat) 1 173.61 197.61
## - medv 1 174.61 198.61
## - zn 1 175.57 199.57
## - lstat 1 175.71 199.71
## - tax 1 176.51 200.51
## - age 1 179.88 203.88
## - log(nox) 1 180.09 204.09
## - ptratio 1 181.90 205.90
## - nox 1 183.53 207.53
## - dis 1 185.10 209.10
## - log(dis) 1 192.54 216.54
## - rad 1 229.40 253.40
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## lstat + medv + log(dis) + log(lstat) + log(nox), family = binomial(link = "logit"),
## data = train_set_full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9764 -0.2218 -0.0010 0.0003 3.2518
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.304e+02 9.399e+01 -3.516 0.000438 ***
## zn -6.446e-02 3.253e-02 -1.982 0.047525 *
## nox 3.794e+02 1.071e+02 3.541 0.000398 ***
## age 3.640e-02 1.250e-02 2.911 0.003598 **
## dis -3.365e+00 9.550e-01 -3.524 0.000425 ***
## rad 8.766e-01 1.742e-01 5.031 4.88e-07 ***
## tax -7.373e-03 3.100e-03 -2.378 0.017407 *
## ptratio 4.256e-01 1.301e-01 3.270 0.001075 **
## lstat 2.752e-01 1.211e-01 2.272 0.023084 *
## medv 1.092e-01 5.241e-02 2.084 0.037162 *
## log(dis) 1.672e+01 3.953e+00 4.229 2.34e-05 ***
## log(lstat) -2.889e+00 1.607e+00 -1.797 0.072323 .
## log(nox) -1.713e+02 5.477e+01 -3.127 0.001763 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 170.24 on 453 degrees of freedom
## AIC: 196.24
##
## Number of Fisher Scoring iterations: 9
With a residual deviance of 170 and an AIC of 196, taking the log of "dis", "lstat", and "nox" does in fact add significant predictive value (the bare squared terms, as noted above, never actually entered the model).
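Before moving on, a rough sketch of in-sample classification performance for the step-selected model at a 0.5 cutoff (Logit_Step is a name introduced here to hold the step() result; in-sample accuracy is optimistic, so treat this as illustrative only):

# Re-run the selection quietly and keep the chosen model
Logit_Step <- step(Logit_Model3, trace = FALSE)
# Classify at a 0.5 probability cutoff and tabulate against the truth
pred_class <- ifelse(predict(Logit_Step, type = "response") > 0.5, 1, 0)
table(predicted = pred_class, actual = train_set_full$target)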
We will now repeat the above with a Probit Model.
Probit_Model1 <- glm(target ~ ., data = train_set_full, family = binomial(link = "probit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(Probit_Model1)
##
## Call:
## glm(formula = target ~ ., family = binomial(link = "probit"),
## data = train_set_full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8378 -0.1214 0.0000 0.0000 3.6075
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.747727 3.471508 -6.265 3.74e-10 ***
## zn -0.027575 0.016908 -1.631 0.10291
## indus -0.029877 0.027201 -1.098 0.27204
## chas1 0.492849 0.412977 1.193 0.23271
## nox 26.331909 4.166104 6.321 2.61e-10 ***
## rm -0.262020 0.393514 -0.666 0.50551
## age 0.014564 0.007182 2.028 0.04258 *
## dis 0.352634 0.119663 2.947 0.00321 **
## rad 0.365330 0.088570 4.125 3.71e-05 ***
## tax -0.003858 0.001670 -2.309 0.02092 *
## ptratio 0.215668 0.069623 3.098 0.00195 **
## lstat 0.036394 0.029595 1.230 0.21880
## medv 0.092848 0.036097 2.572 0.01011 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 195.73 on 453 degrees of freedom
## AIC: 221.73
##
## Number of Fisher Scoring iterations: 10
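The glm.fit warning above ("fitted probabilities numerically 0 or 1 occurred") typically signals quasi-complete separation: some neighborhoods are predicted with near certainty. A minimal sketch to count how many fitted probabilities sit at the numerical boundary (the 1e-8 tolerance is an arbitrary illustrative choice):

# Count fitted probabilities that are numerically 0 or 1
p_hat <- fitted(Probit_Model1)
sum(p_hat < 1e-8 | p_hat > 1 - 1e-8)

The fit is still usable for comparison. As with the logit model, we now add the transformed variables and let step() choose: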
Probit_Model2 <- glm(target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat + medv +
  age^2 +       # as with the logit model, these collapse into age and
  ptratio^2 +   # ptratio without I(); only the log terms enter the fit
  log(dis) +
  log(lstat) +
  log(nox) +
  log(zn + 1), data = train_set_full, family = binomial(link = "probit"))
summary(step(Probit_Model2))
## Start: AIC=206.26
## target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + lstat + medv + age^2 + ptratio^2 + log(dis) + log(lstat) +
## log(nox) + log(zn + 1)
##
## Df Deviance AIC
## - indus 1 172.28 204.28
## - chas 1 172.51 204.51
## - log(zn + 1) 1 172.67 204.67
## - rm 1 173.69 205.69
## <none> 172.26 206.26
## - zn 1 174.35 206.35
## - log(lstat) 1 175.68 207.68
## - tax 1 177.11 209.11
## - lstat 1 177.18 209.18
## - medv 1 177.25 209.25
## - age 1 180.83 212.83
## - log(nox) 1 182.55 214.55
## - dis 1 185.51 217.51
## - ptratio 1 185.52 217.52
## - nox 1 186.06 218.06
## - log(dis) 1 191.90 223.90
## - rad 1 221.59 253.59
##
## Step: AIC=204.28
## target ~ zn + chas + nox + rm + age + dis + rad + tax + ptratio +
## lstat + medv + log(dis) + log(lstat) + log(nox) + log(zn +
## 1)
##
## Df Deviance AIC
## - chas 1 172.59 202.59
## - log(zn + 1) 1 172.67 202.67
## - rm 1 173.69 203.69
## <none> 172.28 204.28
## - zn 1 174.35 204.35
## - log(lstat) 1 175.68 205.68
## - lstat 1 177.20 207.20
## - medv 1 177.25 207.25
## - tax 1 177.75 207.75
## - age 1 180.84 210.84
## - log(nox) 1 182.74 212.74
## - ptratio 1 185.70 215.70
## - nox 1 186.34 216.34
## - dis 1 186.76 216.76
## - log(dis) 1 194.00 224.00
## - rad 1 227.79 257.79
##
## Step: AIC=202.58
## target ~ zn + nox + rm + age + dis + rad + tax + ptratio + lstat +
## medv + log(dis) + log(lstat) + log(nox) + log(zn + 1)
##
## Df Deviance AIC
## - log(zn + 1) 1 172.92 200.92
## - rm 1 174.08 202.08
## <none> 172.59 202.59
## - zn 1 174.62 202.62
## - log(lstat) 1 176.04 204.04
## - medv 1 177.72 205.72
## - lstat 1 177.74 205.74
## - tax 1 178.43 206.43
## - age 1 181.81 209.81
## - log(nox) 1 183.46 211.46
## - ptratio 1 185.70 213.70
## - nox 1 187.10 215.10
## - dis 1 187.48 215.48
## - log(dis) 1 194.74 222.74
## - rad 1 231.15 259.15
##
## Step: AIC=200.92
## target ~ zn + nox + rm + age + dis + rad + tax + ptratio + lstat +
## medv + log(dis) + log(lstat) + log(nox)
##
## Df Deviance AIC
## - rm 1 174.31 200.31
## <none> 172.92 200.92
## - log(lstat) 1 176.13 202.13
## - zn 1 176.98 202.98
## - lstat 1 177.98 203.98
## - medv 1 178.28 204.28
## - tax 1 178.69 204.69
## - age 1 182.10 208.10
## - log(nox) 1 183.57 209.57
## - ptratio 1 186.26 212.26
## - nox 1 187.25 213.25
## - dis 1 187.76 213.76
## - log(dis) 1 194.75 220.75
## - rad 1 231.25 257.25
##
## Step: AIC=200.3
## target ~ zn + nox + age + dis + rad + tax + ptratio + lstat +
## medv + log(dis) + log(lstat) + log(nox)
##
## Df Deviance AIC
## <none> 174.31 200.31
## - log(lstat) 1 177.11 201.11
## - medv 1 178.99 202.99
## - zn 1 179.09 203.09
## - lstat 1 179.84 203.84
## - tax 1 180.71 204.71
## - age 1 182.37 206.37
## - log(nox) 1 184.97 208.97
## - ptratio 1 186.29 210.29
## - nox 1 188.60 212.60
## - dis 1 189.03 213.03
## - log(dis) 1 195.50 219.50
## - rad 1 231.40 255.40
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## lstat + medv + log(dis) + log(lstat) + log(nox), family = binomial(link = "probit"),
## data = train_set_full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0462 -0.2287 0.0000 0.0000 3.3557
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.785e+02 4.666e+01 -3.826 0.000130 ***
## zn -2.998e-02 1.539e-02 -1.947 0.051488 .
## nox 2.052e+02 5.324e+01 3.854 0.000116 ***
## age 1.667e-02 6.378e-03 2.613 0.008971 **
## dis -1.690e+00 4.661e-01 -3.625 0.000289 ***
## rad 4.833e-01 9.033e-02 5.350 8.79e-08 ***
## tax -4.343e-03 1.710e-03 -2.539 0.011108 *
## ptratio 2.344e-01 6.876e-02 3.408 0.000654 ***
## lstat 1.470e-01 6.366e-02 2.309 0.020928 *
## medv 5.913e-02 2.809e-02 2.105 0.035317 *
## log(dis) 8.541e+00 1.981e+00 4.311 1.63e-05 ***
## log(lstat) -1.430e+00 8.430e-01 -1.696 0.089818 .
## log(nox) -9.240e+01 2.713e+01 -3.406 0.000660 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 174.30 on 453 degrees of freedom
## AIC: 200.3
##
## Number of Fisher Scoring iterations: 12
The logistic regression yields slightly better results than the probit: a lower AIC (196.24 vs. 200.31) and a lower residual deviance (170.24 vs. 174.31).
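That comparison can be reproduced directly; a minimal sketch, where Probit_Step (like Logit_Step above) is a name introduced here to hold the step() result:

# Side-by-side AIC of the step-selected logit and probit models
Probit_Step <- step(Probit_Model2, trace = FALSE)
AIC(Logit_Step, Probit_Step)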