In this assignment, we will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the neighborhood’s crime rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
We will first load the packages needed for this assignment:
Next, we will load the data.
Then we’ll take a look at the dataset.
## [1] 466 13
## zn indus chas nox rm age dis rad tax ptratio lstat medv target
## 1 0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 3.70 50.0 1
## 2 0 19.58 1 0.871 5.403 100.0 1.3216 5 403 14.7 26.82 13.4 1
## 3 0 18.10 0 0.740 6.485 100.0 1.9784 24 666 20.2 18.85 15.4 1
## 4 30 4.93 0 0.428 6.393 7.8 7.0355 6 300 16.6 5.19 23.7 0
## 5 0 2.46 0 0.488 7.155 92.2 2.7006 3 193 17.8 4.82 37.9 0
## 6 0 8.56 0 0.520 6.781 71.3 2.8561 5 384 20.9 7.67 26.5 0
## 7 0 18.10 0 0.693 5.453 100.0 1.4896 24 666 20.2 30.59 5.0 1
## 8 0 18.10 0 0.693 4.519 100.0 1.6582 24 666 20.2 36.98 7.0 1
## 9 0 5.19 0 0.515 6.316 38.1 6.4584 5 224 20.2 5.68 22.2 0
## 10 80 3.64 0 0.392 5.876 19.1 9.2203 1 315 16.4 9.25 20.9 0
Next, we’ll review a summary of the data.
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
Next, we’ll take a preview of the data.
## Rows: 466
## Columns: 13
## $ zn <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100, 20, 0…
## $ indus <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.19, 3.6…
## $ chas <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.515,…
## $ rm <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316,…
## $ age <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1, 19.1,…
## $ dis <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, 1.6582…
## $ rad <int> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5, 24, …
## $ tax <int> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 398, 66…
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16.4, 19…
## $ lstat <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68, 9.25…
## $ medv <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9, 24.8…
## $ target <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…
Next, we’ll use skim to get a final summary.
| Name | crime_training |
| Number of rows | 466 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| zn | 0 | 1 | 11.58 | 23.36 | 0.00 | 0.00 | 0.00 | 16.25 | 100.00 | ▇▁▁▁▁ |
| indus | 0 | 1 | 11.11 | 6.85 | 0.46 | 5.15 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
| chas | 0 | 1 | 0.07 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| nox | 0 | 1 | 0.55 | 0.12 | 0.39 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▅▃▁ |
| rm | 0 | 1 | 6.29 | 0.70 | 3.86 | 5.89 | 6.21 | 6.63 | 8.78 | ▁▂▇▂▁ |
| age | 0 | 1 | 68.37 | 28.32 | 2.90 | 43.88 | 77.15 | 94.10 | 100.00 | ▂▂▂▃▇ |
| dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.19 | 5.21 | 12.13 | ▇▅▂▁▁ |
| rad | 0 | 1 | 9.53 | 8.69 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
| tax | 0 | 1 | 409.50 | 167.90 | 187.00 | 281.00 | 334.50 | 666.00 | 711.00 | ▇▇▅▁▇ |
| ptratio | 0 | 1 | 18.40 | 2.20 | 12.60 | 16.90 | 18.90 | 20.20 | 22.00 | ▁▃▅▅▇ |
| lstat | 0 | 1 | 12.63 | 7.10 | 1.73 | 7.04 | 11.35 | 16.93 | 37.97 | ▇▇▅▂▁ |
| medv | 0 | 1 | 22.59 | 9.24 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
| target | 0 | 1 | 0.49 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
We can see that we’re not missing any data. We see the mean for each variable along with the standard deviation and some other descriptive statistics.
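The same check can be done without skimr by counting NA values per column in base R. The sketch below is illustrative only: it uses a hypothetical data frame `crime_sample` built from the first three rows printed earlier, not the full training file.

```r
# Toy subset of the training data (first three rows printed above);
# `crime_sample` is a hypothetical name used only for this sketch.
crime_sample <- data.frame(
  zn     = c(0, 0, 0),
  indus  = c(19.58, 19.58, 18.10),
  rm     = c(7.929, 5.403, 6.485),
  target = c(1L, 1L, 1L)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(crime_sample))

# Rows without any missing value (DataExplorer reports this as complete_rows).
n_complete_rows <- sum(complete.cases(crime_sample))

print(na_per_column)
print(n_complete_rows)
```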
Now, we’ll explore the data with the DataExplorer package.
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 466 13 0 13 0
## total_missing_values complete_rows total_observations memory_usage
## 1 0 466 6058 44440
The DataExplorer report gives an overview of the data structure: the univariate distribution of each variable, QQ plots, a correlation analysis and a principal component analysis (PCA).
Next, we can use the above exploration to inform how we’ll process our data. We already know there is no missing data, so we don’t need to account for that. The correlation matrix does show something interesting: several pairs of features have correlations above 0.75 in absolute value.
Let’s start by addressing those. We see in the chart above that ‘nox’ and ‘dis’ are highly correlated, as are ‘nox’ and ‘tax’. Reviewing the data dictionary, however, ‘nox’ seems unrelated to the subject of the dataset:
nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)
These high correlations could be spurious, as discussed in the reading, so we will go ahead and remove the ‘nox’ variable.
This should improve our correlation matrices as well as the interpretability of the model:
We still see a high correlation between rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000), but it is not clear that this is a genuine relationship: in the QQ plots, both variables follow a stepped distribution. For now, we leave both in ahead of model development.
Our chas feature has very low correlation with all the other features, likely because it is so imbalanced:
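The pairs flagged in the correlation matrix can also be listed programmatically. The following is a minimal sketch in base R, assuming a cutoff of 0.75 on the absolute correlation; the helper name `high_cor_pairs` is hypothetical, and the data are only the first ten rows printed earlier, so the numbers will differ from the full data set.

```r
# Hypothetical helper: list variable pairs whose absolute Pearson
# correlation exceeds `cutoff`. Each pair is reported once (upper triangle).
high_cor_pairs <- function(df, cutoff = 0.75) {
  cm <- cor(df)
  hits <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[hits[, "row"]],
    var2 = colnames(cm)[hits[, "col"]],
    r    = cm[hits],
    stringsAsFactors = FALSE
  )
}

# First ten training rows shown earlier (rm, rad and tax only).
crime_head <- data.frame(
  rm  = c(7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316, 5.876),
  rad = c(5, 5, 24, 6, 3, 5, 24, 24, 5, 1),
  tax = c(403, 403, 666, 300, 193, 384, 666, 666, 224, 315)
)

pairs_found <- high_cor_pairs(crime_head)
print(pairs_found)
```

On this small sample the rad and tax pair is flagged, in line with the observation above.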
## # A tibble: 2 × 2
## chas n
## <int> <int>
## 1 0 433
## 2 1 33
As a result, it seems unlikely to influence the model, but we can leave it in, as it also seems unlikely to do harm.
We also checked the balance of our target. It is fairly even, so we do not need to up- or down-sample.
## # A tibble: 2 × 2
## target n
## <int> <int>
## 1 0 237
## 2 1 229
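As a quick illustration of the balance check, the class shares can be derived from the counts above. This sketch rebuilds a target vector from the reported counts (237 and 229) rather than reading the real data.

```r
# Rebuild a 0/1 vector with the class counts reported above.
target <- rep(c(0L, 1L), times = c(237L, 229L))

# Absolute counts and relative shares per class.
class_counts <- table(target)
class_shares <- prop.table(class_counts)

print(class_counts)
print(round(class_shares, 4))
```

The share of class 1 is about 0.49, which agrees with the mean of target in the summary output.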
Finally, we can create some new features using transformations. Specifically, we want to try a feature that combines existing predictors with low correlation to each other. We create ‘tax/room’ (‘roomtax’ in the model output below).
We can also convert variables whose histograms show a clear break into binary features that align with where the data splits. The tax histogram has a clear cutoff around 500 (‘tax_over_500’), the zn histogram shows a big gap between 0 and values above 0 (‘zn_bool’), and rad cuts off around 10 (‘rad_bool’). These booleans provide more options for model development.
Now that we have explored and cleaned the data, let’s build some models. The first is a general model with all variables. The second keeps the predictors with the best p-values and tries some transformations suggested by the histogram transformation plots above. The third uses AIC for feature selection. Before we begin, we split the training data into a training set and a test set, so that we can evaluate the models ourselves before predicting on the evaluation data set.
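The derived features might be built along the following lines. This is a sketch under assumptions: the column names match those in the model output, the cutoffs (500, 0, 10) come from the text, but the exact formula and scaling of roomtax are not given, so plain tax divided by rm is used as a guess. The data are the first ten rows printed earlier.

```r
# First ten training rows shown earlier (only the columns needed here).
crime <- data.frame(
  zn  = c(0, 0, 0, 30, 0, 0, 0, 0, 0, 80),
  rm  = c(7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316, 5.876),
  rad = c(5, 5, 24, 6, 3, 5, 24, 24, 5, 1),
  tax = c(403, 403, 666, 300, 193, 384, 666, 666, 224, 315)
)

# ASSUMPTION: the ratio feature is tax divided by rooms; the original
# formula and any rescaling are not shown in the text.
crime$roomtax <- crime$tax / crime$rm

# Binary flags at the cut points read off the histograms.
crime$tax_over_500 <- as.integer(crime$tax > 500)
crime$zn_bool      <- as.integer(crime$zn > 0)
crime$rad_bool     <- as.integer(crime$rad > 10)

print(crime)
```

In these ten rows tax_over_500 and rad_bool take identical values, which hints that such flags may carry overlapping information.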
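The split itself is not shown. The model output below reports 373 training rows (372 null degrees of freedom plus one) and the confusion matrices contain 93 test rows, which is consistent with an 80/20 split of the 466 records. A minimal base R sketch of such a split follows; the seed, the simple random sampling and the name `test_set` are assumptions, and a placeholder data frame stands in for the processed data.

```r
set.seed(42)  # hypothetical seed, chosen only for reproducibility

# Placeholder with the same number of rows and class counts as the
# training data; the real predictors are omitted.
n_rows <- 466
crime_toy <- data.frame(
  id     = seq_len(n_rows),
  target = rep(c(0L, 1L), times = c(237L, 229L))
)

# Draw 80% of the row indices at random for training; the rest is the test set.
train_idx    <- sample(seq_len(n_rows), size = round(0.8 * n_rows))
training_set <- crime_toy[train_idx, ]
test_set     <- crime_toy[-train_idx, ]

print(c(train = nrow(training_set), test = nrow(test_set)))
```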
##
## Call:
## glm(formula = target ~ ., family = binomial, data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.60584 -0.44017 -0.00611 0.00012 2.34924
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.646e+00 3.886e+00 -2.482 0.0131 *
## zn -8.280e-02 5.670e-02 -1.460 0.1442
## indus 7.946e-02 4.132e-02 1.923 0.0545 .
## chas -2.324e-01 6.434e-01 -0.361 0.7180
## rm 8.852e-01 8.663e-01 1.022 0.3069
## age 1.908e-02 1.115e-02 1.711 0.0870 .
## dis -3.114e-01 1.774e-01 -1.755 0.0792 .
## rad 4.242e-01 1.699e-01 2.496 0.0126 *
## tax -7.457e-03 1.025e-02 -0.728 0.4669
## ptratio 6.104e-03 1.034e-01 0.059 0.9529
## lstat 9.711e-02 5.025e-02 1.933 0.0533 .
## medv -3.186e-02 1.281e-01 -0.249 0.8036
## roomtax 2.137e+00 2.797e+00 0.764 0.4449
## tax_over_500 -2.029e+01 6.130e+03 -0.003 0.9974
## zn_bool 1.417e+00 1.420e+00 0.998 0.3184
## rad_bool 3.152e+01 6.212e+03 0.005 0.9960
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 516.96 on 372 degrees of freedom
## Residual deviance: 210.17 on 357 degrees of freedom
## AIC: 242.17
##
## Number of Fisher Scoring iterations: 18
## zn indus chas rm age dis
## 10.108999 2.015331 1.146665 10.963475 2.220217 2.435420
## rad tax ptratio lstat medv roomtax
## 1.849561 17.777649 2.053556 2.782750 35.578614 18.233430
## tax_over_500 zn_bool rad_bool
## 38.201709 12.182433 38.201720
##
## Call:
## glm(formula = target ~ zn + indus + age + dis + rad + lstat +
## sqrt(medv), family = binomial, data = crime_training_processed)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.74763 -0.42517 -0.02679 0.01360 2.61892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.738744 2.332749 -3.746 0.000180 ***
## zn -0.039128 0.018246 -2.144 0.031999 *
## indus 0.037217 0.030252 1.230 0.218611
## age 0.034843 0.008971 3.884 0.000103 ***
## dis -0.210012 0.137918 -1.523 0.127826
## rad 0.498795 0.113041 4.412 1.02e-05 ***
## lstat 0.053315 0.039088 1.364 0.172578
## sqrt(medv) 0.683501 0.302379 2.260 0.023796 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 272.43 on 458 degrees of freedom
## AIC: 288.43
##
## Number of Fisher Scoring iterations: 8
## zn indus age dis rad lstat sqrt(medv)
## 1.405958 1.849807 1.639744 1.929138 1.100535 2.357435 2.669292
In this refit, indus (p = 0.219) and lstat (p = 0.173) are not significant at the 5% level, so we drop both and fit the model again:
##
## Call:
## glm(formula = target ~ zn + age + dis + rad + sqrt(medv), family = binomial,
## data = crime_training_processed)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.78314 -0.42893 -0.02715 0.01550 2.69725
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.754153 1.653276 -3.480 0.000501 ***
## zn -0.037930 0.018083 -2.098 0.035948 *
## age 0.038886 0.008821 4.408 1.04e-05 ***
## dis -0.295646 0.125955 -2.347 0.018913 *
## rad 0.485413 0.110208 4.405 1.06e-05 ***
## sqrt(medv) 0.288344 0.205799 1.401 0.161185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 276.14 on 460 degrees of freedom
## AIC: 288.14
##
## Number of Fisher Scoring iterations: 8
## zn age dis rad sqrt(medv)
## 1.368793 1.537031 1.573063 1.081409 1.297983
Now sqrt(medv) is no longer significant either (p = 0.161), even though its p-value in the previous model was 0.024, so we drop it as well:
##
## Call:
## glm(formula = target ~ zn + age + dis + rad, family = binomial,
## data = crime_training_processed)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.81929 -0.43664 -0.03360 0.01256 2.82478
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.027624 1.056269 -3.813 0.000137 ***
## zn -0.029838 0.016287 -1.832 0.066938 .
## age 0.035154 0.008244 4.264 2.01e-05 ***
## dis -0.345765 0.120894 -2.860 0.004235 **
## rad 0.499628 0.108919 4.587 4.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 278.14 on 461 degrees of freedom
## AIC: 288.14
##
## Number of Fisher Scoring iterations: 8
## zn age dis rad
## 1.217808 1.377139 1.467963 1.072801
Now, let’s build an AIC model. AIC stands for Akaike Information Criterion. It is an estimator of prediction error that penalizes model complexity, and it plays a role similar to the adjusted R-squared we see in linear regression summaries. The stepwise procedure below removes one predictor at a time for as long as doing so lowers the AIC. A lower AIC indicates a better trade-off between fit and parsimony relative to other models fit to the same data, so AIC measures the relative, not absolute, quality of a model.
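For a logistic regression on a 0/1 response, the saturated model has log-likelihood zero, so the residual deviance equals minus twice the log-likelihood and the AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below checks this identity on simulated data (the variables `x` and `y` are made up for illustration) and then applies it to the figures reported in the model summaries.

```r
set.seed(1)  # simulated data, for illustration only

x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

# AIC by hand: residual deviance + 2 * number of coefficients.
n_coef      <- length(coef(fit))
aic_by_hand <- deviance(fit) + 2 * n_coef
print(c(by_hand = aic_by_hand, built_in = AIC(fit)))

# The same arithmetic with the residual deviances reported in this document:
# 16 coefficients in the full model, 9 in the final stepwise model.
reported_aic <- c(full_model = 210.17 + 2 * 16,
                  stepwise   = 214.97 + 2 * 9)
print(reported_aic)
```

The two reported values come out as 242.17 and 232.97, matching the AIC lines in the corresponding summaries.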
## Start: AIC=242.17
## target ~ zn + indus + chas + rm + age + dis + rad + tax + ptratio +
## lstat + medv + roomtax + tax_over_500 + zn_bool + rad_bool
##
## Df Deviance AIC
## - ptratio 1 210.18 240.18
## - medv 1 210.24 240.24
## - chas 1 210.30 240.30
## - tax 1 210.71 240.71
## - roomtax 1 210.78 240.78
## - rad_bool 1 211.08 241.08
## - rm 1 211.24 241.24
## - zn_bool 1 211.33 241.33
## <none> 210.17 242.17
## - age 1 213.16 243.16
## - dis 1 213.43 243.43
## - tax_over_500 1 213.57 243.57
## - lstat 1 213.82 243.82
## - zn 1 214.00 244.00
## - indus 1 214.02 244.02
## - rad 1 217.27 247.27
##
## Step: AIC=240.18
## target ~ zn + indus + chas + rm + age + dis + rad + tax + lstat +
## medv + roomtax + tax_over_500 + zn_bool + rad_bool
##
## Df Deviance AIC
## - medv 1 210.24 238.24
## - chas 1 210.31 238.31
## - tax 1 210.71 238.71
## - roomtax 1 210.78 238.78
## - rad_bool 1 211.10 239.10
## - rm 1 211.27 239.27
## - zn_bool 1 211.52 239.52
## <none> 210.18 240.18
## - age 1 213.19 241.19
## - tax_over_500 1 213.57 241.57
## - dis 1 213.59 241.59
## - lstat 1 213.83 241.83
## - indus 1 214.03 242.03
## - zn 1 214.07 242.07
## - rad 1 217.38 245.38
##
## Step: AIC=238.24
## target ~ zn + indus + chas + rm + age + dis + rad + tax + lstat +
## roomtax + tax_over_500 + zn_bool + rad_bool
##
## Df Deviance AIC
## - chas 1 210.38 236.38
## - rad_bool 1 211.12 237.12
## - zn_bool 1 211.54 237.54
## - tax 1 211.56 237.56
## - roomtax 1 211.94 237.94
## <none> 210.24 238.24
## - rm 1 212.57 238.57
## - age 1 213.42 239.42
## - dis 1 213.64 239.64
## - lstat 1 213.85 239.85
## - zn 1 214.12 240.12
## - indus 1 214.30 240.30
## - tax_over_500 1 214.56 240.56
## - rad 1 217.83 243.83
##
## Step: AIC=236.38
## target ~ zn + indus + rm + age + dis + rad + tax + lstat + roomtax +
## tax_over_500 + zn_bool + rad_bool
##
## Df Deviance AIC
## - rad_bool 1 211.36 235.36
## - tax 1 211.56 235.56
## - zn_bool 1 211.78 235.78
## - roomtax 1 212.02 236.02
## <none> 210.38 236.38
## - rm 1 212.69 236.69
## - age 1 213.52 237.52
## - dis 1 213.82 237.82
## - lstat 1 213.87 237.87
## - indus 1 214.30 238.30
## - zn 1 214.35 238.35
## - tax_over_500 1 214.95 238.95
## - rad 1 217.91 241.91
##
## Step: AIC=235.36
## target ~ zn + indus + rm + age + dis + rad + tax + lstat + roomtax +
## tax_over_500 + zn_bool
##
## Df Deviance AIC
## - roomtax 1 212.71 234.71
## - zn_bool 1 212.83 234.83
## - tax 1 212.84 234.84
## <none> 211.36 235.36
## - rm 1 213.83 235.83
## - age 1 214.34 236.34
## - lstat 1 214.82 236.82
## - tax_over_500 1 214.97 236.97
## - zn 1 215.36 237.36
## - dis 1 215.40 237.40
## - indus 1 215.92 237.92
## - rad 1 248.59 270.59
##
## Step: AIC=234.71
## target ~ zn + indus + rm + age + dis + rad + tax + lstat + tax_over_500 +
## zn_bool
##
## Df Deviance AIC
## - tax 1 213.09 233.09
## - age 1 214.70 234.70
## <none> 212.71 234.71
## - zn_bool 1 214.90 234.90
## - lstat 1 215.58 235.58
## - tax_over_500 1 216.62 236.62
## - indus 1 217.52 237.52
## - zn 1 217.59 237.59
## - rm 1 217.90 237.90
## - dis 1 219.28 239.28
## - rad 1 252.38 272.38
##
## Step: AIC=233.09
## target ~ zn + indus + rm + age + dis + rad + lstat + tax_over_500 +
## zn_bool
##
## Df Deviance AIC
## - age 1 214.97 232.97
## <none> 213.09 233.09
## - zn_bool 1 215.32 233.32
## - lstat 1 215.96 233.96
## - indus 1 217.57 235.57
## - zn 1 218.03 236.03
## - rm 1 218.74 236.74
## - dis 1 219.40 237.40
## - tax_over_500 1 220.27 238.27
## - rad 1 254.35 272.35
##
## Step: AIC=232.97
## target ~ zn + indus + rm + dis + rad + lstat + tax_over_500 +
## zn_bool
##
## Df Deviance AIC
## <none> 214.97 232.97
## - zn_bool 1 218.20 234.20
## - indus 1 220.92 236.92
## - lstat 1 221.02 237.02
## - zn 1 221.78 237.78
## - tax_over_500 1 222.49 238.49
## - rm 1 223.59 239.59
## - dis 1 227.54 243.54
## - rad 1 256.36 272.36
##
## Call:
## glm(formula = target ~ zn + indus + rm + dis + rad + lstat +
## tax_over_500 + zn_bool, family = binomial, data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.90506 -0.46068 -0.00222 0.07761 2.31877
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.31362 3.14487 -3.280 0.001040 **
## zn -0.11500 0.05598 -2.054 0.039967 *
## indus 0.08571 0.03612 2.373 0.017661 *
## rm 1.11453 0.39859 2.796 0.005170 **
## dis -0.46396 0.13773 -3.369 0.000756 ***
## rad 0.49956 0.10772 4.638 3.52e-06 ***
## lstat 0.10893 0.04448 2.449 0.014334 *
## tax_over_500 -4.22409 2.08095 -2.030 0.042368 *
## zn_bool 2.15848 1.27342 1.695 0.090070 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 516.96 on 372 degrees of freedom
## Residual deviance: 214.97 on 364 degrees of freedom
## AIC: 232.97
##
## Number of Fisher Scoring iterations: 8
Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models. For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.
Confusion matrix and ROC for Model 1:
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 2.5 % 97.5 %
## (Intercept) -1.726347e+01 -2.028807e+00
## zn -1.939300e-01 2.833166e-02
## indus -1.525845e-03 1.604546e-01
## chas -1.493340e+00 1.028623e+00
## rm -8.127673e-01 2.583263e+00
## age -2.772739e-03 4.094168e-02
## dis -6.592151e-01 3.635677e-02
## rad 9.110371e-02 7.572547e-01
## tax -2.754584e-02 1.263122e-02
## ptratio -1.964902e-01 2.086986e-01
## lstat -1.377130e-03 1.955923e-01
## medv -2.828798e-01 2.191617e-01
## roomtax -3.345801e+00 7.619749e+00
## tax_over_500 -1.203487e+04 1.199430e+04
## zn_bool -1.366075e+00 4.199121e+00
## rad_bool -1.214348e+04 1.220652e+04
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 47 11
## 1 0 35
##
## Accuracy : 0.8817
## 95% CI : (0.7982, 0.9395)
## No Information Rate : 0.5054
## P-Value [Acc > NIR] : 1.516e-14
##
## Kappa : 0.7628
##
## Mcnemar's Test P-Value : 0.002569
##
## Sensitivity : 1.0000
## Specificity : 0.7609
## Pos Pred Value : 0.8103
## Neg Pred Value : 1.0000
## Prevalence : 0.5054
## Detection Rate : 0.5054
## Detection Prevalence : 0.6237
## Balanced Accuracy : 0.8804
##
## 'Positive' Class : 0
##
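The pROC messages above (control = 0, case = 1, controls < cases) mean the ROC curve treats high-crime neighborhoods as cases and expects them to receive higher predicted probabilities. The area under that curve equals the probability that a randomly chosen case is scored above a randomly chosen control. The sketch below computes it from ranks in base R; the helper `auc_rank` and the toy scores are hypothetical, and no AUC value from the actual models is implied.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
# `prob` holds predicted probabilities, `label` the observed 0/1 classes.
auc_rank <- function(prob, label) {
  n_case    <- sum(label == 1)
  n_control <- sum(label == 0)
  ranks <- rank(prob)  # ties receive average ranks
  (sum(ranks[label == 1]) - n_case * (n_case + 1) / 2) / (n_case * n_control)
}

# Toy example: four observations, two per class.
toy_prob  <- c(0.10, 0.40, 0.35, 0.80)
toy_label <- c(0, 0, 1, 1)

auc_value <- auc_rank(toy_prob, toy_label)
print(auc_value)
```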
Confusion matrix and ROC for Model 2:
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 43 10
## 1 4 36
##
## Accuracy : 0.8495
## 95% CI : (0.7603, 0.9152)
## No Information Rate : 0.5054
## P-Value [Acc > NIR] : 3.635e-12
##
## Kappa : 0.6985
##
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.9149
## Specificity : 0.7826
## Pos Pred Value : 0.8113
## Neg Pred Value : 0.9000
## Prevalence : 0.5054
## Detection Rate : 0.4624
## Detection Prevalence : 0.5699
## Balanced Accuracy : 0.8488
##
## 'Positive' Class : 0
##
Confusion matrix and ROC for Model 3:
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 46 10
## 1 1 36
##
## Accuracy : 0.8817
## 95% CI : (0.7982, 0.9395)
## No Information Rate : 0.5054
## P-Value [Acc > NIR] : 1.516e-14
##
## Kappa : 0.7629
##
## Mcnemar's Test P-Value : 0.01586
##
## Sensitivity : 0.9787
## Specificity : 0.7826
## Pos Pred Value : 0.8214
## Neg Pred Value : 0.9730
## Prevalence : 0.5054
## Detection Rate : 0.4946
## Detection Prevalence : 0.6022
## Balanced Accuracy : 0.8807
##
## 'Positive' Class : 0
##
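The statistics in these outputs can be reproduced by hand from the four cell counts. The sketch below does so for the Model 3 matrix, treating class 0 as the positive class as the output states; the variable names are illustrative.

```r
# Model 3 confusion matrix from the output above
# (rows = prediction, columns = reference).
cm <- matrix(c(46, 1, 10, 36), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# With class 0 as the positive class:
tp <- cm["0", "0"]  # predicted 0, actually 0
fn <- cm["1", "0"]  # predicted 1, actually 0
fp <- cm["0", "1"]  # predicted 0, actually 1
tn <- cm["1", "1"]  # predicted 1, actually 1

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)

metrics <- c(
  accuracy    = accuracy,
  error_rate  = 1 - accuracy,
  precision   = precision,
  sensitivity = sensitivity,
  specificity = specificity,
  f1          = 2 * precision * sensitivity / (precision + sensitivity)
)
print(round(metrics, 4))
```

These values agree with the accuracy, sensitivity, specificity and positive predictive value printed for Model 3.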
| model | predictors | F1 | deviance | Accuracy | Sensitivity | Specificity | Precision | r2 | AIC |
|---|---|---|---|---|---|---|---|---|---|
| Model 1 | 15 | 0.8952381 | 210.1724 | 0.8817204 | 1.0000 | 0.7609 | 0.8103 | 0.5934426 | 242.1724 |
| Model 2 | 4 | 0.8600000 | 278.1438 | 0.8494624 | 0.9149 | 0.7826 | 0.8113 | 0.5693541 | 288.1438 |
| Model 3 | 8 | 0.8932039 | 214.9690 | 0.8817204 | 0.9787 | 0.7826 | 0.8214 | 0.5841641 | 232.9690 |
Sensitivity, specificity and precision are taken from the confusion matrices above, where the positive class is 0 (precision is the positive predictive value).
Comparing the models, Model 2 has the lowest accuracy (0.849 against 0.882), the lowest R2 and the highest AIC, so we reject Model 2. Note that Model 2 was fit on all 466 rows, while Models 1 and 3 were fit on the 373-row training set, so its deviance and AIC are not directly comparable with theirs.
Model 1 and Model 3 have the same accuracy and very similar F1 scores, but Model 1 includes all 15 predictors, including the derived fields. Model 3, obtained by stepwise AIC selection, keeps 8 predictors and has the lower AIC (232.97 against 242.17). Its R2 is slightly lower (0.584 against 0.593), while its specificity and precision are slightly higher. So Model 3 is our choice.
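The r2 column in the comparison table is not defined in the text. Its values are consistent with McFadden’s pseudo-R2, that is, one minus the ratio of residual deviance to null deviance, which the sketch below recomputes from the deviances printed in the model summaries. The helper name `mcfadden_r2` is hypothetical.

```r
# Hypothetical helper: McFadden's pseudo-R2 from the two deviances
# printed in a binary logistic glm summary.
mcfadden_r2 <- function(residual_deviance, null_deviance) {
  1 - residual_deviance / null_deviance
}

# Deviances as reported in the three model summaries.
pseudo_r2 <- c(
  model1 = mcfadden_r2(210.17, 516.96),
  model2 = mcfadden_r2(278.14, 645.88),
  model3 = mcfadden_r2(214.97, 516.96)
)
print(round(pseudo_r2, 4))
```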
## zn indus chas rm age dis rad tax ptratio lstat medv predict
## 1 0 7.07 0 7.185 61.1 4.9671 2 242 17.8 4.03 34.7 0
## 2 0 8.14 0 6.096 84.5 4.4619 4 307 21.0 10.26 18.2 0
## 3 0 8.14 0 6.495 94.4 4.4547 4 307 21.0 12.80 18.4 0
## 4 0 8.14 0 5.950 82.0 3.9900 4 307 21.0 27.71 13.2 0
## 5 0 5.96 0 5.850 41.5 3.9342 5 279 19.2 8.77 21.0 0
## 6 25 5.13 0 5.741 66.2 7.2254 8 284 19.7 13.15 18.7 0
## 7 25 5.13 0 5.966 93.4 6.8185 8 284 19.7 14.44 16.0 0
## 8 0 4.49 0 6.630 56.1 4.4377 3 247 18.5 6.53 26.6 0
## 9 0 4.49 0 6.121 56.8 3.7476 3 247 18.5 8.44 22.2 0
## 10 0 2.89 0 6.163 69.6 3.4952 2 276 18.0 11.34 21.4 0
## 11 0 25.65 0 5.856 97.0 1.9444 2 188 19.1 25.41 17.3 1
## 12 0 25.65 0 5.613 95.6 1.7572 2 188 19.1 27.26 15.7 1
## 13 0 21.89 0 5.637 94.7 1.9799 4 437 21.2 18.34 14.3 1
## 14 0 19.58 0 6.101 93.0 2.2834 5 403 14.7 9.81 25.0 1
## 15 0 19.58 0 5.880 97.3 2.3887 5 403 14.7 12.03 19.1 1
## 16 0 10.59 1 5.960 92.1 3.8771 4 277 18.6 17.27 21.7 0
## 17 0 6.20 0 6.552 21.4 3.3751 8 307 17.4 3.76 31.5 0
## 18 0 6.20 0 8.247 70.4 3.6519 8 307 17.4 3.95 48.3 1
## 19 22 5.86 0 6.957 6.8 8.9067 7 330 19.1 3.53 29.6 0
## 20 90 2.97 0 7.088 20.8 7.3073 1 285 15.3 7.85 32.2 0
## 21 80 1.76 0 6.230 31.5 9.0892 1 241 18.2 12.93 20.1 0
## 22 33 2.18 0 6.616 58.1 3.3700 7 222 18.4 8.93 28.4 0
## 23 0 9.90 0 6.122 52.8 2.6403 4 304 18.4 5.98 22.1 0
## 24 0 7.38 0 6.415 40.1 4.7211 5 287 19.6 6.12 25.0 0
## 25 0 7.38 0 6.312 28.9 5.4159 5 287 19.6 6.15 23.0 0
## 26 0 5.19 0 5.895 59.6 5.6150 5 224 20.2 10.56 18.5 0
## 27 80 2.01 0 6.635 29.7 8.3440 4 280 17.0 5.99 24.5 0
## 28 0 18.10 0 3.561 87.9 1.6132 24 666 20.2 7.12 27.5 1
## 29 0 18.10 1 7.016 97.5 1.2024 24 666 20.2 2.96 50.0 1
## 30 0 18.10 0 6.348 86.1 2.0527 24 666 20.2 17.64 14.5 1
## 31 0 18.10 0 5.935 87.9 1.8206 24 666 20.2 34.02 8.4 1
## 32 0 18.10 0 5.627 93.9 1.8172 24 666 20.2 22.88 12.8 1
## 33 0 18.10 0 5.818 92.4 1.8662 24 666 20.2 22.11 10.5 1
## 34 0 18.10 0 6.219 100.0 2.0048 24 666 20.2 16.59 18.4 1
## 35 0 18.10 0 5.854 96.6 1.8956 24 666 20.2 23.79 10.8 1
## 36 0 18.10 0 6.525 86.5 2.4358 24 666 20.2 18.13 14.1 1
## 37 0 18.10 0 6.376 88.4 2.5671 24 666 20.2 14.65 17.7 1
## 38 0 18.10 0 6.209 65.4 2.9634 24 666 20.2 13.22 21.4 1
## 39 0 9.69 0 5.794 70.6 2.8927 6 391 19.2 14.10 18.3 0
## 40 0 11.93 0 6.976 91.0 2.1675 1 273 21.0 5.64 23.9 0
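The assignment asks for probabilities as well as classifications, while the table shows only the class. A sketch of how both could be produced from a fitted model follows. It is illustrative only: the model is fit to simulated data with two predictors, the 0.5 cutoff is an assumption, and the column name `probability` is hypothetical (`predict` mirrors the column above).

```r
set.seed(7)  # simulated data, for illustration only

# Simulated training data with two predictors named after real columns.
train <- data.frame(
  rad = sample(1:24, 120, replace = TRUE),
  dis = runif(120, min = 1, max = 10)
)
train$target <- rbinom(120, size = 1,
                       prob = plogis(-1 + 0.3 * train$rad - 0.4 * train$dis))

fit <- glm(target ~ rad + dis, family = binomial, data = train)

# A few evaluation rows (rad and dis values as in the table above).
evaluation <- data.frame(rad = c(2, 24, 5), dis = c(4.9671, 1.6132, 3.9342))

# type = "response" returns probabilities rather than log-odds.
evaluation$probability <- predict(fit, newdata = evaluation, type = "response")

# ASSUMPTION: classify as high crime when the probability exceeds 0.5.
evaluation$predict <- as.integer(evaluation$probability > 0.5)
print(evaluation)
```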