Assignment 3: Crime

In this assignment, we explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

1. Data Exploration

We will first load the packages needed for this assignment:

Next, we will load the data.

Then we’ll take a look at the dataset.

## [1] 466  13
##    zn indus chas   nox    rm   age    dis rad tax ptratio lstat medv target
## 1   0 19.58    0 0.605 7.929  96.2 2.0459   5 403    14.7  3.70 50.0      1
## 2   0 19.58    1 0.871 5.403 100.0 1.3216   5 403    14.7 26.82 13.4      1
## 3   0 18.10    0 0.740 6.485 100.0 1.9784  24 666    20.2 18.85 15.4      1
## 4  30  4.93    0 0.428 6.393   7.8 7.0355   6 300    16.6  5.19 23.7      0
## 5   0  2.46    0 0.488 7.155  92.2 2.7006   3 193    17.8  4.82 37.9      0
## 6   0  8.56    0 0.520 6.781  71.3 2.8561   5 384    20.9  7.67 26.5      0
## 7   0 18.10    0 0.693 5.453 100.0 1.4896  24 666    20.2 30.59  5.0      1
## 8   0 18.10    0 0.693 4.519 100.0 1.6582  24 666    20.2 36.98  7.0      1
## 9   0  5.19    0 0.515 6.316  38.1 6.4584   5 224    20.2  5.68 22.2      0
## 10 80  3.64    0 0.392 5.876  19.1 9.2203   1 315    16.4  9.25 20.9      0

Next, we’ll do a summary of the data to review.

##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv      
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20  
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00  
##      target      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4914  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Next, we’ll take a preview of the data.

## Rows: 466
## Columns: 13
## $ zn      <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100, 20, 0…
## $ indus   <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.19, 3.6…
## $ chas    <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.515,…
## $ rm      <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316,…
## $ age     <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1, 19.1,…
## $ dis     <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, 1.6582…
## $ rad     <int> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5, 24, …
## $ tax     <int> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 398, 66…
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16.4, 19…
## $ lstat   <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68, 9.25…
## $ medv    <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9, 24.8…
## $ target  <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…

Next, we’ll use skim to get a final summary.

Data summary
Name crime_training
Number of rows 466
Number of columns 13
_______________________
Column type frequency:
numeric 13
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate    mean      sd      p0     p25     p50     p75    p100  hist
zn                    0             1   11.58   23.36    0.00    0.00    0.00   16.25  100.00  ▇▁▁▁▁
indus                 0             1   11.11    6.85    0.46    5.15    9.69   18.10   27.74  ▇▆▁▇▁
chas                  0             1    0.07    0.26    0.00    0.00    0.00    0.00    1.00  ▇▁▁▁▁
nox                   0             1    0.55    0.12    0.39    0.45    0.54    0.62    0.87  ▇▇▅▃▁
rm                    0             1    6.29    0.70    3.86    5.89    6.21    6.63    8.78  ▁▂▇▂▁
age                   0             1   68.37   28.32    2.90   43.88   77.15   94.10  100.00  ▂▂▂▃▇
dis                   0             1    3.80    2.11    1.13    2.10    3.19    5.21   12.13  ▇▅▂▁▁
rad                   0             1    9.53    8.69    1.00    4.00    5.00   24.00   24.00  ▇▂▁▁▃
tax                   0             1  409.50  167.90  187.00  281.00  334.50  666.00  711.00  ▇▇▅▁▇
ptratio               0             1   18.40    2.20   12.60   16.90   18.90   20.20   22.00  ▁▃▅▅▇
lstat                 0             1   12.63    7.10    1.73    7.04   11.35   16.93   37.97  ▇▇▅▂▁
medv                  0             1   22.59    9.24    5.00   17.02   21.20   25.00   50.00  ▂▇▅▁▁
target                0             1    0.49    0.50    0.00    0.00    0.00    1.00    1.00  ▇▁▁▁▇

The skim output confirms that we’re not missing any data: n_missing is 0 and complete_rate is 1 for every variable. It also reports the mean, standard deviation, and quantiles for each variable.

Now, we’ll explore the data with the DataExplorer package.

##   rows columns discrete_columns continuous_columns all_missing_columns
## 1  466      13                0                 13                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                    0           466               6058        44440

The DataExplorer report gives us a view of the data structure and the univariate distribution of each variable, along with QQ plots, a correlation analysis, and a PCA.

2. Data Preparation

Next, we use the exploration above to decide how to process our data. We already know there are no missing values, so we don’t need to account for that. The correlation matrix does show something interesting: several of our features have correlations stronger than 0.75.
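A minimal sketch of how such pairs can be listed, using base R only. The data frame `toy` and its columns are synthetic stand-ins (the real training file is not shown here), and the 0.75 cutoff is taken from the text above:

```r
# Synthetic stand-in for the training data (hypothetical columns a, b, c).
set.seed(42)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.3),  # built to correlate strongly with a
  c = rnorm(n)                 # independent noise
)

# Pairwise Pearson correlations between all numeric columns.
cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the cutoff.
cutoff <- 0.75
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

Using the absolute value means strongly negative correlations are flagged as well as strongly positive ones.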

Let’s start by addressing those. We see in the chart above that ‘nox’ is highly correlated with both ‘dis’ and ‘tax’. Reviewing the data dictionary, however, ‘nox’ seems entirely unrelated to the rest of the data set:

nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)

As we discussed in the reading, these high correlations could be a false signal, so we will go ahead and remove the ‘nox’ variable.

This should improve our correlation matrices as well as the general understandability of the model:

We still see high correlation between rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000), but I’m not sure if this is a true correlation. Looking at the QQ plots, both seem to follow a stepped distribution. For now, we can leave both in ahead of model development.

Our chas feature has very, very low correlation to all the other features, likely because it’s so imbalanced:

## # A tibble: 2 × 2
##    chas     n
##   <int> <int>
## 1     0   433
## 2     1    33

As a result, it seems unlikely to influence the model, but we can leave it in, as it also seems unlikely to do harm.

We also checked the balance of our target and it’s fairly even, so we needn’t up/down sample.

## # A tibble: 2 × 2
##   target     n
##    <int> <int>
## 1      0   237
## 2      1   229

Finally, we can create some new features using transformations. Specifically, we want to try adding a feature that combines existing predictors that have lower correlations. We create ‘tax/room’, which appears in the model output below as roomtax.

We can also bucket variables that show a clear cut into binary features aligned with where the data splits. The tax histogram has a clear cutoff around 500, the zn histogram splits sharply between 0 and values above 0, and rad cuts off around 10. These booleans (tax_over_500, zn_bool, and rad_bool in the model output below) provide more options for model development.
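The derived features can be sketched as follows, using the first ten training rows printed earlier. The column names match the model output, but the exact definitions are assumptions: roomtax is taken to be tax divided by rm (from ‘tax/room’), and the cutoffs are read as tax > 500, zn > 0, and rad > 10.

```r
# First ten rows of the training data, copied from the preview above
# (only the columns needed for the derived features).
head10 <- data.frame(
  zn  = c(0, 0, 0, 30, 0, 0, 0, 0, 0, 80),
  rm  = c(7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316, 5.876),
  rad = c(5, 5, 24, 6, 3, 5, 24, 24, 5, 1),
  tax = c(403, 403, 666, 300, 193, 384, 666, 666, 224, 315)
)

# Assumed definitions; the original preprocessing code is not shown.
head10$roomtax      <- head10$tax / head10$rm        # 'tax/room' ratio
head10$tax_over_500 <- as.integer(head10$tax > 500)  # 1 if tax above 500
head10$zn_bool      <- as.integer(head10$zn > 0)     # 1 if any zoned land
head10$rad_bool     <- as.integer(head10$rad > 10)   # 1 if rad above 10

print(head10)
```

In these ten rows tax_over_500 and rad_bool take identical values, so it is worth checking how much independent information each one carries in the full data.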

3. Build Models

Now that we have had a good look at the data and cleaned it, let’s build some models. The first is a general model with all variables. The second keeps the predictors with the best p-values and tries some transformations suggested by the histogram transformation plots above. The third uses AIC for feature selection. Before we begin, we split the training data into train and test subsets so we can evaluate the models ourselves before scoring the evaluation data set. The training subset (training_set) has 373 rows and the held-out test subset has the remaining 93.
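A minimal sketch of such a split in base R, assuming a simple random 80/20 partition. The original splitting code is not shown, so the seed, the method, and the name `test_set` are assumptions; `toy` is a synthetic stand-in with the same number of rows as the training data.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Synthetic stand-in with 466 rows, like the training data.
n_total <- 466
toy <- data.frame(id = seq_len(n_total),
                  target = rbinom(n_total, 1, 0.5))

# Draw 80% of the row indices at random for training; the rest is held out.
train_idx    <- sample(seq_len(n_total), size = round(0.8 * n_total))
training_set <- toy[train_idx, ]
test_set     <- toy[-train_idx, ]

c(train = nrow(training_set), test = nrow(test_set))
```

With 466 rows, an 80/20 split gives 373 training rows and 93 test rows. A stratified split on target would keep the class balance the same in both subsets; plain random sampling only does so approximately.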

## 
## Call:
## glm(formula = target ~ ., family = binomial, data = training_set)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.60584  -0.44017  -0.00611   0.00012   2.34924  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -9.646e+00  3.886e+00  -2.482   0.0131 *
## zn           -8.280e-02  5.670e-02  -1.460   0.1442  
## indus         7.946e-02  4.132e-02   1.923   0.0545 .
## chas         -2.324e-01  6.434e-01  -0.361   0.7180  
## rm            8.852e-01  8.663e-01   1.022   0.3069  
## age           1.908e-02  1.115e-02   1.711   0.0870 .
## dis          -3.114e-01  1.774e-01  -1.755   0.0792 .
## rad           4.242e-01  1.699e-01   2.496   0.0126 *
## tax          -7.457e-03  1.025e-02  -0.728   0.4669  
## ptratio       6.104e-03  1.034e-01   0.059   0.9529  
## lstat         9.711e-02  5.025e-02   1.933   0.0533 .
## medv         -3.186e-02  1.281e-01  -0.249   0.8036  
## roomtax       2.137e+00  2.797e+00   0.764   0.4449  
## tax_over_500 -2.029e+01  6.130e+03  -0.003   0.9974  
## zn_bool       1.417e+00  1.420e+00   0.998   0.3184  
## rad_bool      3.152e+01  6.212e+03   0.005   0.9960  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.96  on 372  degrees of freedom
## Residual deviance: 210.17  on 357  degrees of freedom
## AIC: 242.17
## 
## Number of Fisher Scoring iterations: 18
##           zn        indus         chas           rm          age          dis 
##    10.108999     2.015331     1.146665    10.963475     2.220217     2.435420 
##          rad          tax      ptratio        lstat         medv      roomtax 
##     1.849561    17.777649     2.053556     2.782750    35.578614    18.233430 
## tax_over_500      zn_bool     rad_bool 
##    38.201709    12.182433    38.201720
## 
## Call:
## glm(formula = target ~ zn + indus + age + dis + rad + lstat + 
##     sqrt(medv), family = binomial, data = crime_training_processed)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.74763  -0.42517  -0.02679   0.01360   2.61892  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.738744   2.332749  -3.746 0.000180 ***
## zn          -0.039128   0.018246  -2.144 0.031999 *  
## indus        0.037217   0.030252   1.230 0.218611    
## age          0.034843   0.008971   3.884 0.000103 ***
## dis         -0.210012   0.137918  -1.523 0.127826    
## rad          0.498795   0.113041   4.412 1.02e-05 ***
## lstat        0.053315   0.039088   1.364 0.172578    
## sqrt(medv)   0.683501   0.302379   2.260 0.023796 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 272.43  on 458  degrees of freedom
## AIC: 288.43
## 
## Number of Fisher Scoring iterations: 8
##         zn      indus        age        dis        rad      lstat sqrt(medv) 
##   1.405958   1.849807   1.639744   1.929138   1.100535   2.357435   2.669292

With the change in data (this model is fit on crime_training_processed), neither indus (p = 0.219) nor lstat (p = 0.173) is significant, so we drop both.

## 
## Call:
## glm(formula = target ~ zn + age + dis + rad + sqrt(medv), family = binomial, 
##     data = crime_training_processed)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.78314  -0.42893  -0.02715   0.01550   2.69725  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.754153   1.653276  -3.480 0.000501 ***
## zn          -0.037930   0.018083  -2.098 0.035948 *  
## age          0.038886   0.008821   4.408 1.04e-05 ***
## dis         -0.295646   0.125955  -2.347 0.018913 *  
## rad          0.485413   0.110208   4.405 1.06e-05 ***
## sqrt(medv)   0.288344   0.205799   1.401 0.161185    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 276.14  on 460  degrees of freedom
## AIC: 288.14
## 
## Number of Fisher Scoring iterations: 8
##         zn        age        dis        rad sqrt(medv) 
##   1.368793   1.537031   1.573063   1.081409   1.297983

Now sqrt(medv) is no longer significant (p = 0.161), despite having a lower p-value (0.024) in the previous model, so we drop it as well.

## 
## Call:
## glm(formula = target ~ zn + age + dis + rad, family = binomial, 
##     data = crime_training_processed)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.81929  -0.43664  -0.03360   0.01256   2.82478  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.027624   1.056269  -3.813 0.000137 ***
## zn          -0.029838   0.016287  -1.832 0.066938 .  
## age          0.035154   0.008244   4.264 2.01e-05 ***
## dis         -0.345765   0.120894  -2.860 0.004235 ** 
## rad          0.499628   0.108919   4.587 4.49e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 278.14  on 461  degrees of freedom
## AIC: 288.14
## 
## Number of Fisher Scoring iterations: 8
##       zn      age      dis      rad 
## 1.217808 1.377139 1.467963 1.072801

Now, let’s build an AIC model. AIC stands for Akaike Information Criterion. It is an estimator of in-sample prediction error and plays a role similar to the adjusted R-squared we see in linear regression summaries: it rewards goodness of fit but penalizes every added parameter. Stepwise selection drops one feature at a time for as long as doing so lowers the AIC. A lower AIC indicates a better trade-off between fit and parsimony relative to a model with a higher AIC fit to the same data, so it gives an indication of the relative quality of statistical models for a given data set.
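As a worked check of the penalty, for a logistic regression on a 0/1 response the residual deviance equals -2 times the log-likelihood, so AIC = residual deviance + 2k, where k counts the estimated coefficients including the intercept. The sketch below applies this to two of the summaries above; the helper name `aic_from_deviance` is hypothetical.

```r
# AIC for a binary-response GLM from its residual deviance and the number
# of estimated coefficients (intercept included).
aic_from_deviance <- function(residual_deviance, n_coefficients) {
  residual_deviance + 2 * n_coefficients
}

# Full model: residual deviance 210.17, 15 predictors + intercept = 16.
aic_model1 <- aic_from_deviance(210.17, 16)

# Final p-value model (zn, age, dis, rad): residual deviance 278.14,
# 4 predictors + intercept = 5.
aic_model2 <- aic_from_deviance(278.14, 5)

c(model1 = aic_model1, model2 = aic_model2)
```

Both results agree with the AIC values printed in the summaries (242.17 and 288.14).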

## Start:  AIC=242.17
## target ~ zn + indus + chas + rm + age + dis + rad + tax + ptratio + 
##     lstat + medv + roomtax + tax_over_500 + zn_bool + rad_bool
## 
##                Df Deviance    AIC
## - ptratio       1   210.18 240.18
## - medv          1   210.24 240.24
## - chas          1   210.30 240.30
## - tax           1   210.71 240.71
## - roomtax       1   210.78 240.78
## - rad_bool      1   211.08 241.08
## - rm            1   211.24 241.24
## - zn_bool       1   211.33 241.33
## <none>              210.17 242.17
## - age           1   213.16 243.16
## - dis           1   213.43 243.43
## - tax_over_500  1   213.57 243.57
## - lstat         1   213.82 243.82
## - zn            1   214.00 244.00
## - indus         1   214.02 244.02
## - rad           1   217.27 247.27
## 
## Step:  AIC=240.18
## target ~ zn + indus + chas + rm + age + dis + rad + tax + lstat + 
##     medv + roomtax + tax_over_500 + zn_bool + rad_bool
## 
##                Df Deviance    AIC
## - medv          1   210.24 238.24
## - chas          1   210.31 238.31
## - tax           1   210.71 238.71
## - roomtax       1   210.78 238.78
## - rad_bool      1   211.10 239.10
## - rm            1   211.27 239.27
## - zn_bool       1   211.52 239.52
## <none>              210.18 240.18
## - age           1   213.19 241.19
## - tax_over_500  1   213.57 241.57
## - dis           1   213.59 241.59
## - lstat         1   213.83 241.83
## - indus         1   214.03 242.03
## - zn            1   214.07 242.07
## - rad           1   217.38 245.38
## 
## Step:  AIC=238.24
## target ~ zn + indus + chas + rm + age + dis + rad + tax + lstat + 
##     roomtax + tax_over_500 + zn_bool + rad_bool
## 
##                Df Deviance    AIC
## - chas          1   210.38 236.38
## - rad_bool      1   211.12 237.12
## - zn_bool       1   211.54 237.54
## - tax           1   211.56 237.56
## - roomtax       1   211.94 237.94
## <none>              210.24 238.24
## - rm            1   212.57 238.57
## - age           1   213.42 239.42
## - dis           1   213.64 239.64
## - lstat         1   213.85 239.85
## - zn            1   214.12 240.12
## - indus         1   214.30 240.30
## - tax_over_500  1   214.56 240.56
## - rad           1   217.83 243.83
## 
## Step:  AIC=236.38
## target ~ zn + indus + rm + age + dis + rad + tax + lstat + roomtax + 
##     tax_over_500 + zn_bool + rad_bool
## 
##                Df Deviance    AIC
## - rad_bool      1   211.36 235.36
## - tax           1   211.56 235.56
## - zn_bool       1   211.78 235.78
## - roomtax       1   212.02 236.02
## <none>              210.38 236.38
## - rm            1   212.69 236.69
## - age           1   213.52 237.52
## - dis           1   213.82 237.82
## - lstat         1   213.87 237.87
## - indus         1   214.30 238.30
## - zn            1   214.35 238.35
## - tax_over_500  1   214.95 238.95
## - rad           1   217.91 241.91
## 
## Step:  AIC=235.36
## target ~ zn + indus + rm + age + dis + rad + tax + lstat + roomtax + 
##     tax_over_500 + zn_bool
## 
##                Df Deviance    AIC
## - roomtax       1   212.71 234.71
## - zn_bool       1   212.83 234.83
## - tax           1   212.84 234.84
## <none>              211.36 235.36
## - rm            1   213.83 235.83
## - age           1   214.34 236.34
## - lstat         1   214.82 236.82
## - tax_over_500  1   214.97 236.97
## - zn            1   215.36 237.36
## - dis           1   215.40 237.40
## - indus         1   215.92 237.92
## - rad           1   248.59 270.59
## 
## Step:  AIC=234.71
## target ~ zn + indus + rm + age + dis + rad + tax + lstat + tax_over_500 + 
##     zn_bool
## 
##                Df Deviance    AIC
## - tax           1   213.09 233.09
## - age           1   214.70 234.70
## <none>              212.71 234.71
## - zn_bool       1   214.90 234.90
## - lstat         1   215.58 235.58
## - tax_over_500  1   216.62 236.62
## - indus         1   217.52 237.52
## - zn            1   217.59 237.59
## - rm            1   217.90 237.90
## - dis           1   219.28 239.28
## - rad           1   252.38 272.38
## 
## Step:  AIC=233.09
## target ~ zn + indus + rm + age + dis + rad + lstat + tax_over_500 + 
##     zn_bool
## 
##                Df Deviance    AIC
## - age           1   214.97 232.97
## <none>              213.09 233.09
## - zn_bool       1   215.32 233.32
## - lstat         1   215.96 233.96
## - indus         1   217.57 235.57
## - zn            1   218.03 236.03
## - rm            1   218.74 236.74
## - dis           1   219.40 237.40
## - tax_over_500  1   220.27 238.27
## - rad           1   254.35 272.35
## 
## Step:  AIC=232.97
## target ~ zn + indus + rm + dis + rad + lstat + tax_over_500 + 
##     zn_bool
## 
##                Df Deviance    AIC
## <none>              214.97 232.97
## - zn_bool       1   218.20 234.20
## - indus         1   220.92 236.92
## - lstat         1   221.02 237.02
## - zn            1   221.78 237.78
## - tax_over_500  1   222.49 238.49
## - rm            1   223.59 239.59
## - dis           1   227.54 243.54
## - rad           1   256.36 272.36
## 
## Call:
## glm(formula = target ~ zn + indus + rm + dis + rad + lstat + 
##     tax_over_500 + zn_bool, family = binomial, data = training_set)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.90506  -0.46068  -0.00222   0.07761   2.31877  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -10.31362    3.14487  -3.280 0.001040 ** 
## zn            -0.11500    0.05598  -2.054 0.039967 *  
## indus          0.08571    0.03612   2.373 0.017661 *  
## rm             1.11453    0.39859   2.796 0.005170 ** 
## dis           -0.46396    0.13773  -3.369 0.000756 ***
## rad            0.49956    0.10772   4.638 3.52e-06 ***
## lstat          0.10893    0.04448   2.449 0.014334 *  
## tax_over_500  -4.22409    2.08095  -2.030 0.042368 *  
## zn_bool        2.15848    1.27342   1.695 0.090070 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 516.96  on 372  degrees of freedom
## Residual deviance: 214.97  on 364  degrees of freedom
## AIC: 232.97
## 
## Number of Fisher Scoring iterations: 8
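The trace above has the shape produced by backward stepwise selection with `step()` from the stats package. The original call is not shown, so the following is only a sketch under that assumption, run on synthetic data with hypothetical predictors x1 to x4 (only x1 and x2 actually drive the response):

```r
set.seed(7)
n <- 300

# Synthetic predictors; x3 and x4 are pure noise.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$target <- rbinom(n, 1, plogis(0.8 * toy$x1 - 0.5 * toy$x2))

# Start from the model with every predictor.
full_model <- glm(target ~ ., family = binomial, data = toy)

# Backward elimination: at each step drop the term whose removal lowers
# AIC the most, and stop when no removal lowers it further.
# trace = 0 suppresses the step-by-step table.
step_model <- step(full_model, direction = "backward", trace = 0)

formula(step_model)
c(full = AIC(full_model), stepwise = AIC(step_model))
```

Because a term is only dropped when that lowers the AIC, the selected model never has a higher AIC than the starting model.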

4. Select Models

Decide on the criteria for selecting the best binary logistic regression model. Will you select models with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your models. For the binary logistic regression model, will you use a metric such as log likelihood, AIC, ROC curve, etc.? Using the training data set, evaluate the binary logistic regression model based on (a) accuracy, (b) classification error rate, (c) precision, (d) sensitivity, (e) specificity, (f) F1 score, (g) AUC, and (h) confusion matrix. Make predictions using the evaluation data set.
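For the AUC in item (g), one way to see what the number means is to compute it directly: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below uses the rank-based (Mann-Whitney) form in base R; the function name `auc_rank` and the toy labels and scores are hypothetical, and this is not necessarily how the ROC curves in this report were produced.

```r
# AUC from the rank-sum of the positive cases.
# labels: 0/1 vector, scores: predicted probabilities for class 1.
auc_rank <- function(labels, scores) {
  r     <- rank(scores)          # average ranks, so ties count as half
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: one negative case outranks one positive case.
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)
auc_rank(labels, scores)
```

Of the four positive-negative pairs in the toy example, three are ordered correctly, so the AUC is 0.75.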

Confusion matrix and ROC for Model 1:

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

##                      2.5 %        97.5 %
## (Intercept)  -1.726347e+01 -2.028807e+00
## zn           -1.939300e-01  2.833166e-02
## indus        -1.525845e-03  1.604546e-01
## chas         -1.493340e+00  1.028623e+00
## rm           -8.127673e-01  2.583263e+00
## age          -2.772739e-03  4.094168e-02
## dis          -6.592151e-01  3.635677e-02
## rad           9.110371e-02  7.572547e-01
## tax          -2.754584e-02  1.263122e-02
## ptratio      -1.964902e-01  2.086986e-01
## lstat        -1.377130e-03  1.955923e-01
## medv         -2.828798e-01  2.191617e-01
## roomtax      -3.345801e+00  7.619749e+00
## tax_over_500 -1.203487e+04  1.199430e+04
## zn_bool      -1.366075e+00  4.199121e+00
## rad_bool     -1.214348e+04  1.220652e+04
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 47 11
##          1  0 35
##                                           
##                Accuracy : 0.8817          
##                  95% CI : (0.7982, 0.9395)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : 1.516e-14       
##                                           
##                   Kappa : 0.7628          
##                                           
##  Mcnemar's Test P-Value : 0.002569        
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.7609          
##          Pos Pred Value : 0.8103          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5054          
##          Detection Rate : 0.5054          
##    Detection Prevalence : 0.6237          
##       Balanced Accuracy : 0.8804          
##                                           
##        'Positive' Class : 0               
## 

Confusion matrix and ROC for Model 2:

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43 10
##          1  4 36
##                                           
##                Accuracy : 0.8495          
##                  95% CI : (0.7603, 0.9152)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : 3.635e-12       
##                                           
##                   Kappa : 0.6985          
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.9149          
##             Specificity : 0.7826          
##          Pos Pred Value : 0.8113          
##          Neg Pred Value : 0.9000          
##              Prevalence : 0.5054          
##          Detection Rate : 0.4624          
##    Detection Prevalence : 0.5699          
##       Balanced Accuracy : 0.8488          
##                                           
##        'Positive' Class : 0               
## 

Confusion matrix and ROC for Model 3:

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 46 10
##          1  1 36
##                                           
##                Accuracy : 0.8817          
##                  95% CI : (0.7982, 0.9395)
##     No Information Rate : 0.5054          
##     P-Value [Acc > NIR] : 1.516e-14       
##                                           
##                   Kappa : 0.7629          
##                                           
##  Mcnemar's Test P-Value : 0.01586         
##                                           
##             Sensitivity : 0.9787          
##             Specificity : 0.7826          
##          Pos Pred Value : 0.8214          
##          Neg Pred Value : 0.9730          
##              Prevalence : 0.5054          
##          Detection Rate : 0.4946          
##    Detection Prevalence : 0.6022          
##       Balanced Accuracy : 0.8807          
##                                           
##        'Positive' Class : 0               
## 
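The statistics in these outputs can be reproduced by hand from the four cells of a confusion matrix. The sketch below does so for Model 3, treating class 0 as the positive class, as the outputs above do; the helper name `classification_metrics` is hypothetical.

```r
# Metrics from confusion-matrix counts.
# tp: positive predicted positive, fn: positive predicted negative,
# fp: negative predicted positive, tn: negative predicted negative.
classification_metrics <- function(tp, fn, fp, tn) {
  accuracy    <- (tp + tn) / (tp + fn + fp + tn)
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  precision   <- tp / (tp + fp)
  f1          <- 2 * precision * sensitivity / (precision + sensitivity)
  c(accuracy = accuracy, error_rate = 1 - accuracy,
    precision = precision, sensitivity = sensitivity,
    specificity = specificity, f1 = f1)
}

# Model 3, positive class = 0:
# predicted 0 / actual 0 = 46, predicted 1 / actual 0 = 1,
# predicted 0 / actual 1 = 10, predicted 1 / actual 1 = 36.
m3 <- classification_metrics(tp = 46, fn = 1, fp = 10, tn = 36)
round(m3, 4)
```

The values agree with the Model 3 output: accuracy 0.8817, sensitivity 0.9787, specificity 0.7826, and precision (Pos Pred Value) 0.8214.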

Model Comparison

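model    predictors  F1         deviance  Accuracy   Sensitivity  Specificity  Precision  r2         AIC
Model 1  15          0.8952381  210.1724  0.8817204  1.0000000    0.7608696    0.8103448  0.5934426  242.1724
Model 2  4           0.8600000  278.1438  0.8494624  0.9148936    0.7826087    0.8113208  0.5693541  288.1438
Model 3  8           0.8932039  214.9690  0.8817204  0.9787234    0.7826087    0.8214286  0.5841641  232.9690

Sensitivity, Specificity, and Precision are taken from the confusion matrices above, where the positive class is 0. The r2 column is 1 - residual deviance / null deviance.

When we compare the models, we see that Model 2 has the lowest Accuracy, F1, and r2, and the highest deviance and AIC, so we reject Model 2. Its Specificity (0.7826) matches that of Model 3, and its deviance and AIC come from a fit on all 466 rows, so they are not directly comparable with those of Models 1 and 3, which were fit on the 373-row training_set.

Model 1 and Model 3 have the same Accuracy and nearly the same F1, but Model 1 includes all 15 predictors, including the derived fields. Model 3, the result of stepwise AIC selection, keeps 8 predictors and has the lower AIC (232.97 versus 242.17). Its r2 is slightly lower (0.5842 versus 0.5934), while its Specificity is slightly higher (0.7826 versus 0.7609). Because it performs as well with fewer predictors, Model 3 is our choice.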

Prediction using evaluation dataset

##    zn indus chas    rm   age    dis rad tax ptratio lstat medv predict
## 1   0  7.07    0 7.185  61.1 4.9671   2 242    17.8  4.03 34.7       0
## 2   0  8.14    0 6.096  84.5 4.4619   4 307    21.0 10.26 18.2       0
## 3   0  8.14    0 6.495  94.4 4.4547   4 307    21.0 12.80 18.4       0
## 4   0  8.14    0 5.950  82.0 3.9900   4 307    21.0 27.71 13.2       0
## 5   0  5.96    0 5.850  41.5 3.9342   5 279    19.2  8.77 21.0       0
## 6  25  5.13    0 5.741  66.2 7.2254   8 284    19.7 13.15 18.7       0
## 7  25  5.13    0 5.966  93.4 6.8185   8 284    19.7 14.44 16.0       0
## 8   0  4.49    0 6.630  56.1 4.4377   3 247    18.5  6.53 26.6       0
## 9   0  4.49    0 6.121  56.8 3.7476   3 247    18.5  8.44 22.2       0
## 10  0  2.89    0 6.163  69.6 3.4952   2 276    18.0 11.34 21.4       0
## 11  0 25.65    0 5.856  97.0 1.9444   2 188    19.1 25.41 17.3       1
## 12  0 25.65    0 5.613  95.6 1.7572   2 188    19.1 27.26 15.7       1
## 13  0 21.89    0 5.637  94.7 1.9799   4 437    21.2 18.34 14.3       1
## 14  0 19.58    0 6.101  93.0 2.2834   5 403    14.7  9.81 25.0       1
## 15  0 19.58    0 5.880  97.3 2.3887   5 403    14.7 12.03 19.1       1
## 16  0 10.59    1 5.960  92.1 3.8771   4 277    18.6 17.27 21.7       0
## 17  0  6.20    0 6.552  21.4 3.3751   8 307    17.4  3.76 31.5       0
## 18  0  6.20    0 8.247  70.4 3.6519   8 307    17.4  3.95 48.3       1
## 19 22  5.86    0 6.957   6.8 8.9067   7 330    19.1  3.53 29.6       0
## 20 90  2.97    0 7.088  20.8 7.3073   1 285    15.3  7.85 32.2       0
## 21 80  1.76    0 6.230  31.5 9.0892   1 241    18.2 12.93 20.1       0
## 22 33  2.18    0 6.616  58.1 3.3700   7 222    18.4  8.93 28.4       0
## 23  0  9.90    0 6.122  52.8 2.6403   4 304    18.4  5.98 22.1       0
## 24  0  7.38    0 6.415  40.1 4.7211   5 287    19.6  6.12 25.0       0
## 25  0  7.38    0 6.312  28.9 5.4159   5 287    19.6  6.15 23.0       0
## 26  0  5.19    0 5.895  59.6 5.6150   5 224    20.2 10.56 18.5       0
## 27 80  2.01    0 6.635  29.7 8.3440   4 280    17.0  5.99 24.5       0
## 28  0 18.10    0 3.561  87.9 1.6132  24 666    20.2  7.12 27.5       1
## 29  0 18.10    1 7.016  97.5 1.2024  24 666    20.2  2.96 50.0       1
## 30  0 18.10    0 6.348  86.1 2.0527  24 666    20.2 17.64 14.5       1
## 31  0 18.10    0 5.935  87.9 1.8206  24 666    20.2 34.02  8.4       1
## 32  0 18.10    0 5.627  93.9 1.8172  24 666    20.2 22.88 12.8       1
## 33  0 18.10    0 5.818  92.4 1.8662  24 666    20.2 22.11 10.5       1
## 34  0 18.10    0 6.219 100.0 2.0048  24 666    20.2 16.59 18.4       1
## 35  0 18.10    0 5.854  96.6 1.8956  24 666    20.2 23.79 10.8       1
## 36  0 18.10    0 6.525  86.5 2.4358  24 666    20.2 18.13 14.1       1
## 37  0 18.10    0 6.376  88.4 2.5671  24 666    20.2 14.65 17.7       1
## 38  0 18.10    0 6.209  65.4 2.9634  24 666    20.2 13.22 21.4       1
## 39  0  9.69    0 5.794  70.6 2.8927   6 391    19.2 14.10 18.3       0
## 40  0 11.93    0 6.976  91.0 2.1675   1 273    21.0  5.64 23.9       0
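The assignment asks for both probabilities and classifications. A minimal sketch of that last step follows, assuming a 0.5 probability cutoff (the cutoff actually used is not shown). The model here is fit on synthetic data with only rad and dis, so it is a stand-in rather than Model 3 itself; the three evaluation rows reuse rad and dis values from the table above.

```r
set.seed(99)
n <- 200

# Synthetic training data (hypothetical relationship between rad, dis, target).
train <- data.frame(rad = sample(1:24, n, replace = TRUE),
                    dis = runif(n, 1, 12))
train$target <- rbinom(n, 1, plogis(0.3 * train$rad - 0.5 * train$dis))

model <- glm(target ~ rad + dis, family = binomial, data = train)

# Three rows shaped like the evaluation data.
evaluation <- data.frame(rad = c(2, 24, 5),
                         dis = c(4.9671, 1.6132, 3.9342))

# type = "response" returns P(target = 1); the default would return log-odds.
evaluation$probability <- predict(model, newdata = evaluation, type = "response")

# Classify as 1 when the predicted probability exceeds the assumed 0.5 cutoff.
evaluation$predict <- ifelse(evaluation$probability > 0.5, 1, 0)

print(evaluation)
```

Any derived features used by the model, such as tax_over_500 or zn_bool, must be created on the evaluation data in the same way as on the training data before calling predict.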