Overview

In this homework assignment, we will explore, analyze and model a data set containing 466 records. Each record contains 14 variables (13 predictors, 1 response variable) that represent characteristics of a neighborhood. The data includes fields related to zoning, demographics, education, property values, etc.

Data Exploration

Load and Inspect the Data

The first objective is to load, inspect and evaluate the data. We want to inspect a small sample to make sure that the data has loaded correctly and that it looks coherent.
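
A minimal sketch of the loading step (the file name below is a placeholder; the actual name isn't given in the text):

data <- read.csv("crime-training-data.csv")  # placeholder file name
head(data, 5)                                # produces the sample shown below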

zn indus chas nox rm age dis rad tax ptratio lstat medv target
0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 3.70 50.0 1
0 19.58 1 0.871 5.403 100.0 1.3216 5 403 14.7 26.82 13.4 1
0 18.10 0 0.740 6.485 100.0 1.9784 24 666 20.2 18.85 15.4 1
30 4.93 0 0.428 6.393 7.8 7.0355 6 300 16.6 5.19 23.7 0
0 2.46 0 0.488 7.155 92.2 2.7006 3 193 17.8 4.82 37.9 0

The data looks coherent, so we can move on to summary statistics.

Summary Statistics

Next we will compute some basic summary statistics using describe() from the psych package.
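
Assuming the data frame is named data, the call is simply:

library(psych)   # describe() lives here
describe(data)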

vars n mean sd median trimmed
zn 1 466 11.5772532 23.3646511 0.00000 5.3542781
indus 2 466 11.1050215 6.8458549 9.69000 10.9082353
chas 3 466 0.0708155 0.2567920 0.00000 0.0000000
nox 4 466 0.5543105 0.1166667 0.53800 0.5442684
rm 5 466 6.2906738 0.7048513 6.21000 6.2570615
age 6 466 68.3675966 28.3213784 77.15000 70.9553476
dis 7 466 3.7956929 2.1069496 3.19095 3.5443647
rad 8 466 9.5300429 8.6859272 5.00000 8.6978610
tax 9 466 409.5021459 167.9000887 334.50000 401.5080214
ptratio 10 466 18.3984979 2.1968447 18.90000 18.5970588
lstat 11 466 12.6314592 7.1018907 11.35000 11.8809626
medv 12 466 22.5892704 9.2396814 21.20000 21.6304813
target 13 466 0.4914163 0.5004636 0.00000 0.4893048
trimmed mad min max range skew kurtosis se
zn 5.3542781 0.0000000 0.0000 100.0000 100.0000 2.1768152 3.8135765 1.0823466
indus 10.9082353 9.3403800 0.4600 27.7400 27.2800 0.2885450 -1.2432132 0.3171281
chas 0.0000000 0.0000000 0.0000 1.0000 1.0000 3.3354899 9.1451313 0.0118957
nox 0.5442684 0.1334340 0.3890 0.8710 0.4820 0.7463281 -0.0357736 0.0054045
rm 6.2570615 0.5166861 3.8630 8.7800 4.9170 0.4793202 1.5424378 0.0326516
age 70.9553476 30.0226500 2.9000 100.0000 97.1000 -0.5777075 -1.0098814 1.3119625
dis 3.5443647 1.9144814 1.1296 12.1265 10.9969 0.9988926 0.4719679 0.0976026
rad 8.6978610 1.4826000 1.0000 24.0000 23.0000 1.0102788 -0.8619110 0.4023678
tax 401.5080214 104.5233000 187.0000 711.0000 524.0000 0.6593136 -1.1480456 7.7778214
ptratio 18.5970588 1.9273800 12.6000 22.0000 9.4000 -0.7542681 -0.4003627 0.1017669
lstat 11.8809626 7.0720020 1.7300 37.9700 36.2400 0.9055864 0.5033688 0.3289887
medv 21.6304813 6.0045300 5.0000 50.0000 45.0000 1.0766920 1.3737825 0.4280200
target 0.4893048 0.0000000 0.0000 1.0000 1.0000 0.0342293 -2.0031131 0.0231835

There doesn’t seem to be much immediately concerning about the data. We see some skewness in a few of the non-binary variables, and the variable “tax” has a scale that’s an order of magnitude different from all the others, but we don’t see anything of major concern here. All of these are relatively easy issues to correct.

Mode and Unique Entry Count

Here we consider the mode (not included above) and a count of the number of unique entries in each column.
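
Base R has no built-in mode function, so a small helper is needed; a sketch (the helper itself is an assumption):

# most frequent value(s); ties return multiple values, which explains
# the multi-valued entries for rm, dis and lstat below
stat_mode <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))
  sort(ux[counts == max(counts)])
}
lapply(data, stat_mode)                                          # modes
data.frame(x = sapply(data, function(col) length(unique(col))))  # unique counts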

## $zn
## [1] 0
## 
## $indus
## [1] 18.1
## 
## $chas
## [1] 0
## 
## $nox
## [1] 0.538
## 
## $rm
## [1] 5.713 6.127 6.167 6.229 6.405 6.417
## 
## $age
## [1] 100
## 
## $dis
## [1] 3.4952 5.2873 5.4007 5.7209 6.8147
## 
## $rad
## [1] 24
## 
## $tax
## [1] 666
## 
## $ptratio
## [1] 20.2
## 
## $lstat
## [1] 6.36 7.79 8.05
## 
## $medv
## [1] 50
## 
## $target
## [1] 0
x
zn 26
indus 73
chas 2
nox 79
rm 419
age 333
dis 380
rad 9
tax 63
ptratio 46
lstat 424
medv 218
target 2

Here we see that we might be able to convert some of these variables to binary variables. ZN and AGE both represent proportions, each biased towards an extreme (0% and 100%, respectively). We may be able to simplify by converting these columns to a binary “0” or “1”.

If we look at the unique-entry count for each column, we see that “rad” contains only 9 distinct values (likely representative of neighborhoods or something similar), which suggests that this variable is more categorical than numerical; as such, we’ll convert it to a factor with 9 levels.

Distribution Plots

Next we look at various visualizations of the data to get a feel for the shapes of the distributions and a sense of whether there are any major outliers to be concerned about.

We do not see much in the way of outliers here either, other than tax, which needs to be re-scaled. The histograms and boxplots give a good visualization of all of the variables, and everything looks coherent at this stage. The max-to-median ratios show that no variable contains absurd outliers.

Relationship to Target

Next we consider the correlation and coefficient of determination of our predictors to our target.

## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
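
The plotting code itself isn't shown; conceptually it computes something like the following sketch:

# pairwise Pearson correlation of every variable with the target,
# squared to get each variable's coefficient of determination
cors <- cor(data, use = "pairwise.complete.obs", method = "pearson")["target", ]
r2   <- cors[names(cors) != "target"]^2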

In terms of correlation to the target, we see several variables with an absolute correlation of roughly 0.5 or greater. The R² plot shows the explanatory power of each variable and suggests that we might want to focus on variables like “nox”, “age” and “rad” while ignoring “chas” and “rm”. Intuition suggests that we should be able to get a reasonably good prediction with R² values such as these.

Here we see that we might have some aliasing issues: “tax” and “rad” appear to be highly related to one another. These variables represent tax burden and highway accessibility, respectively, and both may be measures of the same underlying thing (a specific geographic location or neighborhood, for example). Otherwise, we don’t see much inter-relatedness between variables, which is a good thing.

Data Preparation

There are a few minor changes that we can make based on the above observations:

Convert zn and age

We’ll take the proportional data and convert it to binary columns - this is an exercise in logistic regression, after all! We’ll also retain the original columns so that we can assess the impact of this change if desired.
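
A sketch of one plausible encoding, consistent with the summary statistics shown later (the exact cutoffs are assumptions: zn is flagged when non-zero, age when it sits at the 100% extreme):

data$znBin  <- ifelse(data$zn > 0, 1, 0)      # any non-zero zoning proportion
data$ageBin <- ifelse(data$age == 100, 1, 0)  # tract at the 100% extreme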

Adjust non-Normal Vars

We’ll correct for the skew where required. First, we’ll visually inspect the suspect variables more closely:

vars n mean sd median trimmed mad min max range skew kurtosis se
zn 1 466 11.5772532 23.3646511 0.00000 5.3542781 0.0000000 0.0000 100.0000 100.0000 2.1768152 3.8135765 1.0823466
indus 2 466 11.1050215 6.8458549 9.69000 10.9082353 9.3403800 0.4600 27.7400 27.2800 0.2885450 -1.2432132 0.3171281
chas 3 466 0.0708155 0.2567920 0.00000 0.0000000 0.0000000 0.0000 1.0000 1.0000 3.3354899 9.1451313 0.0118957
nox 4 466 0.5543105 0.1166667 0.53800 0.5442684 0.1334340 0.3890 0.8710 0.4820 0.7463281 -0.0357736 0.0054045
rm 5 466 6.2906738 0.7048513 6.21000 6.2570615 0.5166861 3.8630 8.7800 4.9170 0.4793202 1.5424378 0.0326516
age 6 466 68.3675966 28.3213784 77.15000 70.9553476 30.0226500 2.9000 100.0000 97.1000 -0.5777075 -1.0098814 1.3119625
dis 7 466 3.7956929 2.1069496 3.19095 3.5443647 1.9144814 1.1296 12.1265 10.9969 0.9988926 0.4719679 0.0976026
rad 8 466 9.5300429 8.6859272 5.00000 8.6978610 1.4826000 1.0000 24.0000 23.0000 1.0102788 -0.8619110 0.4023678
tax 9 466 409.5021459 167.9000887 334.50000 401.5080214 104.5233000 187.0000 711.0000 524.0000 0.6593136 -1.1480456 7.7778214
ptratio 10 466 18.3984979 2.1968447 18.90000 18.5970588 1.9273800 12.6000 22.0000 9.4000 -0.7542681 -0.4003627 0.1017669
lstat 11 466 12.6314592 7.1018907 11.35000 11.8809626 7.0720020 1.7300 37.9700 36.2400 0.9055864 0.5033688 0.3289887
medv 12 466 22.5892704 9.2396814 21.20000 21.6304813 6.0045300 5.0000 50.0000 45.0000 1.0766920 1.3737825 0.4280200
target 13 466 0.4914163 0.5004636 0.00000 0.4893048 0.0000000 0.0000 1.0000 1.0000 0.0342293 -2.0031131 0.0231835
znBin 14 466 0.2725322 0.4457407 0.00000 0.2165775 0.0000000 0.0000 1.0000 1.0000 1.0184383 -0.9648402 0.0206485
ageBin 15 466 0.0901288 0.2866739 0.00000 0.0000000 0.0000000 0.0000 1.0000 1.0000 2.8533585 6.1548765 0.0132799

Next we’ll apply the transformations and re-inspect:
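
The exact transforms aren't printed, but the ones below reproduce the min/max values in the re-inspected summary (for example, log(0.389) ≈ -0.944 for nox and sqrt(711) ≈ 26.66 for tax), so this is a reasonable reconstruction:

data$nox <- log(data$nox)   # right-skewed, strictly positive
data$dis <- log(data$dis)
data$rad <- log(data$rad)   # replaced by a factor in the next step anyway
data$tax <- sqrt(data$tax)  # milder correction for the large-scale tax variable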

vars n mean sd median trimmed mad min max range skew kurtosis se
zn 1 466 11.5772532 23.3646511 0.0000000 5.3542781 0.0000000 0.0000000 100.0000000 100.0000000 2.1768152 3.8135765 1.0823466
indus 2 466 11.1050215 6.8458549 9.6900000 10.9082353 9.3403800 0.4600000 27.7400000 27.2800000 0.2885450 -1.2432132 0.3171281
chas 3 466 0.0708155 0.2567920 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 1.0000000 3.3354899 9.1451313 0.0118957
nox 4 466 -0.6109729 0.2026121 -0.6198967 -0.6202169 0.2549574 -0.9441759 -0.1381133 0.8060626 0.3709699 -0.7234642 0.0093858
rm 5 466 6.2906738 0.7048513 6.2100000 6.2570615 0.5166861 3.8630000 8.7800000 4.9170000 0.4793202 1.5424378 0.0326516
age 6 466 68.3675966 28.3213784 77.1500000 70.9553476 30.0226500 2.9000000 100.0000000 97.1000000 -0.5777075 -1.0098814 1.3119625
dis 7 466 1.1875848 0.5412299 1.1603153 1.1782986 0.6732974 0.1218636 2.4953931 2.3735296 0.1426949 -1.0086881 0.0250720
rad 8 466 1.8706996 0.8653619 1.6094379 1.8764842 0.3308326 0.0000000 3.1780538 3.1780538 0.3191654 -0.6307697 0.0400871
tax 9 466 19.8348783 4.0142657 18.2893361 19.7342218 3.0263710 13.6747943 26.6645833 12.9897889 0.5091007 -1.1856441 0.1859573
ptratio 10 466 18.3984979 2.1968447 18.9000000 18.5970588 1.9273800 12.6000000 22.0000000 9.4000000 -0.7542681 -0.4003627 0.1017669
lstat 11 466 12.6314592 7.1018907 11.3500000 11.8809626 7.0720020 1.7300000 37.9700000 36.2400000 0.9055864 0.5033688 0.3289887
medv 12 466 22.5892704 9.2396814 21.2000000 21.6304813 6.0045300 5.0000000 50.0000000 45.0000000 1.0766920 1.3737825 0.4280200
target 13 466 0.4914163 0.5004636 0.0000000 0.4893048 0.0000000 0.0000000 1.0000000 1.0000000 0.0342293 -2.0031131 0.0231835
znBin 14 466 0.2725322 0.4457407 0.0000000 0.2165775 0.0000000 0.0000000 1.0000000 1.0000000 1.0184383 -0.9648402 0.0206485
ageBin 15 466 0.0901288 0.2866739 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 1.0000000 2.8533585 6.1548765 0.0132799

We can see that we get much more normal-looking results after the transformation.

Converting Categorical data to Factor

For the variable “rad”, which we previously identified as having only 9 distinct values, we’ll convert it to a factor and encode the values as “1-9” to make life easier.
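
A sketch, assuming the nine distinct values are simply relabeled in ascending order:

data$rad <- factor(data$rad, labels = 1:9)  # nine distinct values -> levels "1".."9"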

Split the Data After Transforms

Finally, we’ll split the data into a train and a test set after our transforms. This way, we can perform some model evaluation using the testing set.
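
The row counts in the model output below (349 training rows, 117 testing rows) imply roughly a 75/25 split; a sketch, with the seed as an assumption:

set.seed(123)  # assumed; the actual seed isn't shown
idx <- sample(seq_len(nrow(data)), floor(0.75 * nrow(data)))
data.train <- data[idx, ]
data.test  <- data[-idx, ]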

Building Models

First, we’ll build a model using all of the data exactly as it appears in the original file.
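
A sketch of the fit and its evaluation (the 0.5 probability cutoff is an assumption; the confusion-matrix output below matches caret's confusionMatrix()):

model.orig <- glm(target ~ ., family = "binomial", data = data.orig.train)
summary(model.orig)

library(caret)
probs <- predict(model.orig, newdata = data.orig.test, type = "response")
preds <- factor(ifelse(probs > 0.5, 1, 0))
confusionMatrix(preds, factor(data.orig.test$target), positive = "1")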

## 
## Call:
## glm(formula = data.orig.train$target ~ ., family = "binomial", 
##     data = data.orig.train[, ])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.00434  -0.11934   0.00000   0.00125   3.07110  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -43.398974   8.026598  -5.407 6.41e-08 ***
## zn           -0.178760   0.086629  -2.064 0.039064 *  
## indus        -0.137292   0.065978  -2.081 0.037445 *  
## chas          0.901319   0.845654   1.066 0.286503    
## nox          59.004084  11.013512   5.357 8.44e-08 ***
## rm           -0.963916   0.878829  -1.097 0.272721    
## age           0.051233   0.017587   2.913 0.003577 ** 
## dis           0.771590   0.296352   2.604 0.009224 ** 
## rad           0.759221   0.229646   3.306 0.000946 ***
## tax          -0.008198   0.003818  -2.147 0.031760 *  
## ptratio       0.347956   0.168622   2.064 0.039063 *  
## lstat         0.061905   0.073619   0.841 0.400412    
## medv          0.176052   0.079204   2.223 0.026232 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 483.81  on 348  degrees of freedom
## Residual deviance: 137.33  on 336  degrees of freedom
## AIC: 163.33
## 
## Number of Fisher Scoring iterations: 9
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 59  9
##          1  3 46
##                                           
##                Accuracy : 0.8974          
##                  95% CI : (0.8277, 0.9459)
##     No Information Rate : 0.5299          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7929          
##                                           
##  Mcnemar's Test P-Value : 0.1489          
##                                           
##             Sensitivity : 0.8364          
##             Specificity : 0.9516          
##          Pos Pred Value : 0.9388          
##          Neg Pred Value : 0.8676          
##              Prevalence : 0.4701          
##          Detection Rate : 0.3932          
##    Detection Prevalence : 0.4188          
##       Balanced Accuracy : 0.8940          
##                                           
##        'Positive' Class : 1               
## 

## # A tibble: 3 x 22
##   .rownames data.orig.train~    zn indus  chas   nox    rm   age   dis
##   <chr>                <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 62                       1     0  9.9      0 0.544  4.97  37.8  2.52
## 2 304                      0     0  9.69     0 0.585  5.67  28.8  2.80
## 3 457                      1     0 10.6      0 0.489  5.41   9.8  3.59
## # ... with 13 more variables: rad <int>, tax <int>, ptratio <dbl>,
## #   lstat <dbl>, medv <dbl>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## #   .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>

We get a surprisingly decent result, with an accuracy near 90% and an AIC of 163 - my intuition is that we are already near the point of diminishing returns in this particular assignment, and it may be hard to improve upon this model. However, we do see some sizable outliers in the Cook’s D plot, and the residuals plot shows some values > 3.

Next, we’ll run the same thing with the modified data, using all the variables. Note also that we’ll run this model twice: first with the original proportional ZN and AGE variables, then with their binary replacements.

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = data.train$target ~ ., family = "binomial", data = data.train[, 
##     -c(ncol(data.train) - 1, ncol(data.train) - 2)])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.32383  -0.00102   0.00000   0.00003   3.03244  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   31.38596   10.06861   3.117 0.001826 ** 
## zn            -0.27736    0.16098  -1.723 0.084903 .  
## indus         -0.27203    0.16571  -1.642 0.100678    
## chas          -0.59856    1.21751  -0.492 0.622982    
## nox           41.64628   10.71680   3.886 0.000102 ***
## rm            -1.85110    1.47715  -1.253 0.210149    
## age            0.02958    0.02078   1.423 0.154711    
## dis            0.93825    1.80340   0.520 0.602878    
## rad2          24.45937 2195.68961   0.011 0.991112    
## rad3          -2.90651    1.72127  -1.689 0.091298 .  
## rad4         -20.96470 4525.58117  -0.005 0.996304    
## rad5         -19.34965 5361.28382  -0.004 0.997120    
## rad6          -8.03856 7076.31647  -0.001 0.999094    
## rad7           2.40880    0.85061   2.832 0.004628 ** 
## rad8           9.01548    6.18967   1.457 0.145244    
## rad9         -19.76591 5742.74294  -0.003 0.997254    
## tax           -0.34797    0.33414  -1.041 0.297692    
## ptratio        0.12692    0.24609   0.516 0.606025    
## lstat          0.11121    0.09349   1.190 0.234228    
## medv           0.30831    0.13645   2.259 0.023853 *  
## ageBin        18.00070 3684.00775   0.005 0.996101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 483.814  on 348  degrees of freedom
## Residual deviance:  72.359  on 328  degrees of freedom
## AIC: 114.36
## 
## Number of Fisher Scoring iterations: 20
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = data.train$target ~ ., family = "binomial", data = data.train[, 
##     -c(1, 6)])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.36372  -0.00308   0.00000   0.00002   3.03065  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   30.53433   10.33514   2.954  0.00313 ** 
## indus         -0.33242    0.15609  -2.130  0.03320 *  
## chas          -0.09346    1.14803  -0.081  0.93512    
## nox           44.73128   10.42712   4.290 1.79e-05 ***
## rm            -0.69199    1.27802  -0.541  0.58819    
## dis            0.65532    1.89771   0.345  0.72985    
## rad2          23.90571 2114.21608   0.011  0.99098    
## rad3          -3.63477    1.64934  -2.204  0.02754 *  
## rad4         -20.28378 4696.78990  -0.004  0.99655    
## rad5         -19.46388 5660.42644  -0.003  0.99726    
## rad6         -11.71883 7179.94527  -0.002  0.99870    
## rad7           2.48793    0.83783   2.970  0.00298 ** 
## rad8           8.04944    4.93686   1.630  0.10300    
## rad9         -18.36249 6074.08684  -0.003  0.99759    
## tax           -0.22495    0.31289  -0.719  0.47217    
## ptratio        0.01028    0.22674   0.045  0.96384    
## lstat          0.15562    0.09351   1.664  0.09605 .  
## medv           0.22290    0.11226   1.985  0.04709 *  
## znBin         -5.60072    3.94063  -1.421  0.15524    
## ageBin        18.83165 3645.07888   0.005  0.99588    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 483.81  on 348  degrees of freedom
## Residual deviance:  74.30  on 329  degrees of freedom
## AIC: 114.3
## 
## Number of Fisher Scoring iterations: 20
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 61  3
##          1  1 52
##                                           
##                Accuracy : 0.9658          
##                  95% CI : (0.9148, 0.9906)
##     No Information Rate : 0.5299          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9312          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##             Sensitivity : 0.9455          
##             Specificity : 0.9839          
##          Pos Pred Value : 0.9811          
##          Neg Pred Value : 0.9531          
##              Prevalence : 0.4701          
##          Detection Rate : 0.4444          
##    Detection Prevalence : 0.4530          
##       Balanced Accuracy : 0.9647          
##                                           
##        'Positive' Class : 1               
## 
## # A tibble: 3 x 22
##   .rownames data.train.targ~ indus  chas    nox    rm   dis rad     tax
##   <chr>                <int> <dbl> <int>  <dbl> <dbl> <dbl> <fct> <dbl>
## 1 240                      0 21.9      0 -0.472  5.86 0.512 7      20.9
## 2 249                      1  8.56     0 -0.654  6.23 0.934 1      19.6
## 3 354                      1 10.6      1 -0.715  5.40 1.30  7      16.6
## # ... with 13 more variables: ptratio <dbl>, lstat <dbl>, medv <dbl>,
## #   znBin <dbl>, ageBin <dbl>, .fitted <dbl>, .se.fit <dbl>, .resid <dbl>,
## #   .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, index <int>

With the modified data, we can see that the AIC drops to 114 (a ~30% drop from the last model) and the accuracy of the model improves to nearly 97%! Once again, we see a few large outliers which we would like to address if possible.

Note also that it makes almost no difference whether we use binary or proportional data for ZN and AGE. Given that binary variables are simpler, I’ve opted to take the (incredibly small) performance hit here and stick with the binary variables.

Finally, we’ll try cherry-picking some of the more important variables based on what we’ve observed above.

Using regsubsets

First we’ll use regsubsets() from the leaps package to sub-select the important variables.
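
A sketch of the call, consistent with the output below (selecting by lowest Cp, as noted afterwards):

library(leaps)
fits <- regsubsets(target ~ ., data = data.orig.train, nvmax = 12)
summary(fits)                  # the selection table shown below
which.min(summary(fits)$cp)    # prints 4, the model size chosen next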

## Subset selection object
## Call: regsubsets.formula(data.orig.train$target ~ ., , data = data.orig.train, 
##     nvmax = 12)
## 12 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 )  " " " "   " "  "*" " " " " " " " " " " " "     " "   " " 
## 2  ( 1 )  " " " "   " "  "*" " " " " " " "*" " " " "     " "   " " 
## 3  ( 1 )  " " " "   " "  "*" " " "*" " " "*" " " " "     " "   " " 
## 4  ( 1 )  " " " "   " "  "*" " " "*" " " "*" " " " "     " "   "*" 
## 5  ( 1 )  " " " "   " "  "*" " " "*" " " "*" "*" " "     " "   "*" 
## 6  ( 1 )  " " " "   "*"  "*" " " "*" " " "*" "*" " "     " "   "*" 
## 7  ( 1 )  " " " "   "*"  "*" " " "*" " " "*" "*" "*"     " "   "*" 
## 8  ( 1 )  " " " "   "*"  "*" " " "*" " " "*" "*" "*"     "*"   "*" 
## 9  ( 1 )  " " " "   "*"  "*" "*" "*" " " "*" "*" "*"     "*"   "*" 
## 10  ( 1 ) "*" " "   "*"  "*" "*" "*" " " "*" "*" "*"     "*"   "*" 
## 11  ( 1 ) "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
## [1] 4

The above tells us that iteration #4, which contains only nox, age, rad and medv, should be the best-performing (lowest Cp) model, so let’s try it!

## 
## Call:
## glm(formula = data.train$target ~ nox + age + rad + medv, family = "binomial", 
##     data = data.train[, -c(ncol(data.test) - 1, ncol(data.test) - 
##         2)])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.43430  -0.22886   0.00000   0.00016   2.61216  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    6.29545    2.79550   2.252 0.024322 *  
## nox           17.34888    3.33260   5.206 1.93e-07 ***
## age            0.01467    0.01302   1.126 0.260022    
## rad2          19.54788 1632.34303   0.012 0.990445    
## rad3          -1.10292    0.92653  -1.190 0.233901    
## rad4         -17.63571 3160.23153  -0.006 0.995547    
## rad5         -18.36714 3823.50441  -0.005 0.996167    
## rad6         -16.43258 4890.47714  -0.003 0.997319    
## rad7           2.14043    0.58892   3.635 0.000278 ***
## rad8           4.32561    1.05535   4.099 4.15e-05 ***
## rad9         -19.11541 3722.90641  -0.005 0.995903    
## medv           0.09140    0.03970   2.302 0.021320 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 483.81  on 348  degrees of freedom
## Residual deviance: 115.87  on 337  degrees of freedom
## AIC: 139.87
## 
## Number of Fisher Scoring iterations: 19
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 62  3
##          1  0 52
##                                           
##                Accuracy : 0.9744          
##                  95% CI : (0.9269, 0.9947)
##     No Information Rate : 0.5299          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9484          
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 0.9455          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9538          
##              Prevalence : 0.4701          
##          Detection Rate : 0.4444          
##    Detection Prevalence : 0.4444          
##       Balanced Accuracy : 0.9727          
##                                           
##        'Positive' Class : 1               
## 
## # A tibble: 3 x 14
##   .rownames data.train.targ~    nox   age rad    medv .fitted .se.fit
##   <chr>                <int>  <dbl> <dbl> <fct> <dbl>   <dbl>   <dbl>
## 1 263                      1 -0.536  42.6 3      24.5   -1.24   0.861
## 2 457                      1 -0.715   9.8 7      23.7   -1.67   0.747
## 3 69                       1 -0.536  72.9 3      19.7   -1.24   0.779
## # ... with 6 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>, index <int>

This optimized model blows the original model out of the water, with an AIC of 140 and an accuracy of 97.4%. We can also see that we have no residuals with absolute value > 3, which is good. We have a simpler model (fewer variables) with better predictions and smaller residuals - this is the model we’ll run with.

Model Selection

Based on the above, we’ll go with model_alpha, as it shows superior performance, lower error, and fewer predictors than the other two models that were tested.

First we’ll prep the out-of-sample data in the same way as was done in-sample:
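
A sketch, repeating the training-set transforms on the evaluation data (data.eval is a placeholder name):

data.eval$nox    <- log(data.eval$nox)
data.eval$dis    <- log(data.eval$dis)
data.eval$tax    <- sqrt(data.eval$tax)
data.eval$rad    <- factor(data.eval$rad, labels = 1:9)
data.eval$znBin  <- ifelse(data.eval$zn > 0, 1, 0)
data.eval$ageBin <- ifelse(data.eval$age == 100, 1, 0)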

Next we’ll make our predictions:
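
Assuming the same 0.5 cutoff as before:

p <- ifelse(predict(model_alpha, newdata = data.eval, type = "response") > 0.5, 1, 0)
data.frame(p)  # the predictions listed below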

p
1 0
2 1
3 1
4 1
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 1
14 1
15 0
16 1
17 0
18 0
19 0
20 0
21 0
22 0
23 1
24 0
25 0
26 0
27 1
28 1
29 1
30 1
31 1
32 1
33 1
34 1
35 1
36 1
37 1
38 1
39 0
40 1