Visit my website for more like this!

Data Sources:

Heavily borrowed from:

require(knitr)
## Loading required package: knitr

Goals

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

Load the data

library(ISLR)
library(MASS)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
attach(Boston)
help(Boston) # gives info about the variables of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Create response variable
Boston$resp <- "No"
Boston$resp[crim > median(crim)] <- 'Yes'
Boston$resp <-factor(Boston$resp)
table(Boston$resp)
## 
##  No Yes 
## 253 253
# Drop the original crim variable (column 1)
Boston <- Boston[, -1]

The next thing we need to do is split the data into training and test sets.
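Note that createDataPartition draws a random sample, so the exact split varies from run to run; the original output was produced without a recorded seed. Setting one (the value is arbitrary) makes the split reproducible:

# Optional: fix the RNG seed so the partition below is reproducible
set.seed(123)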

inTrain <- createDataPartition(y = Boston$resp, p = 0.75, list = FALSE)

train <- Boston[inTrain,]
test <- Boston[-inTrain,]
nzv <- nearZeroVar(train, saveMetrics = TRUE)
nzv
##         freqRatio percentUnique zeroVar   nzv
## zn         17.125        6.5789   FALSE FALSE
## indus       4.857       18.9474   FALSE FALSE
## chas       14.833        0.5263   FALSE FALSE
## nox         1.143       20.5263   FALSE FALSE
## rm          1.000       90.5263   FALSE FALSE
## age        11.667       74.4737   FALSE FALSE
## dis         1.000       82.8947   FALSE FALSE
## rad         1.259        2.3684   FALSE FALSE
## tax         3.290       17.1053   FALSE FALSE
## ptratio     4.240       12.1053   FALSE FALSE
## black      46.500       71.8421   FALSE FALSE
## lstat       1.000       90.7895   FALSE FALSE
## medv        2.143       53.9474   FALSE FALSE
## resp        1.000        0.5263   FALSE FALSE
Cor <- cor(train[,-14])
Cor
##               zn    indus      chas     nox       rm     age      dis
## zn       1.00000 -0.53549 -0.026473 -0.5250  0.31489 -0.5745  0.66451
## indus   -0.53549  1.00000  0.074446  0.7778 -0.39419  0.6431 -0.69020
## chas    -0.02647  0.07445  1.000000  0.1058  0.08437  0.0861 -0.09698
## nox     -0.52497  0.77777  0.105766  1.0000 -0.30658  0.7456 -0.76693
## rm       0.31489 -0.39419  0.084365 -0.3066  1.00000 -0.2392  0.19477
## age     -0.57449  0.64305  0.086100  0.7456 -0.23916  1.0000 -0.73289
## dis      0.66451 -0.69020 -0.096984 -0.7669  0.19477 -0.7329  1.00000
## rad     -0.32958  0.64186  0.004018  0.6564 -0.25732  0.4872 -0.51689
## tax     -0.34114  0.77401 -0.011177  0.7071 -0.34184  0.5382 -0.54999
## ptratio -0.40617  0.41163 -0.130817  0.2374 -0.39385  0.2881 -0.24587
## black    0.18651 -0.37516  0.065125 -0.3531  0.14337 -0.2740  0.28355
## lstat   -0.41840  0.59965 -0.086266  0.6006 -0.63426  0.5834 -0.47855
## medv     0.34953 -0.46928  0.215232 -0.4203  0.72468 -0.3594  0.22007
##               rad      tax ptratio    black    lstat    medv
## zn      -0.329579 -0.34114 -0.4062  0.18651 -0.41840  0.3495
## indus    0.641863  0.77401  0.4116 -0.37516  0.59965 -0.4693
## chas     0.004018 -0.01118 -0.1308  0.06512 -0.08627  0.2152
## nox      0.656410  0.70706  0.2374 -0.35308  0.60060 -0.4203
## rm      -0.257319 -0.34184 -0.3939  0.14337 -0.63426  0.7247
## age      0.487234  0.53819  0.2881 -0.27405  0.58340 -0.3594
## dis     -0.516887 -0.54999 -0.2459  0.28355 -0.47855  0.2201
## rad      1.000000  0.91188  0.4770 -0.47595  0.54082 -0.3882
## tax      0.911883  1.00000  0.4871 -0.46785  0.60040 -0.4777
## ptratio  0.476980  0.48714  1.0000 -0.23353  0.42103 -0.5283
## black   -0.475948 -0.46785 -0.2335  1.00000 -0.36606  0.3297
## lstat    0.540823  0.60040  0.4210 -0.36606  1.00000 -0.7436
## medv    -0.388190 -0.47772 -0.5283  0.32967 -0.74364  1.0000
highCor <- findCorrelation(Cor, cutoff = 0.75)
highCor
## [1] 2 9 4

No variables have near-zero variance. However, we see that industrial land share (indus) and NOx are highly correlated with each other (and with other variables), and property tax (tax) is also highly correlated with several predictors; findCorrelation flags columns 2, 9, and 4 (indus, tax, and nox). For now we will remove indus and tax.

train_cor <- train[, -c(2, 9)]   # drop indus and tax
test_cor  <- test[, -c(2, 9)]
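
Equivalently, and a little less fragile, the flagged columns can be dropped by name (a sketch using the objects above):

# Same result, selecting columns by name instead of position
keep <- setdiff(colnames(train), c('indus', 'tax'))
train_cor <- train[, keep]
test_cor  <- test[, keep]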

Now let’s try a few models…

KNN

knnGrid <- expand.grid(.k = 2)
# Hold the number of neighbors fixed at k = 2 (k counts neighbors, not classes)
KNN <- train(x=train_cor[,-12], method='knn',
             y=train_cor$resp, 
             preProcess=c('center', 'scale'), 
             tuneGrid = knnGrid)
KNN
## k-Nearest Neighbors 
## 
## 380 samples
##  11 predictors
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered, scaled 
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## 
## Resampling results
## 
##   Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.9       0.7    0.03         0.06    
## 
## Tuning parameter 'k' was held constant at a value of 2
## 
confusionMatrix(predict(KNN, test_cor[,-12]), test_cor$resp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  57   9
##        Yes  6  54
##                                         
##                Accuracy : 0.881         
##                  95% CI : (0.811, 0.932)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.762         
##  Mcnemar's Test P-Value : 0.606         
##                                         
##             Sensitivity : 0.905         
##             Specificity : 0.857         
##          Pos Pred Value : 0.864         
##          Neg Pred Value : 0.900         
##              Prevalence : 0.500         
##          Detection Rate : 0.452         
##    Detection Prevalence : 0.524         
##       Balanced Accuracy : 0.881         
##                                         
##        'Positive' Class : No            
## 
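
As an aside, rather than holding k fixed, it is more common to let caret tune the number of neighbors over a grid and pick the best value by resampling. A sketch under the same setup (older caret versions name the grid column .k, newer ones k):

# Sketch: tune the number of neighbors rather than fixing k = 2
knnGrid2 <- expand.grid(.k = seq(1, 15, by = 2))
KNN_tuned <- train(x = train_cor[,-12], y = train_cor$resp,
                   method = 'knn',
                   preProcess = c('center', 'scale'),
                   tuneGrid = knnGrid2,
                   trControl = trainControl(method = 'cv', number = 10))
KNN_tuned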

This gives us pretty good results; however, let’s try using principal component analysis instead of removing correlated variables.

KNN <- train(x=train[,-14], method='knn',
             y=train$resp, 
             preProcess=c('center', 'scale', 'pca'), 
             tuneGrid = knnGrid)
KNN
## k-Nearest Neighbors 
## 
## 380 samples
##  13 predictors
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered, scaled, principal component signal extraction 
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## 
## Resampling results
## 
##   Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.9       0.7    0.03         0.06    
## 
## Tuning parameter 'k' was held constant at a value of 2
## 
confusionMatrix(predict(KNN, test[,-14]), test$resp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  56   5
##        Yes  7  58
##                                       
##                Accuracy : 0.905       
##                  95% CI : (0.84, 0.95)
##     No Information Rate : 0.5         
##     P-Value [Acc > NIR] : <2e-16      
##                                       
##                   Kappa : 0.81        
##  Mcnemar's Test P-Value : 0.773       
##                                       
##             Sensitivity : 0.889       
##             Specificity : 0.921       
##          Pos Pred Value : 0.918       
##          Neg Pred Value : 0.892       
##              Prevalence : 0.500       
##          Detection Rate : 0.444       
##    Detection Prevalence : 0.484       
##       Balanced Accuracy : 0.905       
##                                       
##        'Positive' Class : No          
## 

Indeed, this model does a bit better: test accuracy improves from about 88% to about 90%.
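
It can be worth checking how many principal components caret actually retained. By default the pca pre-processing keeps enough components to explain 95% of the variance; the threshold can be changed through trainControl's preProcOptions. A quick sketch (not part of the original run):

# Inspect the PCA pre-processing that train() applied
KNN$preProcess
# To retain components explaining only 80% of the variance instead:
# ctrl <- trainControl(preProcOptions = list(thresh = 0.80))
# KNN80 <- train(x = train[,-14], y = train$resp, method = 'knn',
#                preProcess = c('center', 'scale', 'pca'),
#                tuneGrid = knnGrid, trControl = ctrl)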

Logistic Regression

logit <- train(resp~., data=train, 
               method='glm', family=binomial(link='logit'),
               preProcess=c('scale', 'center'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.170  -0.125   0.000   0.001   3.461  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.2401     0.8371    3.87  0.00011 ***
## zn           -2.2104     0.9356   -2.36  0.01815 *  
## indus        -0.0898     0.3733   -0.24  0.80982    
## chas          0.3581     0.2507    1.43  0.15323    
## nox           5.7468     1.0122    5.68  1.4e-08 ***
## rm           -0.1818     0.6197   -0.29  0.76927    
## age           0.3849     0.3878    0.99  0.32094    
## dis           1.5544     0.5447    2.85  0.00432 ** 
## rad           7.2335     1.6490    4.39  1.2e-05 ***
## tax          -1.6351     0.6203   -2.64  0.00839 ** 
## ptratio       0.7714     0.3277    2.35  0.01856 *  
## black        -0.9310     0.5499   -1.69  0.09045 .  
## lstat         0.4968     0.4119    1.21  0.22772    
## medv          1.4587     0.8121    1.80  0.07247 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 526.79  on 379  degrees of freedom
## Residual deviance: 143.52  on 366  degrees of freedom
## AIC: 171.5
## 
## Number of Fisher Scoring iterations: 9
plot(logit$finalModel)

[Figure: diagnostic plots for the fitted logistic regression model]

confusionMatrix(predict(logit, test[,-14]), test$resp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  50   8
##        Yes 13  55
##                                         
##                Accuracy : 0.833         
##                  95% CI : (0.757, 0.894)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : 6.27e-15      
##                                         
##                   Kappa : 0.667         
##  Mcnemar's Test P-Value : 0.383         
##                                         
##             Sensitivity : 0.794         
##             Specificity : 0.873         
##          Pos Pred Value : 0.862         
##          Neg Pred Value : 0.809         
##              Prevalence : 0.500         
##          Detection Rate : 0.397         
##    Detection Prevalence : 0.460         
##       Balanced Accuracy : 0.833         
##                                         
##        'Positive' Class : No            
## 

Most of these variables are not significant in the model; for a simpler example, let’s refit using only the two strongest predictors, rad and nox.
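
A more systematic alternative to hand-picking predictors is stepwise selection by AIC on an ordinary glm fit (a sketch using MASS::stepAIC, outside of caret and not part of the original analysis):

# Sketch: stepwise AIC selection on a plain glm
full_glm <- glm(resp ~ ., data = train, family = binomial)
step_glm <- MASS::stepAIC(full_glm, direction = 'both', trace = FALSE)
summary(step_glm)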

library(car) # provides vif()

logit <- train(resp ~ rad + nox,
               data=train, 
               method='glm', family=binomial(link='logit'),
               preProcess=c('scale', 'center'))
summary(logit)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9882  -0.3293  -0.0245   0.0038   2.5937  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    2.922      0.607    4.81  1.5e-06 ***
## rad            4.903      1.017    4.82  1.4e-06 ***
## nox            3.282      0.429    7.66  1.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 526.79  on 379  degrees of freedom
## Residual deviance: 188.13  on 377  degrees of freedom
## AIC: 194.1
## 
## Number of Fisher Scoring iterations: 8
plot(logit$finalModel, which=1)

[Figure: residuals vs. fitted values for the reduced logistic regression model]

vif(logit$finalModel)
##   rad   nox 
## 1.136 1.136
confusionMatrix(predict(logit, test[,-14]), test$resp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  53  14
##        Yes 10  49
##                                        
##                Accuracy : 0.81         
##                  95% CI : (0.73, 0.874)
##     No Information Rate : 0.5          
##     P-Value [Acc > NIR] : 6.07e-13     
##                                        
##                   Kappa : 0.619        
##  Mcnemar's Test P-Value : 0.54         
##                                        
##             Sensitivity : 0.841        
##             Specificity : 0.778        
##          Pos Pred Value : 0.791        
##          Neg Pred Value : 0.831        
##              Prevalence : 0.500        
##          Detection Rate : 0.421        
##    Detection Prevalence : 0.532        
##       Balanced Accuracy : 0.810        
##                                        
##        'Positive' Class : No           
## 

LDA

LDA <- train(resp~., data=train_cor,
             method='lda', 
             preProcess=c('scale', 'center'))
LDA
## Linear Discriminant Analysis 
## 
## 380 samples
##  11 predictors
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: scaled, centered 
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## 
## Resampling results
## 
##   Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.9       0.7    0.02         0.04    
## 
## 
confusionMatrix(test_cor$resp, predict(LDA, test_cor[,-12]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  56   7
##        Yes 16  47
##                                         
##                Accuracy : 0.817         
##                  95% CI : (0.739, 0.881)
##     No Information Rate : 0.571         
##     P-Value [Acc > NIR] : 4.15e-09      
##                                         
##                   Kappa : 0.635         
##  Mcnemar's Test P-Value : 0.0953        
##                                         
##             Sensitivity : 0.778         
##             Specificity : 0.870         
##          Pos Pred Value : 0.889         
##          Neg Pred Value : 0.746         
##              Prevalence : 0.571         
##          Detection Rate : 0.444         
##    Detection Prevalence : 0.500         
##       Balanced Accuracy : 0.824         
##                                         
##        'Positive' Class : No            
## 
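
To see which predictors carry the discriminant, the coefficients of the underlying MASS::lda fit can be pulled out of the caret object (a quick sketch, not shown in the original run):

# Coefficients of the linear discriminant (on centered/scaled predictors)
LDA$finalModel$scaling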

QDA

QDA <- train(resp~., data=train_cor,
             method='qda', 
             preProcess=c('scale', 'center'))
QDA
## Quadratic Discriminant Analysis 
## 
## 380 samples
##  11 predictors
##   2 classes: 'No', 'Yes' 
## 
## Pre-processing: scaled, centered 
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## 
## Resampling results
## 
##   Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.9       0.7    0.03         0.05    
## 
## 
confusionMatrix(test_cor$resp, predict(QDA, test_cor[,-12]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  56   7
##        Yes 16  47
##                                         
##                Accuracy : 0.817         
##                  95% CI : (0.739, 0.881)
##     No Information Rate : 0.571         
##     P-Value [Acc > NIR] : 4.15e-09      
##                                         
##                   Kappa : 0.635         
##  Mcnemar's Test P-Value : 0.0953        
##                                         
##             Sensitivity : 0.778         
##             Specificity : 0.870         
##          Pos Pred Value : 0.889         
##          Neg Pred Value : 0.746         
##              Prevalence : 0.571         
##          Detection Rate : 0.444         
##    Detection Prevalence : 0.500         
##       Balanced Accuracy : 0.824         
##                                         
##        'Positive' Class : No            
## 

These results are essentially the same as for LDA, and a little below the full logistic regression model.

Random Forests

library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(6)

fun <- train(y=train$resp, x=train[,-14])
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
fun
## Random Forest 
## 
## 380 samples
##  13 predictors
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##   2     1         0.9    0.01         0.02    
##   7     1         0.9    0.01         0.02    
##   10    1         0.9    0.01         0.03    
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 7.
confusionMatrix(test$resp, predict(fun, test[,-14]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  59   4
##        Yes  1  62
##                                        
##                Accuracy : 0.96         
##                  95% CI : (0.91, 0.987)
##     No Information Rate : 0.524        
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.921        
##  Mcnemar's Test P-Value : 0.371        
##                                        
##             Sensitivity : 0.983        
##             Specificity : 0.939        
##          Pos Pred Value : 0.937        
##          Neg Pred Value : 0.984        
##              Prevalence : 0.476        
##          Detection Rate : 0.468        
##    Detection Prevalence : 0.500        
##       Balanced Accuracy : 0.961        
##                                        
##        'Positive' Class : No           
## 
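
The random forest also provides variable-importance scores through caret's varImp wrapper, which is a quick way to see which predictors the forest relies on (a sketch, not run in the original write-up):

# Variable importance from the fitted random forest
rfImp <- varImp(fun)
rfImp
plot(rfImp)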

Moving forward with logistic regression

One of the benefits of logistic regression is that we can begin to understand which variables matter most to the model, and so gain insight into the real-world phenomenon behind the data. While this isn’t the best-performing model, we will use the simpler fit as an example of how to interpret the results.

Interpreting the odds ratios

# Getting odds ratio and 95% conf intervals
exp(cbind(OR = coef(logit$finalModel), confint(logit$finalModel)))
## Waiting for profiling to be done...
##                 OR  2.5 %  97.5 %
## (Intercept)  18.58  5.917   64.78
## rad         134.63 20.001 1099.06
## nox          26.64 12.277   66.69
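
Because the predictors were centered and scaled before fitting, these odds ratios correspond to one-standard-deviation increases in each predictor. To express them per raw unit instead, divide the scaled coefficients by each predictor's standard deviation in the training data; a sketch:

# Convert per-SD odds ratios to per-raw-unit odds ratios
sds <- sapply(train[, c('rad', 'nox')], sd)
exp(coef(logit$finalModel)[c('rad', 'nox')] / sds)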

From these results we can see that accessibility to radial highways (rad) is the dominant predictor of crime. A one-standard-deviation increase in rad multiplies the odds of above-median crime by roughly 135, and a one-standard-deviation increase in NOx concentration multiplies those odds by roughly 27. We can see very clearly what is happening here if we plot the data.

attach(Boston)
## The following objects are masked from Boston (position 10):
## 
##     age, black, chas, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
plot(Boston$resp, rad)

[Figure: boxplot of rad (accessibility to radial highways) for suburbs below vs. above the median crime rate]