Visit my website for more like this!
Heavily borrowed from:
Textbook: An Introduction to Statistical Learning
Textbook: The Elements of Statistical Learning
require(knitr)
## Loading required package: knitr
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
Load the data
library(ISLR)
library(MASS)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
attach(Boston)
help(Boston) # gives info about the variables of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Create response variable
Boston$resp <- "No"
Boston$resp[Boston$crim > median(Boston$crim)] <- "Yes"
Boston$resp <- factor(Boston$resp)
table(Boston$resp)
##
## No Yes
## 253 253
# Drop the original crim variable (column 1)
Boston <- Boston[, -1]
Next, we split the data into training and test sets; createDataPartition performs a stratified split on the response.
inTrain <- createDataPartition(y = Boston$resp, p = 0.75, list = FALSE)
train <- Boston[inTrain,]
test <- Boston[-inTrain,]
nzv <- nearZeroVar(train, saveMetrics = TRUE)
nzv
## freqRatio percentUnique zeroVar nzv
## zn 17.125 6.5789 FALSE FALSE
## indus 4.857 18.9474 FALSE FALSE
## chas 14.833 0.5263 FALSE FALSE
## nox 1.143 20.5263 FALSE FALSE
## rm 1.000 90.5263 FALSE FALSE
## age 11.667 74.4737 FALSE FALSE
## dis 1.000 82.8947 FALSE FALSE
## rad 1.259 2.3684 FALSE FALSE
## tax 3.290 17.1053 FALSE FALSE
## ptratio 4.240 12.1053 FALSE FALSE
## black 46.500 71.8421 FALSE FALSE
## lstat 1.000 90.7895 FALSE FALSE
## medv 2.143 53.9474 FALSE FALSE
## resp 1.000 0.5263 FALSE FALSE
Cor <- cor(train[,-14])
Cor
## zn indus chas nox rm age dis
## zn 1.00000 -0.53549 -0.026473 -0.5250 0.31489 -0.5745 0.66451
## indus -0.53549 1.00000 0.074446 0.7778 -0.39419 0.6431 -0.69020
## chas -0.02647 0.07445 1.000000 0.1058 0.08437 0.0861 -0.09698
## nox -0.52497 0.77777 0.105766 1.0000 -0.30658 0.7456 -0.76693
## rm 0.31489 -0.39419 0.084365 -0.3066 1.00000 -0.2392 0.19477
## age -0.57449 0.64305 0.086100 0.7456 -0.23916 1.0000 -0.73289
## dis 0.66451 -0.69020 -0.096984 -0.7669 0.19477 -0.7329 1.00000
## rad -0.32958 0.64186 0.004018 0.6564 -0.25732 0.4872 -0.51689
## tax -0.34114 0.77401 -0.011177 0.7071 -0.34184 0.5382 -0.54999
## ptratio -0.40617 0.41163 -0.130817 0.2374 -0.39385 0.2881 -0.24587
## black 0.18651 -0.37516 0.065125 -0.3531 0.14337 -0.2740 0.28355
## lstat -0.41840 0.59965 -0.086266 0.6006 -0.63426 0.5834 -0.47855
## medv 0.34953 -0.46928 0.215232 -0.4203 0.72468 -0.3594 0.22007
## rad tax ptratio black lstat medv
## zn -0.329579 -0.34114 -0.4062 0.18651 -0.41840 0.3495
## indus 0.641863 0.77401 0.4116 -0.37516 0.59965 -0.4693
## chas 0.004018 -0.01118 -0.1308 0.06512 -0.08627 0.2152
## nox 0.656410 0.70706 0.2374 -0.35308 0.60060 -0.4203
## rm -0.257319 -0.34184 -0.3939 0.14337 -0.63426 0.7247
## age 0.487234 0.53819 0.2881 -0.27405 0.58340 -0.3594
## dis -0.516887 -0.54999 -0.2459 0.28355 -0.47855 0.2201
## rad 1.000000 0.91188 0.4770 -0.47595 0.54082 -0.3882
## tax 0.911883 1.00000 0.4871 -0.46785 0.60040 -0.4777
## ptratio 0.476980 0.48714 1.0000 -0.23353 0.42103 -0.5283
## black -0.475948 -0.46785 -0.2335 1.00000 -0.36606 0.3297
## lstat 0.540823 0.60040 0.4210 -0.36606 1.00000 -0.7436
## medv -0.388190 -0.47772 -0.5283 0.32967 -0.74364 1.0000
highCor <- findCorrelation(Cor, cutoff = 0.75)
highCor
## [1] 2 9 4
No variables have near-zero variance. However, findCorrelation flags three predictors as highly correlated with the rest: indus, tax, and nox (for example, indus and nox correlate at 0.78, and tax and rad at 0.91). For now we will drop the indus and tax columns.
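To confirm which columns those indices refer to, we can map them back to variable names (a quick check; output not shown):
# Translate the column indices flagged by findCorrelation into names
names(train)[highCor]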
train_cor <- train[, -c(2, 9)] # drop indus and tax
test_cor <- test[, -c(2, 9)]
Now let’s try a few models…
knnGrid <- expand.grid(.k = 2)
# k is the number of neighbours KNN averages over (not the number of classes); hold it fixed at 2 rather than tuning
KNN <- train(x=train_cor[,-12], method='knn',
y=train_cor$resp,
preProcess=c('center', 'scale'),
tuneGrid = knnGrid)
KNN
## k-Nearest Neighbors
##
## 380 samples
## 11 predictors
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered, scaled
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.9 0.7 0.03 0.06
##
## Tuning parameter 'k' was held constant at a value of 2
##
confusionMatrix(predict(KNN, test_cor[,-12]), test_cor$resp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 57 9
## Yes 6 54
##
## Accuracy : 0.881
## 95% CI : (0.811, 0.932)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.762
## Mcnemar's Test P-Value : 0.606
##
## Sensitivity : 0.905
## Specificity : 0.857
## Pos Pred Value : 0.864
## Neg Pred Value : 0.900
## Prevalence : 0.500
## Detection Rate : 0.452
## Detection Prevalence : 0.524
## Balanced Accuracy : 0.881
##
## 'Positive' Class : No
##
This gives us pretty good results. However, let's try using principal component analysis (PCA) instead of removing correlated variables.
KNN <- train(x=train[,-14], method='knn',
y=train$resp,
preProcess=c('center', 'scale', 'pca'),
tuneGrid = knnGrid)
KNN
## k-Nearest Neighbors
##
## 380 samples
## 13 predictors
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered, scaled, principal component signal extraction
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.9 0.7 0.03 0.06
##
## Tuning parameter 'k' was held constant at a value of 2
##
confusionMatrix(predict(KNN, test[,-14]), test$resp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 56 5
## Yes 7 58
##
## Accuracy : 0.905
## 95% CI : (0.84, 0.95)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.81
## Mcnemar's Test P-Value : 0.773
##
## Sensitivity : 0.889
## Specificity : 0.921
## Pos Pred Value : 0.918
## Neg Pred Value : 0.892
## Prevalence : 0.500
## Detection Rate : 0.444
## Detection Prevalence : 0.484
## Balanced Accuracy : 0.905
##
## 'Positive' Class : No
##
Indeed, the PCA-based model does a little better on the test set: 90.5% accuracy versus 88.1%.
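To see how many principal components were retained (caret's default keeps enough components to explain 95% of the variance), we can print the preprocessing object stored in the fit (output not shown):
# The preProcess element records the centering, scaling, and PCA applied to the predictors
KNN$preProcess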
logit <- train(resp~., data=train,
method='glm', family=binomial(link='logit'),
preProcess=c('scale', 'center'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.170 -0.125 0.000 0.001 3.461
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.2401 0.8371 3.87 0.00011 ***
## zn -2.2104 0.9356 -2.36 0.01815 *
## indus -0.0898 0.3733 -0.24 0.80982
## chas 0.3581 0.2507 1.43 0.15323
## nox 5.7468 1.0122 5.68 1.4e-08 ***
## rm -0.1818 0.6197 -0.29 0.76927
## age 0.3849 0.3878 0.99 0.32094
## dis 1.5544 0.5447 2.85 0.00432 **
## rad 7.2335 1.6490 4.39 1.2e-05 ***
## tax -1.6351 0.6203 -2.64 0.00839 **
## ptratio 0.7714 0.3277 2.35 0.01856 *
## black -0.9310 0.5499 -1.69 0.09045 .
## lstat 0.4968 0.4119 1.21 0.22772
## medv 1.4587 0.8121 1.80 0.07247 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 526.79 on 379 degrees of freedom
## Residual deviance: 143.52 on 366 degrees of freedom
## AIC: 171.5
##
## Number of Fisher Scoring iterations: 9
plot(logit$finalModel)
confusionMatrix(predict(logit, test[,-14]), test$resp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 50 8
## Yes 13 55
##
## Accuracy : 0.833
## 95% CI : (0.757, 0.894)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 6.27e-15
##
## Kappa : 0.667
## Mcnemar's Test P-Value : 0.383
##
## Sensitivity : 0.794
## Specificity : 0.873
## Pos Pred Value : 0.862
## Neg Pred Value : 0.809
## Prevalence : 0.500
## Detection Rate : 0.397
## Detection Prevalence : 0.460
## Balanced Accuracy : 0.833
##
## 'Positive' Class : No
##
Most of these variables are not significant in the model. Let's refit using only the two most significant predictors, rad and nox.
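As a quick check, we can pull the terms that clear the 5% level straight out of the fitted model (a small sketch; the intercept appears in the list as well):
# Coefficient table from the underlying glm fit
coefs <- coef(summary(logit$finalModel))
# Names of terms significant at the 5% level
rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]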
library(car)
logit <- train(resp ~ rad + nox,
data=train,
method='glm', family=binomial(link='logit'),
preProcess=c('scale', 'center'))
summary(logit)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9882 -0.3293 -0.0245 0.0038 2.5937
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.922 0.607 4.81 1.5e-06 ***
## rad 4.903 1.017 4.82 1.4e-06 ***
## nox 3.282 0.429 7.66 1.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 526.79 on 379 degrees of freedom
## Residual deviance: 188.13 on 377 degrees of freedom
## AIC: 194.1
##
## Number of Fisher Scoring iterations: 8
plot(logit$finalModel, which=1)
vif(logit$finalModel) # variance inflation factors near 1 indicate little collinearity between rad and nox
## rad nox
## 1.136 1.136
confusionMatrix(predict(logit, test[,-14]), test$resp)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 53 14
## Yes 10 49
##
## Accuracy : 0.81
## 95% CI : (0.73, 0.874)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 6.07e-13
##
## Kappa : 0.619
## Mcnemar's Test P-Value : 0.54
##
## Sensitivity : 0.841
## Specificity : 0.778
## Pos Pred Value : 0.791
## Neg Pred Value : 0.831
## Prevalence : 0.500
## Detection Rate : 0.421
## Detection Prevalence : 0.532
## Balanced Accuracy : 0.810
##
## 'Positive' Class : No
##
LDA <- train(resp~., data=train_cor,
method='lda',
preProcess=c('scale', 'center'))
LDA
## Linear Discriminant Analysis
##
## 380 samples
## 11 predictors
## 2 classes: 'No', 'Yes'
##
## Pre-processing: scaled, centered
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.9 0.7 0.02 0.04
##
##
confusionMatrix(test_cor$resp, predict(LDA, test_cor[,-12]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 56 7
## Yes 16 47
##
## Accuracy : 0.817
## 95% CI : (0.739, 0.881)
## No Information Rate : 0.571
## P-Value [Acc > NIR] : 4.15e-09
##
## Kappa : 0.635
## Mcnemar's Test P-Value : 0.0953
##
## Sensitivity : 0.778
## Specificity : 0.870
## Pos Pred Value : 0.889
## Neg Pred Value : 0.746
## Prevalence : 0.571
## Detection Rate : 0.444
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.824
##
## 'Positive' Class : No
##
QDA <- train(resp~., data=train_cor,
method='qda',
preProcess=c('scale', 'center'))
QDA
## Quadratic Discriminant Analysis
##
## 380 samples
## 11 predictors
## 2 classes: 'No', 'Yes'
##
## Pre-processing: scaled, centered
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.9 0.7 0.03 0.05
##
##
confusionMatrix(test_cor$resp, predict(QDA, test_cor[,-12]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 56 7
## Yes 16 47
##
## Accuracy : 0.817
## 95% CI : (0.739, 0.881)
## No Information Rate : 0.571
## P-Value [Acc > NIR] : 4.15e-09
##
## Kappa : 0.635
## Mcnemar's Test P-Value : 0.0953
##
## Sensitivity : 0.778
## Specificity : 0.870
## Pos Pred Value : 0.889
## Neg Pred Value : 0.746
## Prevalence : 0.571
## Detection Rate : 0.444
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.824
##
## 'Positive' Class : No
##
The QDA resampling results are essentially the same as LDA's; neither clearly beats the full logistic regression model. Finally, let's try a random forest. With no method specified, caret defaults to one, and registering a doMC backend runs the bootstrap resamples in parallel across six cores.
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(6)
fun <- train(y=train$resp, x=train[,-14])
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
fun
## Random Forest
##
## 380 samples
## 13 predictors
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 1 0.9 0.01 0.02
## 7 1 0.9 0.01 0.02
## 10 1 0.9 0.01 0.03
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.
confusionMatrix(test$resp, predict(fun, test[,-14]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 59 4
## Yes 1 62
##
## Accuracy : 0.96
## 95% CI : (0.91, 0.987)
## No Information Rate : 0.524
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.921
## Mcnemar's Test P-Value : 0.371
##
## Sensitivity : 0.983
## Specificity : 0.939
## Pos Pred Value : 0.937
## Neg Pred Value : 0.984
## Prevalence : 0.476
## Detection Rate : 0.468
## Detection Prevalence : 0.500
## Balanced Accuracy : 0.961
##
## 'Positive' Class : No
##
The random forest is the strongest model here, with about 96% accuracy on the test set. One benefit of logistic regression, though, is that we can see which variables matter most to the model and thereby gain insight into the real-world phenomenon. Although it isn't the best-performing model, we will use the simpler two-predictor logistic fit as an example of how to interpret the results.
# Getting odds ratio and 95% conf intervals
exp(cbind(OR = coef(logit$finalModel), confint(logit$finalModel)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 18.58 5.917 64.78
## rad 134.63 20.001 1099.06
## nox 26.64 12.277 66.69
From these results we can see that accessibility to radial highways (rad) is the dominant predictor of high crime: a one-unit increase in rad raises the odds of a suburb having above-median crime by a factor of about 135, and a one-unit increase in NOx concentration raises them by a factor of about 27. Keep in mind that the predictors were centered and scaled before fitting, so "one unit" here means one standard deviation of the original variable.
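A minimal sketch of converting these standardized odds ratios to a per-raw-unit scale, dividing each coefficient by the training-set standard deviation before exponentiating:
# Coefficients are on the standardized scale; rescale to original units
exp(coef(logit$finalModel)["rad"] / sd(train$rad))
exp(coef(logit$finalModel)["nox"] / sd(train$nox))
With that caveat in mind, we can see very clearly what is happening if we plot the data.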
attach(Boston)
## The following objects are masked from Boston (position 10):
##
## age, black, chas, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
plot(Boston$resp, rad)
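The analogous boxplot for NOx concentration can be drawn the same way (a quick sketch):
# Boxplot of nox by crime class; nox had a strongly positive coefficient in the logistic model
plot(Boston$resp, nox, xlab = "crime above median", ylab = "nox")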