This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
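The analysis below relies on several packages whose library calls are not shown in the original write-up; the following setup chunk is a sketch of my assumed environment, inferred from the functions used later (the Boston data used in the last problem includes the black column, which is the MASS version of the data set).
# Packages assumed to be attached throughout (this setup chunk is an addition, not original code)
library(ISLR2)  # Weekly and Auto data sets
library(MASS)   # lda(), qda(), stepAIC(); its Boston data (with the black column) masks ISLR2's
library(class)  # knn()
library(e1071)  # naiveBayes()
library(caret)  # confusionMatrix(), createDataPartition()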
weekly <- Weekly
summary(weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(weekly, col=weekly$Direction)
Looking at the summary statistics, Lag1 through Lag5 are very similar, with nearly identical values for each individual statistic. There also appear to be more weeks with an upward direction, which is consistent with the positive mean and median of Today.
When we examine the pairwise plots we see a strong relationship between Direction and Today: high values of Today correspond to Up and low values to Down, which reflects the fact that Direction is derived from Today. The other strong relationship is between Year and Volume; trading volume tends to increase over time.
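As a quick numeric check on the Volume trend (this snippet is an addition, not part of the original analysis), the correlation between Year and Volume can be computed directly:
# strength of the upward trend in trading volume over the years
cor(weekly$Year, weekly$Volume)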
weekly_predictors <- weekly[-c(1,8)] # drop Year and Today, keeping Lag1-Lag5, Volume, and Direction
# up is 1, down is 0
weekly_predictors$Direction <- (ifelse(weekly_predictors$Direction == 'Up', 1, 0))
logistic_b <- glm(Direction ~ ., data=weekly_predictors, family = 'binomial')
summary(logistic_b)
##
## Call:
## glm(formula = Direction ~ ., family = "binomial", data = weekly_predictors)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Using logistic regression with all five lag variables and Volume, the only statistically significant predictor is Lag2 (p ≈ 0.03).
# confusion matrix
predicted_2c <- predict(logistic_b, weekly, type='response')
predicted_2c <- ifelse(predicted_2c >= 0.5, 1, 0)
confusionMatrix(as.factor(predicted_2c), as.factor(weekly_predictors$Direction), mode='everything',positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 54 48
## 1 430 557
##
## Accuracy : 0.5611
## 95% CI : (0.531, 0.5908)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.369
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9207
## Specificity : 0.1116
## Pos Pred Value : 0.5643
## Neg Pred Value : 0.5294
## Precision : 0.5643
## Recall : 0.9207
## F1 : 0.6997
## Prevalence : 0.5556
## Detection Rate : 0.5115
## Detection Prevalence : 0.9063
## Balanced Accuracy : 0.5161
##
## 'Positive' Class : 1
##
The sensitivity is 92% and the specificity is 11%: the model detects true positives (weeks when the market moves up) 92% of the time and true negatives (weeks when it moves down) only 11% of the time. The confusion matrix also gives a positive predictive value of 56% and a negative predictive value of 53%. The F1 score is 0.70 and the overall accuracy is 0.56.
This tells us the model is much better at predicting up weeks than down weeks. However, that is somewhat deceptive: the positive predictive value is a meager 56%, meaning only 56% of its 'Up' predictions are correct. The confusion matrix shows why: roughly 90% of the model's predictions are 'Up', while the market is actually up only about 56% of the time. In other words, the model predicts an up market most of the time, which inflates its sensitivity but leaves the other metrics mediocre.
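To make the connection between the confusion-matrix counts and these metrics explicit, they can be recomputed by hand from the table above (an illustrative calculation added here, not part of the original code):
# counts from the confusion matrix above (columns are the reference, rows the predictions)
TP <- 557; FN <- 48; TN <- 54; FP <- 430
TP / (TP + FN)                   # sensitivity / recall: 557/605 ~ 0.92
TN / (TN + FP)                   # specificity: 54/484 ~ 0.11
TP / (TP + FP)                   # positive predictive value: 557/987 ~ 0.56
(TP + TN) / (TP + TN + FP + FN)  # overall accuracy: 611/1089 ~ 0.56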
weekly_before_2008 = weekly[weekly$Year <= 2008,] # training period: 1990 through 2008
weekly_after_2008 = weekly[weekly$Year > 2008,] # held-out period: 2009 and 2010
logistic_d <- glm(Direction ~ Lag2, data=weekly_before_2008, family = 'binomial')
predicted_d <- predict(logistic_d, newdata=weekly_after_2008,type='response')
predicted_d <- ifelse(predicted_d >= 0.5, 'Up', 'Down')
# compute confusion matrix
cm.lr <- confusionMatrix(as.factor(predicted_d), as.factor(weekly_after_2008$Direction), mode='everything',positive='Up')
print(cm.lr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Precision : 0.6222
## Recall : 0.9180
## F1 : 0.7417
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
After training on the 1990 through 2008 data and then testing on 2009 and 2010, we obtain an accuracy (the fraction of correct predictions) of 62.5%, an improvement over what we found in part (c).
(e) Repeat (d) using LDA.
lda_fit <- lda(Direction ~ Lag2, data=weekly_before_2008)
predicted_lda <- predict(lda_fit, newdata=weekly_after_2008)
# compute confusion matrix
cm.lda <- confusionMatrix(as.factor(predicted_lda$class), as.factor(weekly_after_2008$Direction), mode='everything',positive='Up')
print(cm.lda)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Precision : 0.6222
## Recall : 0.9180
## F1 : 0.7417
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
Using LDA to predict for 2009 and 2010 gives us an accuracy of 62.5%. This performance is identical to logistic regression using Lag2.
qda_fit <- qda(Direction ~ Lag2, data=weekly_before_2008)
predicted_qda <- predict(qda_fit, newdata=weekly_after_2008)
# compute confusion matrix
cm.qda <- confusionMatrix(as.factor(predicted_qda$class), as.factor(weekly_after_2008$Direction), mode='everything',positive='Up')
print(cm.qda)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.5865
## Neg Pred Value : NaN
## Precision : 0.5865
## Recall : 1.0000
## F1 : 0.7394
## Prevalence : 0.5865
## Detection Rate : 0.5865
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Up
##
Using QDA we get an accuracy of 58.65%. The sensitivity of 100% is misleading, however, because the model predicts ‘Up’ for every observation; a classifier that always predicts the positive class trivially attains perfect sensitivity.
knn_predicted <- knn(train=weekly_before_2008[3], test=weekly_after_2008[3], cl = weekly_before_2008[,9], k=1) # column 3 is Lag2, column 9 is Direction
# compute confusion matrix
cm.knn <- confusionMatrix(as.factor(knn_predicted), as.factor(weekly_after_2008[,9]), mode='everything',positive='Up')
print(cm.knn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 29
## Up 22 32
##
## Accuracy : 0.5096
## 95% CI : (0.4097, 0.609)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9540
##
## Kappa : 0.0127
##
## Mcnemar's Test P-Value : 0.4008
##
## Sensitivity : 0.5246
## Specificity : 0.4884
## Pos Pred Value : 0.5926
## Neg Pred Value : 0.4200
## Precision : 0.5926
## Recall : 0.5246
## F1 : 0.5565
## Prevalence : 0.5865
## Detection Rate : 0.3077
## Detection Prevalence : 0.5192
## Balanced Accuracy : 0.5065
##
## 'Positive' Class : Up
##
Using KNN with k=1 we obtain an accuracy of 51%, making it the worst-performing model so far.
nb_fit <- naiveBayes(Direction ~ Lag2, data=weekly_before_2008)
pred_nb <- predict(nb_fit, newdata = weekly_after_2008, type = "raw")
pred_nb <- ifelse(pred_nb[,2] >= 0.5, 'Up', 'Down') # column 2 holds the posterior probability of 'Up'
cm.nb <- confusionMatrix(as.factor(pred_nb), as.factor(weekly_after_2008$Direction),mode='everything',positive='Up')
## Warning in confusionMatrix.default(as.factor(pred_nb),
## as.factor(weekly_after_2008$Direction), : Levels are not in the same order for
## reference and data. Refactoring data to match.
print(cm.nb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.5865
## Neg Pred Value : NaN
## Precision : 0.5865
## Recall : 1.0000
## F1 : 0.7394
## Prevalence : 0.5865
## Detection Rate : 0.5865
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Up
##
Naive Bayes gives an accuracy of 58.65%, the same as QDA. It also shows the same issue as QDA: the model attains a perfect sensitivity of 100% by predicting ‘Up’ for everything.
The best-performing models using Lag2 were LDA and logistic regression, both of which produced identical results with an overall accuracy of 62.5%.
We will now experiment with different predictor subsets for each model: Lag1 through Lag3, all five lag variables, and all five lag variables plus Volume. We will not investigate logistic regression further, since Lag2 was the only term shown to be significant. A sketch of the fitting code used for these comparisons is shown below.
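The code that produced the subset comparisons is not shown in the original; the following helper is a sketch of how the LDA, QDA, and naive Bayes fits for each subset could be generated (the function name and print formatting are my own assumptions):
# Sketch of the subset comparison; fit_fun is lda, qda, or naiveBayes
fit_and_report <- function(fit_fun, formula, label) {
  fit <- fit_fun(formula, data = weekly_before_2008)
  pred <- predict(fit, newdata = weekly_after_2008)
  # lda()/qda() return a list with a $class component; naiveBayes() returns the classes directly
  cls <- if (is.list(pred)) pred$class else pred
  cm <- confusionMatrix(as.factor(cls), as.factor(weekly_after_2008$Direction),
                        mode = 'everything', positive = 'Up')
  print(label); print(cm$table); print(cm$overall['Accuracy']); print(cm$byClass)
}
fit_and_report(lda, Direction ~ Lag1 + Lag2 + Lag3, 'LDA with lag 1, 2, and 3:')
fit_and_report(lda, Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, 'LDA with all lag:')
fit_and_report(lda, Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               'LDA with all lag and volume:')
# the same helper, with qda and naiveBayes as fit_fun, covers the QDA and NB comparisons further below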
## [1] LDA with lag 1, 2, and 3:
## Reference
## Prediction Down Up
## Down 8 9
## Up 35 52
## Accuracy
## 0.5769231
## Sensitivity Specificity Pos Pred Value
## 0.8524590 0.1860465 0.5977011
## Neg Pred Value Precision Recall
## 0.4705882 0.5977011 0.8524590
## F1 Prevalence Detection Rate
## 0.7027027 0.5865385 0.5000000
## Detection Prevalence Balanced Accuracy
## 0.8365385 0.5192528
## [1] LDA with all lag:
## Reference
## Prediction Down Up
## Down 9 13
## Up 34 48
## Accuracy
## 0.5480769
## Sensitivity Specificity Pos Pred Value
## 0.7868852 0.2093023 0.5853659
## Neg Pred Value Precision Recall
## 0.4090909 0.5853659 0.7868852
## F1 Prevalence Detection Rate
## 0.6713287 0.5865385 0.4615385
## Detection Prevalence Balanced Accuracy
## 0.7884615 0.4980938
## [1] LDA with all lag and volume:
## Reference
## Prediction Down Up
## Down 31 44
## Up 12 17
## Accuracy
## 0.4615385
## Sensitivity Specificity Pos Pred Value
## 0.2786885 0.7209302 0.5862069
## Neg Pred Value Precision Recall
## 0.4133333 0.5862069 0.2786885
## F1 Prevalence Detection Rate
## 0.3777778 0.5865385 0.1634615
## Detection Prevalence Balanced Accuracy
## 0.2788462 0.4998094
The best LDA performance comes from using only Lag2, with an accuracy of 62.5%, higher than any of the LDA models shown here. Performance worsens as more variables are added; the one exception is specificity, which rises to 72% with all predictors compared to 21% with Lag2 alone.
## [1] QDA with lag 1, 2, and 3:
## Reference
## Prediction Down Up
## Down 6 10
## Up 37 51
## Accuracy
## 0.5480769
## Sensitivity Specificity Pos Pred Value
## 0.8360656 0.1395349 0.5795455
## Neg Pred Value Precision Recall
## 0.3750000 0.5795455 0.8360656
## F1 Prevalence Detection Rate
## 0.6845638 0.5865385 0.4903846
## Detection Prevalence Balanced Accuracy
## 0.8461538 0.4878002
## [1] QDA with all lag:
## Reference
## Prediction Down Up
## Down 10 23
## Up 33 38
## Accuracy
## 0.4615385
## Sensitivity Specificity Pos Pred Value
## 0.6229508 0.2325581 0.5352113
## Neg Pred Value Precision Recall
## 0.3030303 0.5352113 0.6229508
## F1 Prevalence Detection Rate
## 0.5757576 0.5865385 0.3653846
## Detection Prevalence Balanced Accuracy
## 0.6826923 0.4277545
## [1] QDA with all lag and volume:
## Reference
## Prediction Down Up
## Down 33 49
## Up 10 12
## Accuracy
## 0.4326923
## Sensitivity Specificity Pos Pred Value
## 0.1967213 0.7674419 0.5454545
## Neg Pred Value Precision Recall
## 0.4024390 0.5454545 0.1967213
## F1 Prevalence Detection Rate
## 0.2891566 0.5865385 0.1153846
## Detection Prevalence Balanced Accuracy
## 0.2115385 0.4820816
Similarly, for QDA the overall accuracy decreases as terms are added, starting at 59% with Lag2 alone and falling to a low of 43%. The sensitivity likewise drops from 100% to a measly 20%, while, interestingly, the specificity rises from 0% to 76%.
Next we perform KNN on the same subsets, plus Lag2 alone, for k values of 1, 3, 5, 10, 15, 20, 25, and 30. To keep the output readable we compare only the overall accuracy of each fit, followed by a full confusion matrix for one of the stronger settings; a sketch of the loop follows.
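As with the previous comparison, the KNN loop itself is not shown in the original; this sketch shows one way the accuracies below could be produced (the helper name and the use of column names rather than indices are my own choices):
# Sketch of the KNN comparison over several k values and predictor subsets
run_knn <- function(cols, label) {
  print(label)
  for (k in c(1, 3, 5, 10, 15, 20, 25, 30)) {
    pred <- knn(train = weekly_before_2008[, cols, drop = FALSE],
                test  = weekly_after_2008[, cols, drop = FALSE],
                cl    = weekly_before_2008$Direction, k = k)
    cm <- confusionMatrix(pred, as.factor(weekly_after_2008$Direction), positive = 'Up')
    print(paste('For k =', k)); print(cm$overall['Accuracy'])
  }
}
run_knn('Lag2', 'KNN with lag2:')
run_knn(c('Lag1', 'Lag2', 'Lag3'), 'KNN with lag 1,2,3:')
run_knn(paste0('Lag', 1:5), 'KNN with all lag:')
run_knn(c(paste0('Lag', 1:5), 'Volume'), 'KNN with all lag, volume:')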
## [1] KNN with lag2:
## [1] For k=1
## Accuracy
## 0.5
## [1] For k=3
## Accuracy
## 0.5384615
## [1] For k=5
## Accuracy
## 0.5384615
## [1] For k=10
## Accuracy
## 0.5865385
## [1] For k=15
## Accuracy
## 0.5865385
## [1] For k=20
## Accuracy
## 0.5576923
## [1] For k=25
## Accuracy
## 0.5480769
## [1] For k=30
## Accuracy
## 0.5288462
## [1] KNN with lag 1,2,3:
## [1] For k=1
## Accuracy
## 0.4903846
## [1] For k=3
## Accuracy
## 0.5096154
## [1] For k=5
## Accuracy
## 0.5288462
## [1] For k=10
## Accuracy
## 0.5769231
## [1] For k=15
## Accuracy
## 0.5865385
## [1] For k=20
## Accuracy
## 0.6153846
## [1] For k=25
## Accuracy
## 0.6153846
## [1] For k=30
## Accuracy
## 0.6153846
## [1] KNN with all lag:
## [1] For k=1
## Accuracy
## 0.5192308
## [1] For k=3
## Accuracy
## 0.5576923
## [1] For k=5
## Accuracy
## 0.5480769
## [1] For k=10
## Accuracy
## 0.6153846
## [1] For k=15
## Accuracy
## 0.5384615
## [1] For k=20
## Accuracy
## 0.5673077
## [1] For k=25
## Accuracy
## 0.5
## [1] For k=30
## Accuracy
## 0.5096154
## [1] KNN with all lag, volume:
## [1] For k=1
## Accuracy
## 0.4807692
## [1] For k=3
## Accuracy
## 0.5096154
## [1] For k=5
## Accuracy
## 0.5
## [1] For k=10
## Accuracy
## 0.5192308
## [1] For k=15
## Accuracy
## 0.5576923
## [1] For k=20
## Accuracy
## 0.5192308
## [1] For k=25
## Accuracy
## 0.5480769
## [1] For k=30
## Accuracy
## 0.5480769
## [1] KNN with lag 1, 2, and 3, and k=25:
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 22 19
## Up 21 42
##
## Accuracy : 0.6154
## 95% CI : (0.5149, 0.7091)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.3110
##
## Kappa : 0.2015
##
## Mcnemar's Test P-Value : 0.8744
##
## Sensitivity : 0.6885
## Specificity : 0.5116
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.5366
## Precision : 0.6667
## Recall : 0.6885
## F1 : 0.6774
## Prevalence : 0.5865
## Detection Rate : 0.4038
## Detection Prevalence : 0.6058
## Balanced Accuracy : 0.6001
##
## 'Positive' Class : Up
##
k=10 appears to give the best performance overall, reaching a peak accuracy of about 61.5% when all five lag variables are used as predictors. Similar accuracy is reached by k=20, 25, and 30 with Lag1 through Lag3. Overall, adding predictors helps KNN up through the lag variables, but performance declines once Volume is included; the worst results come from all lag variables plus Volume. KNN is the only method that appears to improve with additional variables, yet even KNN loses accuracy when every predictor is used.
## [1] NB with lag 1,2,3:
## Reference
## Prediction Down Up
## Down 5 10
## Up 38 51
## Accuracy
## 0.5384615
## Sensitivity Specificity Pos Pred Value
## 0.8360656 0.1162791 0.5730337
## Neg Pred Value Precision Recall
## 0.3333333 0.5730337 0.8360656
## F1 Prevalence Detection Rate
## 0.6800000 0.5865385 0.4903846
## Detection Prevalence Balanced Accuracy
## 0.8557692 0.4761723
## [1] NB with all lag:
## Reference
## Prediction Down Up
## Down 10 21
## Up 33 40
## Accuracy
## 0.4807692
## Sensitivity Specificity Pos Pred Value
## 0.6557377 0.2325581 0.5479452
## Neg Pred Value Precision Recall
## 0.3225806 0.5479452 0.6557377
## F1 Prevalence Detection Rate
## 0.5970149 0.5865385 0.3846154
## Detection Prevalence Balanced Accuracy
## 0.7019231 0.4441479
## [1] NB with all lag, volume:
## Reference
## Prediction Down Up
## Down 42 56
## Up 1 5
## Accuracy
## 0.4519231
## Sensitivity Specificity Pos Pred Value
## 0.08196721 0.97674419 0.83333333
## Neg Pred Value Precision Recall
## 0.42857143 0.83333333 0.08196721
## F1 Prevalence Detection Rate
## 0.14925373 0.58653846 0.04807692
## Detection Prevalence Balanced Accuracy
## 0.05769231 0.52935570
Naive Bayes shows a pattern similar to LDA and QDA: accuracy steadily decreases from 59% with Lag2 alone down to 45% with all lag variables plus Volume. As with QDA, the initial sensitivity of 100% collapses, here to 8%, as the model stops predicting only the positive class, with a corresponding increase in specificity from 0% to 98%.
## [1] Logistic regression:
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Precision : 0.6222
## Recall : 0.9180
## F1 : 0.7417
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
## [1] LDA:
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Precision : 0.6222
## Recall : 0.9180
## F1 : 0.7417
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
After this further analysis, the best performance comes from logistic regression and LDA with Lag2 as the only predictor, both achieving an accuracy of 62.5%. Adding any extra predictors only decreased model performance.
KNN came closest to benefiting from extra variables, since it did improve somewhat as predictors were added (for example k=10 with all five lag variables), but it topped out at an accuracy of about 61.5%.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
auto <- Auto
auto$mpg01 <- ifelse(auto$mpg >= median(auto$mpg), 1, 0) # 1 if mpg is at or above its median, 0 otherwise
par(mfrow = c(2,2))
# plot mpg01 versus every other predictor
sapply(names(auto)[-(9:10)], function(col){
x <- auto[[col]]
plot(auto$mpg01, x, main = paste("mpg01 vs", col), xlab = "mpg01", ylab = col)
boxplot(x ~ auto$mpg01, main = paste("Boxplot of", col, "by mpg01"), xlab = "mpg01", ylab = col)
})
## mpg cylinders displacement horsepower weight acceleration
## stats numeric,10 numeric,10 numeric,10 numeric,10 numeric,10 numeric,10
## n numeric,2 numeric,2 numeric,2 numeric,2 numeric,2 numeric,2
## conf numeric,4 numeric,4 numeric,4 numeric,4 numeric,4 numeric,4
## out numeric,3 numeric,17 numeric,5 numeric,3 numeric,3 numeric,5
## group numeric,3 numeric,17 numeric,5 numeric,3 numeric,3 numeric,5
## names character,2 character,2 character,2 character,2 character,2 character,2
## year origin
## stats numeric,10 numeric,10
## n numeric,2 numeric,2
## conf numeric,4 numeric,4
## out numeric,0 numeric,23
## group numeric,0 numeric,23
## names character,2 character,2
Excluding mpg itself, the variables that appear most associated with mpg01 are cylinders, displacement, horsepower, weight, and origin.
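A quick numeric complement to the plots (added here as a sketch, not part of the original output) is the correlation of each numeric variable with mpg01:
# correlations with the binary mpg01 indicator (the name column is excluded as non-numeric)
round(cor(auto[, sapply(auto, is.numeric)])[, 'mpg01'], 2)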
set.seed(123)
trainIndex <- createDataPartition(auto$mpg01, p = 0.8, list = F) # 80% training and 20% test
train_auto <- auto[trainIndex,]
test_auto <- auto[-trainIndex,]
lda <- lda(mpg01 ~ cylinders + displacement + horsepower + weight + origin, data = train_auto)
lda_pred <- predict(lda, test_auto)$class
cm <- confusionMatrix(as.factor(test_auto$mpg01), lda_pred)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 31 8
## 1 1 38
##
## Accuracy : 0.8846
## 95% CI : (0.7922, 0.9459)
## No Information Rate : 0.5897
## P-Value [Acc > NIR] : 1.097e-08
##
## Kappa : 0.7692
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.9688
## Specificity : 0.8261
## Pos Pred Value : 0.7949
## Neg Pred Value : 0.9744
## Prevalence : 0.4103
## Detection Rate : 0.3974
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8974
##
## 'Positive' Class : 0
##
Using LDA, the test accuracy is 0.88 (a test error of about 0.12), with a sensitivity of 0.97 and a specificity of 0.83.
qda <- qda(mpg01 ~ cylinders + displacement + horsepower + weight + origin, data = train_auto)
qda_pred <- predict(qda, test_auto)$class
cm <- confusionMatrix(as.factor(test_auto$mpg01), qda_pred)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 33 6
## 1 2 37
##
## Accuracy : 0.8974
## 95% CI : (0.8079, 0.9547)
## No Information Rate : 0.5513
## P-Value [Acc > NIR] : 3.51e-11
##
## Kappa : 0.7949
##
## Mcnemar's Test P-Value : 0.2888
##
## Sensitivity : 0.9429
## Specificity : 0.8605
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.9487
## Prevalence : 0.4487
## Detection Rate : 0.4231
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9017
##
## 'Positive' Class : 0
##
QDA gives an accuracy of 0.90 (a test error of about 0.10), with a sensitivity of 0.94 and a specificity of 0.86. Compared to LDA it trades a little sensitivity for slightly higher accuracy and specificity.
lr <- glm(mpg01 ~ cylinders + displacement + horsepower + weight + origin, data = train_auto, family = 'binomial')
lr_pred <- predict(lr, test_auto, type = 'response') # predicted probabilities of mpg01 = 1
lr_pred <- ifelse(lr_pred >= 0.5, 1, 0)
cm <- confusionMatrix(as.factor(test_auto$mpg01), as.factor(lr_pred))
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 34 5
## 1 2 37
##
## Accuracy : 0.9103
## 95% CI : (0.8238, 0.9632)
## No Information Rate : 0.5385
## P-Value [Acc > NIR] : 1.083e-12
##
## Kappa : 0.8205
##
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.9444
## Specificity : 0.8810
## Pos Pred Value : 0.8718
## Neg Pred Value : 0.9487
## Prevalence : 0.4615
## Detection Rate : 0.4359
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9127
##
## 'Positive' Class : 0
##
Logistic regression gives an accuracy of 0.91 (a test error of about 0.09), with a sensitivity of 0.94 and a specificity of 0.88. This is the highest accuracy and specificity so far, with roughly the same sensitivity as QDA.
nb <- naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + origin, data = train_auto)
nb_pred <- predict(nb, test_auto)
cm <- confusionMatrix(as.factor(test_auto$mpg01), as.factor(nb_pred))
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 31 8
## 1 1 38
##
## Accuracy : 0.8846
## 95% CI : (0.7922, 0.9459)
## No Information Rate : 0.5897
## P-Value [Acc > NIR] : 1.097e-08
##
## Kappa : 0.7692
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.9688
## Specificity : 0.8261
## Pos Pred Value : 0.7949
## Neg Pred Value : 0.9744
## Prevalence : 0.4103
## Detection Rate : 0.3974
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8974
##
## 'Positive' Class : 0
##
Naive Bayes gives the exact same results as LDA, with an accuracy of 0.88, sensitivity of 0.97, and specificity of 0.83.
We will use the following variables for this problem:
cylinders + displacement + horsepower + weight + origin
set.seed(123)
# first select the values that are being tested
knn_train <- train_auto[,c(2:5, 8)]
knn_test <- test_auto[,c(2:5, 8)]
knn_train_cl <- train_auto$mpg01 # correct classifications for training
knn_test_cl <- test_auto$mpg01 # for testing
# from k = 1 to k = 20
k_list <- 1:20
knn_results <- lapply(k_list, function(k){
knn_pred <- knn(knn_train, knn_test, knn_train_cl, k = k)
knn_cm <- confusionMatrix(as.factor(knn_pred), as.factor(knn_test_cl))
return(list(k = k, cm = knn_cm))
})
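The accuracies discussed below can be pulled out of knn_results with a small extraction step that the original omits; a sketch:
# overall accuracy for each k, extracted from the stored confusion matrices
knn_acc <- sapply(knn_results, function(res) res$cm$overall['Accuracy'])
names(knn_acc) <- paste0('k=', k_list)
round(knn_acc, 3)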
The highest accuracies, around 0.91, occur at k=8 and k=12. For those two values we obtain very similar sensitivities of about 0.85 and specificities of about 0.98. With these k values, KNN performs much like logistic regression, with higher specificity and lower sensitivity.
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.
Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
set.seed(123)
# creating the response variable
med.crim <- median(Boston$crim)
# 0 for lower crim than median, 1 for higher
is.high.crime <- ifelse(Boston$crim > med.crim, 1, 0) # the median itself counts as low crime
boston.crime <- Boston[-1] # remove crim from dataset
boston.crime$high.crime <- is.high.crime
After creating the response variable, we next decide which predictor subsets to use. First, we use stepwise selection with stepAIC to find an optimal logistic regression model.
# logistic regression
full.model <- glm(high.crime~. , data=boston.crime, family='binomial')
null.model <- glm(high.crime~1, data=boston.crime, family='binomial')
model.selection <- stepAIC(null.model, direction = 'both', scope=list(lower=null.model, upper=full.model), trace = F)
summary(model.selection)
##
## Call:
## glm(formula = high.crime ~ nox + rad + tax + zn + black + dis +
## ptratio + medv + age, family = "binomial", data = boston.crime)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -31.441272 6.048989 -5.198 2.02e-07 ***
## nox 43.195824 6.452812 6.694 2.17e-11 ***
## rad 0.718773 0.142066 5.059 4.21e-07 ***
## tax -0.007676 0.002503 -3.066 0.00217 **
## zn -0.082567 0.031424 -2.628 0.00860 **
## black -0.012866 0.006334 -2.031 0.04224 *
## dis 0.634380 0.207634 3.055 0.00225 **
## ptratio 0.303502 0.109255 2.778 0.00547 **
## medv 0.112882 0.034362 3.285 0.00102 **
## age 0.022851 0.009894 2.310 0.02091 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 701.46 on 505 degrees of freedom
## Residual deviance: 216.22 on 496 degrees of freedom
## AIC: 236.22
##
## Number of Fisher Scoring iterations: 9
As we see, the selected logistic regression model uses the following predictors: nox + rad + tax + zn + black + dis + ptratio + medv + age.
For the subsets used to predict whether a census tract has an above-median crime rate, we add these predictors in that order: first nox alone, then nox + rad, then nox + rad + tax, and so on. We do this for LR, LDA, NB, and KNN (with k values of 1, 3, 5, 10, 15, 20, 25, and 30). The metrics used for comparison are accuracy, sensitivity, specificity, and F1 score, which we compare visually at the end.
# creating test and training dataset
tr.size = floor(0.7*nrow(boston.crime))
tr.ind = sample(seq_len(nrow(boston.crime)),tr.size)
boston.tr = boston.crime[tr.ind,]
# table(boston.tr$high.crime)
boston.te = boston.crime[-tr.ind,] # all rows not sampled for training
# table(boston.te$high.crime)
Now that we have the desired subsets and have split the data into the testing and training sets, we can proceed with fitting the models.
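The fitting loop itself is not shown in the original; the sketch below illustrates one way the nested predictor subsets could be evaluated for logistic regression, LDA, naive Bayes, and KNN (the helper name, the use of k = 1 only, and the accuracy-only summary are simplifications of my own):
# Sketch: evaluate nested predictor subsets for LR, LDA, NB, and KNN (k = 1 shown as an example)
pred.order <- c('nox', 'rad', 'tax', 'zn', 'black', 'dis', 'ptratio', 'medv', 'age')
eval_subset <- function(n.pred) {
  vars <- pred.order[1:n.pred]
  f <- reformulate(vars, response = 'high.crime')
  tr <- transform(boston.tr, high.crime = factor(high.crime))
  te <- transform(boston.te, high.crime = factor(high.crime))
  truth <- te$high.crime
  # logistic regression, thresholded at a predicted probability of 0.5
  lr <- glm(f, data = tr, family = 'binomial')
  lr.pred <- factor(ifelse(predict(lr, te, type = 'response') >= 0.5, 1, 0),
                    levels = levels(truth))
  # LDA and naive Bayes on the same formula
  lda.pred <- predict(lda(f, data = tr), te)$class
  nb.pred  <- predict(naiveBayes(f, data = tr), te)
  # KNN on the raw predictor columns
  knn.pred <- knn(tr[, vars, drop = FALSE], te[, vars, drop = FALSE],
                  cl = tr$high.crime, k = 1)
  sapply(list(LR = lr.pred, LDA = lda.pred, NB = nb.pred, KNN1 = knn.pred),
         function(p) confusionMatrix(p, truth, positive = '1')$overall['Accuracy'])
}
# accuracy of each model family as the stepwise-selected predictors are added one at a time
sapply(seq_along(pred.order), eval_subset)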
In terms of sensitivity, the KNN models performed noticeably better than the other model families; the closest competitor was logistic regression, which still trailed most of the KNN fits. Among the KNN models, k=1 was the best performer across all numbers of predictors, with a maximum sensitivity of around 93%.
For specificity, KNN again has the best overall performance, particularly k=1 with 2 to 4 predictors at an estimated specificity of 98%. The k=3 model comes close at 2, 4, and 8 predictors.
For accuracy, the KNN models again do better, with the best model (k=1) reaching about 96% with 2 to 4 predictors. Logistic regression also performs well and approaches k=1 at 8 predictors with an accuracy just below 93%, but it is ultimately outperformed.
The F1 score tells much the same story as accuracy: KNN with k=1 performs best at 4 predictors with an F1 of around 97%, with logistic regression close behind at 8 predictors.
Overall, the best-performing model was KNN with k=1 and 4 predictors (nox + rad + tax + zn), which outperformed every other model on all metrics. There is some concern about overfitting, however, since no separate validation set was held out; the same test set was used to compare every configuration.
Although logistic regression was outperformed by KNN with k=1, the stepwise-selected model with 8 predictors still performed well on every metric, roughly 5 to 7 percentage points behind the best k=1 model. Given the likelihood that the KNN model is overfitting, logistic regression would be a solid contender for the best model.