Question 10: This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
Load the data set and packages
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
library(MASS)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
library(class)
data(Weekly)
head(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly)
correlation<-cor(Weekly[-9])
correlation
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
corrplot(correlation, type = 'upper')
There are weak correlations between the variables except Year and Volume.
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
With a .05 significance value, only the lag2 variable is significant. All other variables have p-value greater than .05 and are not significant
predicted <- factor(ifelse(predict(glm.fit, type = "response") < 0.5, "Down", "Up"))
confusionMatrix(predicted, Weekly$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 54 48
## Up 430 557
##
## Accuracy : 0.5611
## 95% CI : (0.531, 0.5908)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.369
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9207
## Specificity : 0.1116
## Pos Pred Value : 0.5643
## Neg Pred Value : 0.5294
## Prevalence : 0.5556
## Detection Rate : 0.5115
## Detection Prevalence : 0.9063
## Balanced Accuracy : 0.5161
##
## 'Positive' Class : Up
##
Based on the confusion matrix, Accuracy = 56.11%. The model has a Sensitivity of 92.07% and 11.16% of Specificity. From these percentages, we can see that the prediction is correct 92% of the time, when the market is going Up. However, when the market is going Down, the model is only correct 11.16% of the time. An ideal model would have a high rate of Sensitivity and Specificity.
train <- Weekly[Weekly$Year <= 2008, ]
test <- Weekly[Weekly$Year > 2008, ]
glm.fit2 <- glm(Direction ~ Lag2,
data = train,
family = "binomial")
predicted <- factor(ifelse(predict(glm.fit2, newdata = test, type = "response") < 0.5, "Down", "Up"))
confusionMatrix(predicted, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
Accuracy is not 62.5% when all variables except lag2 are eliminated Sensitivity = 91.08% which has decreased from last model Specificity = 20.93% which has increased from last model
lda.fit <- lda(Direction ~ Lag2, data = train)
predicted_lda <- predict(lda.fit, newdata = test)
confusionMatrix(data = predicted_lda$class,
reference = test$Direction,
positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
The Accuracy score is still at 62.5% using the LDA model. The Sensitivity score is still at 91.80%, and the Specificity score remained at 20.93%.
qda.fit <- qda(Direction ~ Lag2, data = train)
predicted_qda <- predict(qda.fit, newdata = test)
confusionMatrix(data = predicted_qda$class,
reference = test$Direction,
positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.5865
## Neg Pred Value : NaN
## Prevalence : 0.5865
## Detection Rate : 0.5865
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Up
##
Accuracy has dropped to 58.65% The model is only predicting when market is Up, hence sensitivity 100% and specificity %
train.X <- cbind(train$Lag2)
test.X <- cbind(test$Lag2)
train.Direction <- train$Direction
set.seed(1)
knn.fit<- knn(train.X, test.X, train.Direction, k = 1)
confusionMatrix(knn.fit, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 30
## Up 22 31
##
## Accuracy : 0.5
## 95% CI : (0.4003, 0.5997)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9700
##
## Kappa : -0.0033
##
## Mcnemar's Test P-Value : 0.3317
##
## Sensitivity : 0.5082
## Specificity : 0.4884
## Pos Pred Value : 0.5849
## Neg Pred Value : 0.4118
## Prevalence : 0.5865
## Detection Rate : 0.2981
## Detection Prevalence : 0.5096
## Balanced Accuracy : 0.4983
##
## 'Positive' Class : Up
##
The model accuracy for KNN is 50%. The Sensitivity is 50.82% and Specificity is 48.84%, which is almost even for both market directions.
Using the target metric as the Accuracy, LDA and Logistic Regression have the highest accuracy scores of 62.5%
KNN - selecting best K
knn.fit2 <- knn(train.X, test.X, train.Direction, k = 10)
confusionMatrix(knn.fit2, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 17 18
## Up 26 43
##
## Accuracy : 0.5769
## 95% CI : (0.4761, 0.6732)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.6193
##
## Kappa : 0.1031
##
## Mcnemar's Test P-Value : 0.2913
##
## Sensitivity : 0.7049
## Specificity : 0.3953
## Pos Pred Value : 0.6232
## Neg Pred Value : 0.4857
## Prevalence : 0.5865
## Detection Rate : 0.4135
## Detection Prevalence : 0.6635
## Balanced Accuracy : 0.5501
##
## 'Positive' Class : Up
##
KNN using K = 10 Accuracy = 59.73% Sensitivity = 68.85% Specificity = 39.53% Accuracy increased compared to when using K = 1, but logistic and lda still have a better accuracy
knn.fit3 <- knn(train.X, test.X, train.Direction, k = 50)
confusionMatrix(knn.fit3, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 20 22
## Up 23 39
##
## Accuracy : 0.5673
## 95% CI : (0.4665, 0.6641)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.6921
##
## Kappa : 0.1048
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.6393
## Specificity : 0.4651
## Pos Pred Value : 0.6290
## Neg Pred Value : 0.4762
## Prevalence : 0.5865
## Detection Rate : 0.3750
## Detection Prevalence : 0.5962
## Balanced Accuracy : 0.5522
##
## 'Positive' Class : Up
##
KNN using K = 50 Accuracy = 56.73% Sensitivity = 63.93% Specificity = 46.51% Accuracy increased compared to when using K = 1, but logistic and lda still have a better accuracy
knn.fit4 <- knn(train.X, test.X, train.Direction, k = 100)
confusionMatrix(knn.fit4, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 12
## Up 34 49
##
## Accuracy : 0.5577
## 95% CI : (0.457, 0.655)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.75792
##
## Kappa : 0.0136
##
## Mcnemar's Test P-Value : 0.00196
##
## Sensitivity : 0.8033
## Specificity : 0.2093
## Pos Pred Value : 0.5904
## Neg Pred Value : 0.4286
## Prevalence : 0.5865
## Detection Rate : 0.4712
## Detection Prevalence : 0.7981
## Balanced Accuracy : 0.5063
##
## 'Positive' Class : Up
##
KNN using K = 100 Accuracy = 55.77% Sensitivity = 78.69% Specificity = 23.26% Accuracy increased compared to when using K = 1, but logistic and lda still have a better accuracy
Logistic Regression using Lag1 and Lag2 and their Interaction.
glm.fit3<-glm(Direction ~ Lag2+Lag1 + Lag1*Lag2, data=train, family=binomial)
glm.fit3.probs<-predict(glm.fit3, newdata = test, type = 'response')
glm.fit3.preds<- rep('Down', length(glm.fit3.probs))
glm.fit3.preds[glm.fit3.probs>0.5]='Up'
confusionMatrix(as.factor(glm.fit3.preds), test$Direction, positive = 'Up')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 7 8
## Up 36 53
##
## Accuracy : 0.5769
## 95% CI : (0.4761, 0.6732)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.6193
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : 4.693e-05
##
## Sensitivity : 0.8689
## Specificity : 0.1628
## Pos Pred Value : 0.5955
## Neg Pred Value : 0.4667
## Prevalence : 0.5865
## Detection Rate : 0.5096
## Detection Prevalence : 0.8558
## Balanced Accuracy : 0.5158
##
## 'Positive' Class : Up
##
This model with interaction has accuracy = 57.69% which is lower than accuracy score of model with just lag Sensitivity and specificity are also lower at 86.89% and 16.28%
LDA with Lag1 and Lag2 Variables
lda.fit2 <- lda(Direction ~ Lag1 + Lag2, data = train)
predicted_lda2 <- predict(lda.fit2, newdata = test)
confusionMatrix(data = predicted_lda2$class,
reference = test$Direction,
positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 7 8
## Up 36 53
##
## Accuracy : 0.5769
## 95% CI : (0.4761, 0.6732)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.6193
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : 4.693e-05
##
## Sensitivity : 0.8689
## Specificity : 0.1628
## Pos Pred Value : 0.5955
## Neg Pred Value : 0.4667
## Prevalence : 0.5865
## Detection Rate : 0.5096
## Detection Prevalence : 0.8558
## Balanced Accuracy : 0.5158
##
## 'Positive' Class : Up
##
Including the Lag1 variable gives an Accuracy score of 57.69%, which is still lower than the original LDA model with an Accuracy score of 62.5%
QDA with Lag1 and Lag2 Variables and Their Interaction
qda.fit2 <- qda(Direction ~ Lag1 + Lag2 + Lag1*Lag2, data = train)
predicted_qda2 <- predict(qda.fit2, newdata = test)
confusionMatrix(data = predicted_qda2$class,
reference = test$Direction,
positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 23 36
## Up 20 25
##
## Accuracy : 0.4615
## 95% CI : (0.3633, 0.562)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.99616
##
## Kappa : -0.0524
##
## Mcnemar's Test P-Value : 0.04502
##
## Sensitivity : 0.4098
## Specificity : 0.5349
## Pos Pred Value : 0.5556
## Neg Pred Value : 0.3898
## Prevalence : 0.5865
## Detection Rate : 0.2404
## Detection Prevalence : 0.4327
## Balanced Accuracy : 0.4724
##
## 'Positive' Class : Up
##
Including the Lag1 variable and the interaction between Lag1 and Lag2 gives an even lower Accuracy score of 46.15%, which is still lower than the original QDA model with an Accuracy score of 58.65%. The Sensitivity = 40.98% is lower, however, with the Specificity = 53.49%, the model is also predicting when the market direction is Down, unlike the first QDA model.
The original LDA and Logistic Regression models with the Accuracy score of 62.5% are the best models, with target metric as accuracy
QUESTION 11: In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
attach(Auto)
## The following object is masked from package:ggplot2:
##
## mpg
mpg01<-rep(0, length(mpg))
mpg01[mpg > median(mpg)] = 1
Auto <- data.frame(Auto,mpg01)
pairs(Auto[ ,-9])
M<-cor(Auto[,-9])
corrplot(M, type="upper")
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration vs mpg01")
boxplot(year ~ mpg01, data = Auto, main = "Year vs mpg01")
Per the correlation plots, there is a positive correlation between displacement and horsepower, cylinder and horsepower, horsepower and weight, and displacement and weight, and cylinder and weight.
All four variables are inversely correlated with mpg 01.
cor(Auto[ ,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
The results above support our initial observation about which variables are correlated, with cylinders, displacement, horsepower, and weight having a negative correlation with mpg01 of over 0.77.
Train/Test Splot of 80/20
set.seed(1)
index <- sample(1:nrow(Auto), 0.8*nrow(Auto))
Auto.train<-Auto[index, ]
Auto.test<-Auto[-index, ]
set.seed(1)
autolda.fit <- lda(mpg01 ~ cylinders + displacement + weight + horsepower, data=Auto.train)
autolda.preds <- predict(autolda.fit, Auto.test)
autolda.class <- autolda.preds$class
confusionMatrix(autolda.class, as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 0
## 1 7 37
##
## Accuracy : 0.9114
## 95% CI : (0.8259, 0.9636)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 2.819e-13
##
## Kappa : 0.8241
##
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 1.0000
## Specificity : 0.8333
## Pos Pred Value : 0.8409
## Neg Pred Value : 1.0000
## Prevalence : 0.4684
## Detection Rate : 0.4684
## Detection Prevalence : 0.5570
## Balanced Accuracy : 0.9167
##
## 'Positive' Class : 1
##
mean(autolda.class!=Auto.test$mpg01)
## [1] 0.08860759
The test error rate for the LDA model is 8.86%.
autoqda.fit <- qda(mpg01 ~ cylinders + displacement + weight + horsepower, data=Auto.train)
autoqda.preds <- predict(autoqda.fit, Auto.test)
autoqda.class <- autoqda.preds$class
confusionMatrix(autoqda.class, as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 2
## 1 5 35
##
## Accuracy : 0.9114
## 95% CI : (0.8259, 0.9636)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 2.819e-13
##
## Kappa : 0.8229
##
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.9459
## Specificity : 0.8810
## Pos Pred Value : 0.8750
## Neg Pred Value : 0.9487
## Prevalence : 0.4684
## Detection Rate : 0.4430
## Detection Prevalence : 0.5063
## Balanced Accuracy : 0.9134
##
## 'Positive' Class : 1
##
mean(autoqda.class!=Auto.test$mpg01)
## [1] 0.08860759
The test error rate for the QDA model is 8.86%.
autoglm.fit <- glm(mpg01 ~ cylinders + displacement + weight + horsepower, data=Auto.train, family = binomial)
autoglm.preds <- predict(autoglm.fit, newdata = Auto.test, type = "response")
autoglm.class <- ifelse(autoglm.preds>0.5,1,0)
confusionMatrix(as.factor(autoglm.class), as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 1
## 1 4 36
##
## Accuracy : 0.9367
## 95% CI : (0.8584, 0.9791)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 2.725e-15
##
## Kappa : 0.8735
##
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.9730
## Specificity : 0.9048
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.9744
## Prevalence : 0.4684
## Detection Rate : 0.4557
## Detection Prevalence : 0.5063
## Balanced Accuracy : 0.9389
##
## 'Positive' Class : 1
##
mean(autoglm.class!=Auto.test$mpg01)
## [1] 0.06329114
The test error rate for the Logistic Regression model is 6.33%.
K = 1
trainX.Auto <- cbind(Auto.train$cylinders, Auto.train$displacement, Auto.train$weight, Auto.train$horsepower)
testX.Auto <- cbind(Auto.test$cylinders, Auto.test$displacement, Auto.test$weight, Auto.test$horsepower)
Auto.mpg01 <- Auto.train$mpg01
set.seed(1)
autoknn1<- knn(trainX.Auto, testX.Auto, Auto.mpg01, k=1)
confusionMatrix(autoknn1, as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 4
## 1 6 33
##
## Accuracy : 0.8734
## 95% CI : (0.7795, 0.9376)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 1.017e-10
##
## Kappa : 0.7466
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.8919
## Specificity : 0.8571
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.9000
## Prevalence : 0.4684
## Detection Rate : 0.4177
## Detection Prevalence : 0.4937
## Balanced Accuracy : 0.8745
##
## 'Positive' Class : 1
##
mean(autoknn1!=Auto.test$mpg01)
## [1] 0.1265823
Using a K=1, the test error is at 12.66% Accuracy = 87.34%.
K = 10
set.seed(1)
autoknn2<- knn(trainX.Auto, testX.Auto, Auto.mpg01, k=10)
confusionMatrix(autoknn2, as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 2
## 1 7 35
##
## Accuracy : 0.8861
## 95% CI : (0.7947, 0.9466)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 1.615e-11
##
## Kappa : 0.7731
##
## Mcnemar's Test P-Value : 0.1824
##
## Sensitivity : 0.9459
## Specificity : 0.8333
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.9459
## Prevalence : 0.4684
## Detection Rate : 0.4430
## Detection Prevalence : 0.5316
## Balanced Accuracy : 0.8896
##
## 'Positive' Class : 1
##
mean(autoknn2!=Auto.test$mpg01)
## [1] 0.1139241
Using a K=10, the test error is at 11.39%, Accuracy = 88.61%.
K = 80
set.seed(1)
autoknn3<- knn(trainX.Auto, testX.Auto, Auto.mpg01, k=80)
confusionMatrix(autoknn3, as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 4
## 1 7 33
##
## Accuracy : 0.8608
## 95% CI : (0.7645, 0.9284)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 5.745e-10
##
## Kappa : 0.7217
##
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.8919
## Specificity : 0.8333
## Pos Pred Value : 0.8250
## Neg Pred Value : 0.8974
## Prevalence : 0.4684
## Detection Rate : 0.4177
## Detection Prevalence : 0.5063
## Balanced Accuracy : 0.8626
##
## 'Positive' Class : 1
##
mean(autoknn3!=Auto.test$mpg01)
## [1] 0.1392405
Using a K=80, the test error is at 13.92%, Accuracy = 86.08%.
The k=10 model seems to perform the best, with the lowest test error and highest Accuracy score.
QUESTION 13: Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
attach(Boston)
#new column
crime01 <- rep(0, length(crim))
crime01[crim > median(crim)] = 1
Boston <- data.frame(Boston, crime01)
#create test and train
train = 1:(dim(Boston)[1]/2)
test = (dim(Boston)[1]/2 + 1):dim(Boston)[1]
Boston.train = Boston[train, ]
Boston.test = Boston[test, ]
crime01.test = crime01[test]
Logistic model
glm.crim.fit = glm(crime01 ~ . - crime01 - crim, data = Boston.train, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.crim01.probs <- predict(glm.crim.fit, Boston.test, type = 'response')
glm.crim01.preds <- rep(0, length(glm.crim01.probs))
glm.crim01.preds[glm.crim01.probs > 0.5] = 1
caret::confusionMatrix(as.factor(glm.crim01.preds), as.factor(Boston.test$crime01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 68 24
## 1 22 139
##
## Accuracy : 0.8182
## 95% CI : (0.765, 0.8637)
## No Information Rate : 0.6443
## P-Value [Acc > NIR] : 9.016e-10
##
## Kappa : 0.6053
##
## Mcnemar's Test P-Value : 0.8828
##
## Sensitivity : 0.8528
## Specificity : 0.7556
## Pos Pred Value : 0.8634
## Neg Pred Value : 0.7391
## Prevalence : 0.6443
## Detection Rate : 0.5494
## Detection Prevalence : 0.6364
## Balanced Accuracy : 0.8042
##
## 'Positive' Class : 1
##
mean(glm.crim01.preds != crime01.test)
## [1] 0.1818182
For a logistic model, the accuracy = 81.82% and the test error rate = 18.18%
LDA model
lda.crim01.fit <- lda(crime01 ~ . - crim, Boston.train)
lda.crim01.preds <- predict(lda.crim01.fit, Boston.test)
lda.crim01.class <- lda.crim01.preds$class
mean(lda.crim01.preds$class != crime01.test)
## [1] 0.1343874
mean(lda.crim01.preds$class == crime01.test)
## [1] 0.8656126
For an LDA model, the accuracy = 86.56% and the test error rate = 13.44%
KNN model k = 1
train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black,
lstat, medv)[train, ]
test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black,
lstat, medv)[test, ]
train.crime01 = crime01[train]
set.seed(1)
knn.pred = knn(train.X, test.X, train.crime01, k = 1)
caret::confusionMatrix(knn.pred, as.factor(Boston.test$crime01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 111
## 1 5 52
##
## Accuracy : 0.5415
## 95% CI : (0.4779, 0.6041)
## No Information Rate : 0.6443
## P-Value [Acc > NIR] : 0.9997
##
## Kappa : 0.2085
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.3190
## Specificity : 0.9444
## Pos Pred Value : 0.9123
## Neg Pred Value : 0.4337
## Prevalence : 0.6443
## Detection Rate : 0.2055
## Detection Prevalence : 0.2253
## Balanced Accuracy : 0.6317
##
## 'Positive' Class : 1
##
mean(knn.pred != crime01.test)
## [1] 0.458498
When k = 1, the model accuracy is 54.15% and the test error rate is 45.85%
k = 5
knn.pred = knn(train.X, test.X, train.crime01, k = 5)
caret::confusionMatrix(knn.pred, as.factor(Boston.test$crime01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 84 37
## 1 6 126
##
## Accuracy : 0.83
## 95% CI : (0.778, 0.8742)
## No Information Rate : 0.6443
## P-Value [Acc > NIR] : 5.088e-11
##
## Kappa : 0.6558
##
## Mcnemar's Test P-Value : 4.763e-06
##
## Sensitivity : 0.7730
## Specificity : 0.9333
## Pos Pred Value : 0.9545
## Neg Pred Value : 0.6942
## Prevalence : 0.6443
## Detection Rate : 0.4980
## Detection Prevalence : 0.5217
## Balanced Accuracy : 0.8532
##
## 'Positive' Class : 1
##
mean(knn.pred != crime01.test)
## [1] 0.1699605
When k = 5, the model accuracy is 83% and the test error rate is 17.00%
k = 7
knn.pred = knn(train.X, test.X, train.crime01, k = 7)
caret::confusionMatrix(knn.pred, as.factor(Boston.test$crime01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 83 22
## 1 7 141
##
## Accuracy : 0.8854
## 95% CI : (0.8395, 0.9219)
## No Information Rate : 0.6443
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7589
##
## Mcnemar's Test P-Value : 0.00933
##
## Sensitivity : 0.8650
## Specificity : 0.9222
## Pos Pred Value : 0.9527
## Neg Pred Value : 0.7905
## Prevalence : 0.6443
## Detection Rate : 0.5573
## Detection Prevalence : 0.5850
## Balanced Accuracy : 0.8936
##
## 'Positive' Class : 1
##
mean(knn.pred != crime01.test)
## [1] 0.1146245
When k = 7, the model accuracy is 88.54% and the test error rate is 11.46%