This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
glimpse(Weekly)
## Rows: 1,089
## Columns: 9
## $ Year <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 199...
## $ Lag1 <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807...
## $ Lag2 <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372...
## $ Lag3 <dbl> -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178...
## $ Lag4 <dbl> -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.71...
## $ Lag5 <dbl> -3.484, -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.5...
## $ Volume <dbl> 0.1549760, 0.1485740, 0.1598375, 0.1616300, 0.1537280, 0....
## $ Today <dbl> -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0.041...
## $ Direction <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up,...
Q: Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
A:
An initial glance at the numeric variables of the dataset:
pairs(Weekly[ ,-9])
abs(cor(Weekly[ ,-9]))
## Year Lag1 Lag2 Lag3 Lag4 Lag5
## Year 1.00000000 0.032289274 0.03339001 0.03000649 0.031127923 0.030519101
## Lag1 0.03228927 1.000000000 0.07485305 0.05863568 0.071273876 0.008183096
## Lag2 0.03339001 0.074853051 1.00000000 0.07572091 0.058381535 0.072499482
## Lag3 0.03000649 0.058635682 0.07572091 1.00000000 0.075395865 0.060657175
## Lag4 0.03112792 0.071273876 0.05838153 0.07539587 1.000000000 0.075675027
## Lag5 0.03051910 0.008183096 0.07249948 0.06065717 0.075675027 1.000000000
## Volume 0.84194162 0.064951313 0.08551314 0.06928771 0.061074617 0.058517414
## Today 0.03245989 0.075031842 0.05916672 0.07124364 0.007825873 0.011012698
## Volume Today
## Year 0.84194162 0.032459894
## Lag1 0.06495131 0.075031842
## Lag2 0.08551314 0.059166717
## Lag3 0.06928771 0.071243639
## Lag4 0.06107462 0.007825873
## Lag5 0.05851741 0.011012698
## Volume 1.00000000 0.033077783
## Today 0.03307778 1.000000000
As we would expect with stock market data, there are no obvious strong relationships between the Lag variables. However, there do appear to be some interesting trends over time. I create the Week variable below, allowing for easier plotting of trends, since there is a chronology to the rows that is not shown fully through the Year variable.
I first do a quick sense-check that the rows are in the correct order, based on the definition of Today and Lag1:
Weekly %>%
filter(lead(Lag1) != Today) %>%
nrow()
## [1] 0
Since there are no rows out of order, the dataset appears to be correctly ordered in ascending weeks, so I create Week (basically a row counter):
Weekly$Week <- 1:nrow(Weekly)
Looking at Volume over time, there has been a significant increase in the volume of shares traded since the 90’s. This appears to have peaked around 2009, starting to decrease in 2010. it would be interesting to see the S&P 500 stats since then.
year_breaks <- Weekly %>%
group_by(Year) %>%
summarize(Week = min(Week))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(Weekly, aes(x = Week, y = Volume)) +
geom_line() +
geom_smooth() +
scale_x_continuous(breaks = year_breaks$Week, minor_breaks = NULL, labels = year_breaks$Year) +
labs(title = "Average Daily Shares Traded vs Time",
x = "Time") +
theme_light()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Here is Direction over time, which is less interesting. There appear to only be 4 years in which >= 50% of the weeks didn’t see a positive return (2000, 2001, 2002, 2008).
ggplot(Weekly, aes(x = Year, fill = Direction)) +
geom_bar(position = "fill") +
geom_hline(yintercept = 0.5, col = "grey45") +
scale_x_continuous(breaks = seq(1990, 2010), minor_breaks = NULL) +
scale_y_continuous(labels = scales::percent_format()) +
theme_light() +
theme(axis.title.y = element_blank(),
legend.position = "bottom") +
ggtitle("% of Up/Down Weeks vs Time")
The split of the weeks into Down & Up can be seen in the table below. We could get a classifier with 55.56% accuracy simply by predicting the S&P 500 return will be positive every week.
prop.table(table(Weekly$Direction))
##
## Down Up
## 0.4444444 0.5555556
We can also see that the market seems to go through periods of higher variance/instability. Crashes (e.g. Sept. 2008) stand out here.
ggplot(Weekly, aes(x = Week, y = Today / 100)) +
geom_line() +
scale_x_continuous(breaks = year_breaks$Week, minor_breaks = NULL, labels = year_breaks$Year) +
scale_y_continuous(labels = scales::percent_format(), breaks = seq(-0.2, 0.2, 0.05)) +
geom_hline(yintercept = 0, col = "grey55") +
theme_light() +
labs(title = "Weekly Percentage Return vs Time",
x = "Time",
y = "Percentage Return")
Q: Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
A:
Lag2 appears to be the only statistically significant predictor:
glm_dir <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly,
family = "binomial")
summary(glm_dir)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Note that the output from the Z-statistic (z value) is calculated the same as with the T-test in linear regression: Z=βjSE(βj), and a large absolute value indicates evidence against the null hypothesis H0:βj=0 (again, the same).
Q: Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
A:
The confusion matrix is shown below. I use caret::confusionMatrix(), as it has various confusion matrix statistics built in to its output.
predicted <- factor(ifelse(predict(glm_dir, type = "response") < 0.5, "Down", "Up"))
confusionMatrix(predicted, Weekly$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 54 48
## Up 430 557
##
## Accuracy : 0.5611
## 95% CI : (0.531, 0.5908)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.369
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9207
## Specificity : 0.1116
## Pos Pred Value : 0.5643
## Neg Pred Value : 0.5294
## Prevalence : 0.5556
## Detection Rate : 0.5115
## Detection Prevalence : 0.9063
## Balanced Accuracy : 0.5161
##
## 'Positive' Class : Up
##
This is just over the baseline accuracy (55.56%) achieved by a naive classifier (that predicts Up every time). In fact, this is almost the strategy of the logistic regression model:
prop.table(table(predicted))
## predicted
## Down Up
## 0.09366391 0.90633609
This is reflected in the very poor specificity (it does not predict the negative class well).
Note also that we are dealing with training accuracy here, so this marginal accuracy improvement over the baseline is not interesting.
Q: Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
A:
I create train and test, corresponding to the two time periods given in the question.
The confusion matrix is for the test predictions this time.
train <- Weekly[Weekly$Year <= 2008, ]
test <- Weekly[Weekly$Year > 2008, ]
glm_dir <- glm(Direction ~ Lag2,
data = train,
family = "binomial")
predicted <- factor(ifelse(predict(glm_dir, newdata = test, type = "response") < 0.5, "Down", "Up"))
confusionMatrix(predicted, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
Here we get an Accuracy of 0.625.
The confusionMatrix() function provides a lot of other useful statistics. For example, No Information Rate : 0.5865 tells us that the largest class (Up) is 58.65% of the test observations, and hence this is our baseline for a classifier.
Clearly Accuracy : 0.625 > 0.5865, which is positive. However, our test dataset is relatively small so this might not be a meaningful improvement.
We are provided with a p-value for a one-sided test to see whether the accuracy is better than the “no information rate”. P-Value [Acc > NIR] : 0.2439 > 0.05 ⟹ no significant evidence that our classifier is better than the baseline strategy. Predicting stock movements is hard - who would’ve thought?
Q: Repeat (d) using LDA.
A:
lda_dir <- lda(Direction ~ Lag2, data = train)
predicted_lda <- predict(lda_dir, newdata = test)
confusionMatrix(data = predicted_lda$class,
reference = test$Direction,
positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
Here we get an Accuracy of 0.625. Note that, as before, we have P-Value [Acc > NIR] : 0.2439 > 0.05, so whilst the accuracy of the classifier is 0.625 > 0.5865, the test sample size is not large enough for this increase over the baseline to be meaningful.
predict.lda()\(posterior gives a data frame of probability predictions, with one column per response class. predict.lda()\)classgives the class prediction for each observation (the class with the greatest probability). I check this below:
identical(as.character(predicted_lda$class),
as.character(ifelse(predicted_lda$posterior[ ,2] < 0.5, "Down", "Up")))
## [1] TRUE
Q: Repeat (d) using QDA.
A:
qda_dir <- qda(Direction ~ Lag2, data = train)
predicted_qda <- predict(qda_dir, newdata = test)
confusionMatrix(data = predicted_qda$class,
reference = test$Direction,
positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.5865
## Neg Pred Value : NaN
## Prevalence : 0.5865
## Detection Rate : 0.5865
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Up
##
Here we get an Accuracy of 0.5865.
Note that the QDA classifier just predicts Up for every test observation - it behaves identically to the naive classifier on this dataset, with a sensitivity of 1 and a specificity of 0.
Q: Repeat (d) using KNN with K = 1.
A:
Usually as a pre-processing step for KNN (with multiple predictors over different scales), we would want to standardize the predictors (xnew=x−μσ) so that each xnew will have a mean of 0 and standard deviation of 1. In this case, however, there is only 1 predictor (Lag2), so the nearest neighbour would not be effected by this.
Note also that there is some element of randomness with the KNN classifier. Take the following test observation that requires prediction:
test[100, "Lag2"]
## [1] 0.043
Notice that there are two train observations of identical distance (w.r.t Lag2), but both have different Direction values:
train[c(10, 808), c("Lag2", "Direction")]
## Lag2 Direction
## 10 0.041 Down
## 808 0.041 Up
In this case, the KNN probability will be 0.5:
set.seed(1)
predicted_knn <- knn(train = data.frame(Lag2 = train$Lag2),
test = data.frame(Lag2 = test$Lag2),
cl = train$Direction,
k = 1,
prob = T)
attr(predicted_knn, "prob")[100]
## [1] 0.5
However, the classifier must make a prediction, which will just be chosen at random:
predicted_knn[100]
## [1] Down
## Levels: Down Up
See the confusion matrix below:
confusionMatrix(data = predicted_knn,
reference = test$Direction,
positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 30
## Up 22 31
##
## Accuracy : 0.5
## 95% CI : (0.4003, 0.5997)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9700
##
## Kappa : -0.0033
##
## Mcnemar's Test P-Value : 0.3317
##
## Sensitivity : 0.5082
## Specificity : 0.4884
## Pos Pred Value : 0.5849
## Neg Pred Value : 0.4118
## Prevalence : 0.5865
## Detection Rate : 0.2981
## Detection Prevalence : 0.5096
## Balanced Accuracy : 0.4983
##
## 'Positive' Class : Up
##
Here we get an accuracy of 0.5, which is again worse than the baseline.
Q: Which of these methods appears to provide the best results on this data?
A:
Taking the target metric as the accuracy of the classifier: LDA & Logistic Regression get the same test accuracy of 0.625, so these two are tied.
Q: Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
A:
KNN - selecting best K (using cross-validation):
train$Today <- NULL
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5)
set.seed(111)
knn_train <- train(y = train$Direction,
x = train[ ,-8],
method = "knn",
metric = "Accuracy",
preProcess = c("center", "scale"),
tuneGrid = expand.grid(k = seq(1, 50, 2)),
trControl = ctrl)
caret::varImp(knn_train)
## ROC curve variable importance
##
## Importance
## Lag1 100.000
## Lag2 77.256
## Lag5 64.309
## Year 45.659
## Volume 43.735
## Week 42.513
## Lag4 4.578
## Lag3 0.000
knn_train
## k-Nearest Neighbors
##
## 985 samples
## 8 predictor
## 2 classes: 'Down', 'Up'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 788, 788, 788, 788, 788, 787, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.4947996 -0.020756148
## 3 0.5232477 0.031373990
## 5 0.5230292 0.028769372
## 7 0.5372435 0.053353378
## 9 0.5372415 0.049612758
## 11 0.5343834 0.040819116
## 13 0.5346081 0.037598994
## 15 0.5368437 0.040363712
## 17 0.5317675 0.026161027
## 19 0.5301555 0.021608137
## 21 0.5370622 0.033488872
## 23 0.5330064 0.024861594
## 25 0.5401182 0.036317004
## 27 0.5376693 0.029998090
## 29 0.5321890 0.015868622
## 31 0.5338310 0.018687256
## 33 0.5360563 0.021431639
## 35 0.5322138 0.012669022
## 37 0.5283538 0.002535516
## 39 0.5330239 0.011432961
## 41 0.5289589 0.002400751
## 43 0.5295618 0.003081904
## 45 0.5299700 0.003587628
## 47 0.5303854 0.003745293
## 49 0.5320056 0.005552184
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 25.
ggplot(knn_train) +
geom_smooth() +
theme_light() +
scale_y_continuous(labels = scales::percent_format()) +
ggtitle("KNN - 'K' Selection (5-repeated 5-fold cross-validation)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
caret automatically chooses the best value for k. Evaluating the performance of this new model on test:
knn_pred <- predict(knn_train, newdata = test)
confusionMatrix(data = knn_pred,
reference = test$Direction,
positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 19 20
## Up 24 41
##
## Accuracy : 0.5769
## 95% CI : (0.4761, 0.6732)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.6193
##
## Kappa : 0.1156
##
## Mcnemar's Test P-Value : 0.6511
##
## Sensitivity : 0.6721
## Specificity : 0.4419
## Pos Pred Value : 0.6308
## Neg Pred Value : 0.4872
## Prevalence : 0.5865
## Detection Rate : 0.3942
## Detection Prevalence : 0.6250
## Balanced Accuracy : 0.5570
##
## 'Positive' Class : Up
##
n this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth...
A:
I’ll just add the mpg01 variable to the existing Auto dataset.
Auto$mpg01 <- factor(as.numeric(Auto$mpg > median(Auto$mpg)))
table(Auto$mpg01)
##
## 0 1
## 196 196
A:
Excluding name (categorical with 301 unique values) and mpg (used to create mpg01), and converting origin to a factor:
Auto$name <- NULL
Auto$mpg <- NULL
Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))
glimpse(Auto)
## Rows: 392
## Columns: 8
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin <fct> American, American, American, American, American, Amer...
## $ mpg01 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
I chose a jitter plot (scatter plot with some random noise added) for the relationship between mpg01 & cylinders, since cylinders is discrete with only 3 unique values, so a box plot wasn’t very informative.
g1 <- ggplot(Auto, aes(x = mpg01, y = cylinders, col = mpg01)) +
geom_jitter() +
theme(legend.position = "none") +
ggtitle("Cylinders vs mpg01 - Jitter Plot")
g2 <- ggplot(Auto, aes(x = mpg01, y = displacement, fill = mpg01)) +
geom_boxplot() +
theme(legend.position = "none") +
ggtitle("Displacement vs mpg01 - Boxplot")
g3 <- ggplot(Auto, aes(x = mpg01, y = horsepower, fill = mpg01)) +
geom_boxplot() +
theme(legend.position = "none") +
ggtitle("Horsepower vs mpg01 - Boxplot")
g4 <- ggplot(Auto, aes(x = mpg01, y = weight, fill = mpg01)) +
geom_boxplot() +
theme(legend.position = "none") +
ggtitle("Weight vs mpg01 - Boxplot")
g5 <- ggplot(Auto, aes(x = mpg01, y = acceleration, fill = mpg01)) +
geom_boxplot() +
theme(legend.position = "none") +
ggtitle("Acceleration vs mpg01 - Boxplot")
g6 <- ggplot(Auto, aes(x = mpg01, y = year, fill = mpg01)) +
geom_boxplot() +
theme(legend.position = "none") +
ggtitle("Year vs mpg01 - Boxplot")
g7 <- ggplot(Auto, aes(x = origin, fill = mpg01)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent_format()) +
theme(axis.title.y = element_blank()) +
ggtitle("Origin vs mpg01 - Bar Plot")
grid.arrange(g1, g2, g3, g4, g5, g6, g7,
ncol = 2,
nrow = 4)
It’s tricky to compare predictive power between different graphs (e.g. scatter plots & box plots), but if I had to guess I would say that cylinders, displacement & weight are the three strongest predictors of mpg01.
horsepower looks like it might be a little weaker, with year & acceleration weaker still (more significant overlap of the distributions).
It is harder to judge how strong a predictor origin is visually when compared with the other variables, since we are looking at a relationship between two nominal categorical variables, but it certainly appears to be on the weaker side.
A:
I create train & test from the Auto dataset. I leave name & mpg excluded for now. train makes up 50% of the data:
set.seed(444)
index <- createDataPartition(y = Auto$mpg01, p = 0.5, list = F)
train <- Auto[index, ]
test <- Auto[-index, ]
nrow(train) / nrow(Auto)
## [1] 0.5
Q: Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
A:
For (d), (e), (f) & (g), I’ll use the top 4 strongest predictors I identified visually (cylinders, displacement, weight & horsepower).
Furthermore, as these questions are very similar to question 10 (b) - (g), I thought I would investigate how to train all of these models using train() from the caret package instead. This means we can easily report two error metrics too (the train cross-validation error, and the test error).
Unlike in the last question, with this dataset we are performing far better than the baseline classifier accuracy of 50%.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(1)
lda_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower,
data = train,
method = "lda",
trControl = ctrl)
train (cross-validation) error:
1 - lda_mpg$results$Accuracy # cv error
## [1] 0.09125731
test error:
predicted_lda <- predict(lda_mpg, newdata = test, type = "raw") # as opposed to type = "prob"
mean(predicted_lda != test$mpg01)
## [1] 0.122449
Q: Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
A:
set.seed(2)
qda_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower,
data = train,
method = "qda",
trControl = ctrl)
train (cross-validation) error:
1 - qda_mpg$results$Accuracy
## [1] 0.08991228
test error:
predicted_qda <- predict(qda_mpg, newdata = test, type = "raw")
mean(predicted_qda != test$mpg01)
## [1] 0.1173469
In this case, QDA appears to perform better than LDA with respect to test error, and slightly better in terms of CV error.
Q: Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
A:
set.seed(3)
log_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower,
data = train,
method = "glm",
family = "binomial",
trControl = ctrl)
train (cross-validation) error:
1 - log_mpg$results$Accuracy
## [1] 0.1084405
test error:
predicted_log <- predict(log_mpg, newdata = test, type = "raw")
mean(predicted_log != test$mpg01)
## [1] 0.1173469
Logistic Regression performs better than LDA & QDA with respect to CV error & test error.
Q: Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
A:
I pass a grid of hyperparameters (values for K) to caret::train(). caret will test each value for K using the cross-validation scheme defined by ctrl (3-repeated 10-fold cross-validation), automatically select the best value for K with respect to a target metric (out-of-sample accuracy), then retrain the KNN model on the full train dataset with this optimal value for K.
Specifying preProcess = c(“center”, “scale”) centers and scales (standardizes) the 4 predictors, which is necessary for distance-based methods such as KNN where the predictors are over different scales.
set.seed(4)
knn_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower,
data = train,
method = "knn",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneGrid = expand.grid(k = seq(1, 85, 3)))
knn_mpg
## k-Nearest Neighbors
##
## 196 samples
## 4 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 176, 177, 176, 176, 176, 176, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.8779532 0.7554422
## 4 0.8999805 0.7995964
## 7 0.9033138 0.8062631
## 10 0.9033138 0.8062631
## 13 0.9033138 0.8062631
## 16 0.9033138 0.8062631
## 19 0.9033138 0.8062631
## 22 0.9033138 0.8062631
## 25 0.9033138 0.8062631
## 28 0.9033138 0.8062631
## 31 0.9033138 0.8062631
## 34 0.9033138 0.8062631
## 37 0.9033138 0.8062631
## 40 0.9033138 0.8062631
## 43 0.9033138 0.8062631
## 46 0.9033138 0.8062631
## 49 0.9033138 0.8062631
## 52 0.9033138 0.8062631
## 55 0.9033138 0.8062631
## 58 0.9033138 0.8062631
## 61 0.9033138 0.8062631
## 64 0.9049805 0.8095964
## 67 0.9049805 0.8095964
## 70 0.9066472 0.8129297
## 73 0.9100682 0.8197621
## 76 0.9083138 0.8162631
## 79 0.9084016 0.8164288
## 82 0.9067349 0.8130955
## 85 0.9034016 0.8065072
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 73.
train (cross-validation) error:
1 - max(knn_mpg$results$Accuracy)
## [1] 0.08993177
test error:
predicted_knn <- predict(knn_mpg, newdata = test, type = "raw")
mean(predicted_knn != test$mpg01)
## [1] 0.1122449
The optimal value chosen was K = 7, although this model performed close to Logistic Regression in CV error, and slightly worse in test error.
There appears to be some quirk with the data where all classifiers with mid-large values for K (from 10 to 73) seem to be getting the same cross-validation accuracy scores, presumably because they are making identical predictions.
I think identifying why exactly this is happening would require some investigation, but I think it is likely due to the fact that I am testing large values of K relative to the sample size, causing all these classifiers to effectively behave identically (as opposed to a mistake in the code/package).
#13. APPLIED: The Boston Dataset Q: Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
A:
First, I’ll overwrite the crime rate variable (crim) with a binary factor variable, which takes the value 1 when crim is above the median, else 0.
The resulting dataset is shown below:
Boston$crim <- factor(ifelse(Boston$crim > median(as.numeric(Boston$crim)), 1, 0))
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5)
I start by fitting all 4 model types, using all predictors each time. I won’t use a train/test split, but will instead evaluate performance using cross-validation accuracy (10-fold, repeated 5 times).
Logistic Regression:
set.seed(1)
log_crim <- train(crim ~ .,
data = Boston,
method = "glm",
family = "binomial",
trControl = ctrl)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
log_crim
## Generalized Linear Model
##
## 506 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 456, 455, 456, 454, 456, 456, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9028549 0.8057423
LDA:
set.seed(2)
lda_crim <- train(crim ~ .,
data = Boston,
method = "lda",
trControl = ctrl)
lda_crim
## Linear Discriminant Analysis
##
## 506 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 456, 455, 456, 456, 455, 454, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8498998 0.6996644
QDA:
set.seed(3)
qda_crim <- train(crim ~ .,
data = Boston,
method = "qda",
trControl = ctrl)
qda_crim
## Quadratic Discriminant Analysis
##
## 506 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 456, 456, 455, 455, 455, 456, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8894561 0.7788975
KNN:
set.seed(4)
knn_crim <- train(crim ~ .,
data = Boston,
method = "knn",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneGrid = expand.grid(k = seq(1, 20, 2)))
knn_crim
## k-Nearest Neighbors
##
## 506 samples
## 13 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 455, 456, 456, 455, 455, 456, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9145970 0.8292052
## 3 0.9181104 0.8362439
## 5 0.9153345 0.8306748
## 7 0.9047128 0.8094000
## 9 0.8779164 0.7557756
## 11 0.8664404 0.7327928
## 13 0.8664323 0.7327720
## 15 0.8703940 0.7407136
## 17 0.8684169 0.7367859
## 19 0.8648404 0.7296460
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.
KNN actually performed the best, which is surprising since I’ve performed no variable selection, and KNN can be quite negatively impacted by weak predictors.
To combat this and try to improve the model further, I apply the following approach:
Use caret::varImp() to rank order all 13 the predictors from best → worst Train a KNN model on different predictor subsets (1 best, 2 best, 3 best, …) Evaluate what combination of predictors (and corresponding value for k) performs the best with the target metric (5-repeated 10-fold cross-validation accuracy) Below are the indexed variable importance measures provided by caret:
varImp(knn_crim)
## ROC curve variable importance
##
## Importance
## nox 100.00
## dis 86.05
## age 82.64
## indus 80.85
## tax 78.00
## rad 73.48
## lstat 59.43
## medv 50.63
## zn 47.35
## ptratio 45.59
## black 42.68
## rm 20.66
## chas 0.00
By ordering the variables in the dataset based on their importance, I can fit, tune & evaluate all models in a single small loop. Below is the output:
knn_best_to_worst <- as.data.frame(varImp(knn_crim)$importance) %>%
rownames_to_column("var") %>%
arrange(desc(X0)) %>%
pull(var)
Boston <- Boston %>%
dplyr::select(c(knn_best_to_worst, "crim"))
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(knn_best_to_worst)` instead of `knn_best_to_worst` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
cv_accuracy <- c()
chosen_k <- c()
set.seed(5)
for (i in 1:13) {
knn_crim_temp <- as.data.frame(Boston[ ,c(1:i, 14)])
knn_crim_temp <- train(crim ~ .,
data = knn_crim_temp,
method = "knn",
preProcess = c("center", "scale"),
trControl = ctrl,
tuneGrid = expand.grid(k = seq(1, 20, 2)))
cv_accuracy[i] <- max(knn_crim_temp$results$Accuracy)
chosen_k[i] <- as.numeric(knn_crim_temp$bestTune)
}
data.frame(pred_num = 1:13, cv_accuracy = cv_accuracy, chosen_k = chosen_k) %>%
arrange(desc(cv_accuracy))
## pred_num cv_accuracy chosen_k
## 1 1 0.9549303 1
## 2 10 0.9407704 5
## 3 11 0.9407192 3
## 4 6 0.9378703 1
## 5 5 0.9356350 5
## 6 4 0.9340434 3
## 7 7 0.9340048 1
## 8 9 0.9324929 1
## 9 12 0.9320217 5
## 10 8 0.9316078 3
## 11 13 0.9170664 5
## 12 2 0.8960546 1
## 13 3 0.8877502 7
Interestingly, the most effective model (by some margin) is also probably the simplest model we could possibly create - a KNN model with p = 1 and K = 1. This model is so simple that implementing it in a train, test framework can be described in 2 lines of text:
“For each new test suburb in which we want to predict the binarized crime rate (crim), look in train and find the suburb closest to it in its nitrogen oxide concentration (nox) and use that suburbs binarized crime rate as the prediction”.