10. APPLIED: The Weekly Dataset (Logistic, LDA, QDA, KNN)

This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

glimpse(Weekly)

## Rows: 1,089
## Columns: 9
## $ Year      <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 199...
## $ Lag1      <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807...
## $ Lag2      <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372...
## $ Lag3      <dbl> -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178...
## $ Lag4      <dbl> -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.71...
## $ Lag5      <dbl> -3.484, -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.5...
## $ Volume    <dbl> 0.1549760, 0.1485740, 0.1598375, 0.1616300, 0.1537280, 0....
## $ Today     <dbl> -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0.041...
## $ Direction <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up,...

(a) Data Summary

Q: Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

An initial glance at the numeric variables of the dataset:

pairs(Weekly[ ,-9])

abs(cor(Weekly[ ,-9]))

##              Year        Lag1       Lag2       Lag3        Lag4        Lag5
## Year   1.00000000 0.032289274 0.03339001 0.03000649 0.031127923 0.030519101
## Lag1   0.03228927 1.000000000 0.07485305 0.05863568 0.071273876 0.008183096
## Lag2   0.03339001 0.074853051 1.00000000 0.07572091 0.058381535 0.072499482
## Lag3   0.03000649 0.058635682 0.07572091 1.00000000 0.075395865 0.060657175
## Lag4   0.03112792 0.071273876 0.05838153 0.07539587 1.000000000 0.075675027
## Lag5   0.03051910 0.008183096 0.07249948 0.06065717 0.075675027 1.000000000
## Volume 0.84194162 0.064951313 0.08551314 0.06928771 0.061074617 0.058517414
## Today  0.03245989 0.075031842 0.05916672 0.07124364 0.007825873 0.011012698
##            Volume       Today
## Year   0.84194162 0.032459894
## Lag1   0.06495131 0.075031842
## Lag2   0.08551314 0.059166717
## Lag3   0.06928771 0.071243639
## Lag4   0.06107462 0.007825873
## Lag5   0.05851741 0.011012698
## Volume 1.00000000 0.033077783
## Today  0.03307778 1.000000000

As we would expect with stock market data, there are no obvious strong relationships between the Lag variables. However, there do appear to be some interesting trends over time. I create the Week variable below, allowing for easier plotting of trends, since there is a chronology to the rows that is not shown fully through the Year variable.

I first do a quick sense-check that the rows are in the correct order, based on the definition of Today and Lag1:

Weekly %>%
  filter(lead(Lag1) != Today) %>%
  nrow()

## [1] 0

Since there are no rows out of order, the dataset appears to be correctly ordered in ascending weeks, so I create Week (basically a row counter):

Weekly$Week <- 1:nrow(Weekly)

Looking at Volume over time, there has been a significant increase in the volume of shares traded since the 90’s. This appears to have peaked around 2009, starting to decrease in 2010. it would be interesting to see the S&P 500 stats since then.

year_breaks <- Weekly %>%
  group_by(Year) %>%
  summarize(Week = min(Week))

## `summarise()` ungrouping output (override with `.groups` argument)

ggplot(Weekly, aes(x = Week, y = Volume)) + 
  geom_line() + 
  geom_smooth() + 
  scale_x_continuous(breaks = year_breaks$Week, minor_breaks = NULL, labels = year_breaks$Year) + 
  labs(title = "Average Daily Shares Traded vs Time", 
       x = "Time") + 
  theme_light()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Here is Direction over time, which is less interesting. There appear to only be 4 years in which >= 50% of the weeks didn’t see a positive return (2000, 2001, 2002, 2008).

ggplot(Weekly, aes(x = Year, fill = Direction)) + 
  geom_bar(position = "fill") +
  geom_hline(yintercept = 0.5, col = "grey45") +
  scale_x_continuous(breaks = seq(1990, 2010), minor_breaks = NULL) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_light() + 
  theme(axis.title.y = element_blank(), 
        legend.position = "bottom") + 
  ggtitle("% of Up/Down Weeks vs Time")

The split of the weeks into Down & Up can be seen in the table below. We could get a classifier with 55.56% accuracy simply by predicting the S&P 500 return will be positive every week.

prop.table(table(Weekly$Direction))

## 
##      Down        Up 
## 0.4444444 0.5555556

We can also see that the market seems to go through periods of higher variance/instability. Crashes (e.g. Sept. 2008) stand out here.

ggplot(Weekly, aes(x = Week, y = Today / 100)) + 
  geom_line() + 
  scale_x_continuous(breaks = year_breaks$Week, minor_breaks = NULL, labels = year_breaks$Year) + 
  scale_y_continuous(labels = scales::percent_format(), breaks = seq(-0.2, 0.2, 0.05)) + 
  geom_hline(yintercept = 0, col = "grey55") +
  theme_light() + 
  labs(title = "Weekly Percentage Return vs Time", 
       x = "Time", 
       y = "Percentage Return")

(b) Logistic Regression (predict market Direction)

Q: Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

Lag2 appears to be the only statistically significant predictor:

glm_dir <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
               data = Weekly, 
               family = "binomial")

summary(glm_dir)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = "binomial", data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Note that the output from the Z-statistic (z value) is calculated the same as with the T-test in linear regression: Z=βj^SE(βj), and a large absolute value indicates evidence against the null hypothesis H0:βj=0 (again, the same).

(c) Confusion Matrix

Q: Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

The confusion matrix is shown below. I use caret::confusionMatrix(), as it has various confusion matrix statistics built in to its output.

predicted <- factor(ifelse(predict(glm_dir, type = "response") < 0.5, "Down", "Up"))
confusionMatrix(predicted, Weekly$Direction, positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down   54  48
##       Up    430 557
##                                          
##                Accuracy : 0.5611         
##                  95% CI : (0.531, 0.5908)
##     No Information Rate : 0.5556         
##     P-Value [Acc > NIR] : 0.369          
##                                          
##                   Kappa : 0.035          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9207         
##             Specificity : 0.1116         
##          Pos Pred Value : 0.5643         
##          Neg Pred Value : 0.5294         
##              Prevalence : 0.5556         
##          Detection Rate : 0.5115         
##    Detection Prevalence : 0.9063         
##       Balanced Accuracy : 0.5161         
##                                          
##        'Positive' Class : Up             
##

This is just over the baseline accuracy (55.56%) achieved by a naive classifier (that predicts Up every time). In fact, this is almost the strategy of the logistic regression model:

prop.table(table(predicted))

## predicted
##       Down         Up 
## 0.09366391 0.90633609

This is reflected in the very poor specificity (it does not predict the negative class well).

Note also that we are dealing with training accuracy here, so this marginal accuracy improvement over the baseline is not interesting.

(d) train and test - Logistic Regression

Q: Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

I create train and test, corresponding to the two time periods given in the question.

The confusion matrix is for the test predictions this time.

train <- Weekly[Weekly$Year <= 2008, ]
test <- Weekly[Weekly$Year > 2008, ]

glm_dir <- glm(Direction ~ Lag2, 
               data = train, 
               family = "binomial")

predicted <- factor(ifelse(predict(glm_dir, newdata = test, type = "response") < 0.5, "Down", "Up"))

confusionMatrix(predicted, test$Direction, positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
##

Here we get an Accuracy of 0.625.

The confusionMatrix() function provides a lot of other useful statistics. For example, No Information Rate : 0.5865 tells us that the largest class (Up) is 58.65% of the test observations, and hence this is our baseline for a classifier.

Clearly Accuracy : 0.625 > 0.5865, which is positive. However, our test dataset is relatively small so this might not be a meaningful improvement.

We are provided with a p-value for a one-sided test to see whether the accuracy is better than the “no information rate”. P-Value [Acc > NIR] : 0.2439 > 0.05 ⟹ no significant evidence that our classifier is better than the baseline strategy. Predicting stock movements is hard - who would’ve thought?

(e) train and test - LDA

Q: Repeat (d) using LDA.

lda_dir <- lda(Direction ~ Lag2, data = train)


predicted_lda <- predict(lda_dir, newdata = test)

confusionMatrix(data = predicted_lda$class,
                reference = test$Direction, 
                positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
##

Here we get an Accuracy of 0.625. Note that, as before, we have P-Value [Acc > NIR] : 0.2439 > 0.05, so whilst the accuracy of the classifier is 0.625 > 0.5865, the test sample size is not large enough for this increase over the baseline to be meaningful.

predict.lda()\(posterior gives a data frame of probability predictions, with one column per response class. predict.lda()\)classgives the class prediction for each observation (the class with the greatest probability). I check this below:

identical(as.character(predicted_lda$class), 
as.character(ifelse(predicted_lda$posterior[ ,2] < 0.5, "Down", "Up")))

## [1] TRUE

(f) train and test - QDA

Q: Repeat (d) using QDA.

qda_dir <- qda(Direction ~ Lag2, data = train)


predicted_qda <- predict(qda_dir, newdata = test)

confusionMatrix(data = predicted_qda$class, 
                reference = test$Direction, 
                positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
##

Here we get an Accuracy of 0.5865.

Note that the QDA classifier just predicts Up for every test observation - it behaves identically to the naive classifier on this dataset, with a sensitivity of 1 and a specificity of 0.

(g) train and test - KNN (K = 1)

Q: Repeat (d) using KNN with K = 1.

Usually as a pre-processing step for KNN (with multiple predictors over different scales), we would want to standardize the predictors (xnew=x−μσ) so that each xnew will have a mean of 0 and standard deviation of 1. In this case, however, there is only 1 predictor (Lag2), so the nearest neighbour would not be effected by this.

Note also that there is some element of randomness with the KNN classifier. Take the following test observation that requires prediction:

test[100, "Lag2"]

## [1] 0.043

Notice that there are two train observations of identical distance (w.r.t Lag2), but both have different Direction values:

train[c(10, 808), c("Lag2", "Direction")]

##      Lag2 Direction
## 10  0.041      Down
## 808 0.041        Up

In this case, the KNN probability will be 0.5:

set.seed(1)
predicted_knn <- knn(train = data.frame(Lag2 = train$Lag2), 
                  test = data.frame(Lag2 = test$Lag2), 
                  cl = train$Direction, 
                  k = 1, 
                  prob = T)

attr(predicted_knn, "prob")[100]

## [1] 0.5

However, the classifier must make a prediction, which will just be chosen at random:

predicted_knn[100]

## [1] Down
## Levels: Down Up

See the confusion matrix below:

confusionMatrix(data = predicted_knn, 
                reference = test$Direction, 
                positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   21 30
##       Up     22 31
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.4003, 0.5997)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.9700          
##                                           
##                   Kappa : -0.0033         
##                                           
##  Mcnemar's Test P-Value : 0.3317          
##                                           
##             Sensitivity : 0.5082          
##             Specificity : 0.4884          
##          Pos Pred Value : 0.5849          
##          Neg Pred Value : 0.4118          
##              Prevalence : 0.5865          
##          Detection Rate : 0.2981          
##    Detection Prevalence : 0.5096          
##       Balanced Accuracy : 0.4983          
##                                           
##        'Positive' Class : Up              
##

Here we get an accuracy of 0.5, which is again worse than the baseline.

(h) Best Performing Classifier?

Q: Which of these methods appears to provide the best results on this data?

Taking the target metric as the accuracy of the classifier: LDA & Logistic Regression get the same test accuracy of 0.625, so these two are tied.

i) Experimenting (combined predictors, interactions, transformations, etc.)

Q: Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

KNN - selecting best K (using cross-validation):

train$Today <- NULL

ctrl <- trainControl(method = "repeatedcv",
                     number = 5,
                     repeats = 5)

set.seed(111)

knn_train <- train(y = train$Direction,
                   x = train[ ,-8],
                   method = "knn",
                   metric = "Accuracy",
                   preProcess = c("center", "scale"),
                   tuneGrid = expand.grid(k = seq(1, 50, 2)),
                   trControl = ctrl)

caret::varImp(knn_train)

## ROC curve variable importance
## 
##        Importance
## Lag1      100.000
## Lag2       77.256
## Lag5       64.309
## Year       45.659
## Volume     43.735
## Week       42.513
## Lag4        4.578
## Lag3        0.000

knn_train

## k-Nearest Neighbors 
## 
## 985 samples
##   8 predictor
##   2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered (8), scaled (8) 
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 788, 788, 788, 788, 788, 787, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa       
##    1  0.4947996  -0.020756148
##    3  0.5232477   0.031373990
##    5  0.5230292   0.028769372
##    7  0.5372435   0.053353378
##    9  0.5372415   0.049612758
##   11  0.5343834   0.040819116
##   13  0.5346081   0.037598994
##   15  0.5368437   0.040363712
##   17  0.5317675   0.026161027
##   19  0.5301555   0.021608137
##   21  0.5370622   0.033488872
##   23  0.5330064   0.024861594
##   25  0.5401182   0.036317004
##   27  0.5376693   0.029998090
##   29  0.5321890   0.015868622
##   31  0.5338310   0.018687256
##   33  0.5360563   0.021431639
##   35  0.5322138   0.012669022
##   37  0.5283538   0.002535516
##   39  0.5330239   0.011432961
##   41  0.5289589   0.002400751
##   43  0.5295618   0.003081904
##   45  0.5299700   0.003587628
##   47  0.5303854   0.003745293
##   49  0.5320056   0.005552184
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 25.

ggplot(knn_train) +
  geom_smooth() +
  theme_light() +
  scale_y_continuous(labels = scales::percent_format()) +
  ggtitle("KNN - 'K' Selection (5-repeated 5-fold cross-validation)")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

caret automatically chooses the best value for k. Evaluating the performance of this new model on test:

knn_pred <- predict(knn_train, newdata = test)

confusionMatrix(data = knn_pred, 
                reference = test$Direction, 
                positive = "Up")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   19 20
##       Up     24 41
##                                           
##                Accuracy : 0.5769          
##                  95% CI : (0.4761, 0.6732)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.6193          
##                                           
##                   Kappa : 0.1156          
##                                           
##  Mcnemar's Test P-Value : 0.6511          
##                                           
##             Sensitivity : 0.6721          
##             Specificity : 0.4419          
##          Pos Pred Value : 0.6308          
##          Neg Pred Value : 0.4872          
##              Prevalence : 0.5865          
##          Detection Rate : 0.3942          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.5570          
##                                           
##        'Positive' Class : Up              
##

11. APPLIED: The Auto Dataset (LDA, QDA, Logistic, KNN)

n this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

glimpse(Auto)

## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth...

Q: Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

I’ll just add the mpg01 variable to the existing Auto dataset.

Auto$mpg01 <- factor(as.numeric(Auto$mpg > median(Auto$mpg)))

table(Auto$mpg01)

## 
##   0   1 
## 196 196

Visualizing Relationships with mpg01 Q: Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

Excluding name (categorical with 301 unique values) and mpg (used to create mpg01), and converting origin to a factor:

Auto$name <- NULL
Auto$mpg <- NULL

Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))

glimpse(Auto)

## Rows: 392
## Columns: 8
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin       <fct> American, American, American, American, American, Amer...
## $ mpg01        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...

I chose a jitter plot (scatter plot with some random noise added) for the relationship between mpg01 & cylinders, since cylinders is discrete with only 3 unique values, so a box plot wasn’t very informative.

g1 <- ggplot(Auto, aes(x = mpg01, y = cylinders, col = mpg01)) + 
  geom_jitter() + 
  theme(legend.position = "none") + 
  ggtitle("Cylinders vs mpg01 - Jitter Plot") 

g2 <- ggplot(Auto, aes(x = mpg01, y = displacement, fill = mpg01)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  ggtitle("Displacement vs mpg01 - Boxplot")

g3 <- ggplot(Auto, aes(x = mpg01, y = horsepower, fill = mpg01)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  ggtitle("Horsepower vs mpg01 - Boxplot")

g4 <- ggplot(Auto, aes(x = mpg01, y = weight, fill = mpg01)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  ggtitle("Weight vs mpg01 - Boxplot")

g5 <- ggplot(Auto, aes(x = mpg01, y = acceleration, fill = mpg01)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  ggtitle("Acceleration vs mpg01 - Boxplot")

g6 <- ggplot(Auto, aes(x = mpg01, y = year, fill = mpg01)) + 
  geom_boxplot() + 
  theme(legend.position = "none") + 
  ggtitle("Year vs mpg01 - Boxplot")

g7 <- ggplot(Auto, aes(x = origin, fill = mpg01)) + 
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent_format()) + 
  theme(axis.title.y = element_blank()) + 
  ggtitle("Origin vs mpg01 - Bar Plot") 


grid.arrange(g1, g2, g3, g4, g5, g6, g7, 
             ncol = 2, 
             nrow = 4)

It’s tricky to compare predictive power between different graphs (e.g. scatter plots & box plots), but if I had to guess I would say that cylinders, displacement & weight are the three strongest predictors of mpg01.

horsepower looks like it might be a little weaker, with year & acceleration weaker still (more significant overlap of the distributions).

It is harder to judge how strong a predictor origin is visually when compared with the other variables, since we are looking at a relationship between two nominal categorical variables, but it certainly appears to be on the weaker side.

train & test Q: Split the data into a training set and a test set.

I create train & test from the Auto dataset. I leave name & mpg excluded for now. train makes up 50% of the data:

set.seed(444)
index <- createDataPartition(y = Auto$mpg01, p = 0.5, list = F)

train <- Auto[index, ]
test <- Auto[-index, ]

nrow(train) / nrow(Auto)

## [1] 0.5

(d) Predicting mpg01 - LDA

Q: Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

For (d), (e), (f) & (g), I’ll use the top 4 strongest predictors I identified visually (cylinders, displacement, weight & horsepower).

Furthermore, as these questions are very similar to question 10 (b) - (g), I thought I would investigate how to train all of these models using train() from the caret package instead. This means we can easily report two error metrics too (the train cross-validation error, and the test error).

Unlike in the last question, with this dataset we are performing far better than the baseline classifier accuracy of 50%.

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

set.seed(1)

lda_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower, 
                 data = train, 
                 method = "lda", 
                 trControl = ctrl)

train (cross-validation) error:

1 - lda_mpg$results$Accuracy # cv error

## [1] 0.09125731

test error:

predicted_lda <- predict(lda_mpg, newdata = test, type = "raw") # as opposed to type = "prob"

mean(predicted_lda != test$mpg01)

## [1] 0.122449

(e) Predicting mpg01 - QDA

Q: Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

set.seed(2)

qda_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower, 
                 data = train, 
                 method = "qda", 
                 trControl = ctrl)

train (cross-validation) error:

1 - qda_mpg$results$Accuracy

## [1] 0.08991228

test error:

predicted_qda <- predict(qda_mpg, newdata = test, type = "raw")

mean(predicted_qda != test$mpg01)

## [1] 0.1173469

In this case, QDA appears to perform better than LDA with respect to test error, and slightly better in terms of CV error.

(f) Predicting mpg01 - Logistic Regression

Q: Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

set.seed(3)

log_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower, 
                 data = train, 
                 method = "glm", 
                 family = "binomial", 
                 trControl = ctrl)

train (cross-validation) error:

1 - log_mpg$results$Accuracy

## [1] 0.1084405

test error:

predicted_log <- predict(log_mpg, newdata = test, type = "raw")

mean(predicted_log != test$mpg01)

## [1] 0.1173469

Logistic Regression performs better than LDA & QDA with respect to CV error & test error.

(g) Predicting mpg01 - KNN

Q: Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

I pass a grid of hyperparameters (values for K) to caret::train(). caret will test each value for K using the cross-validation scheme defined by ctrl (3-repeated 10-fold cross-validation), automatically select the best value for K with respect to a target metric (out-of-sample accuracy), then retrain the KNN model on the full train dataset with this optimal value for K.

Specifying preProcess = c(“center”, “scale”) centers and scales (standardizes) the 4 predictors, which is necessary for distance-based methods such as KNN where the predictors are over different scales.

set.seed(4)

knn_mpg <- train(mpg01 ~ cylinders + displacement + weight + horsepower, 
                 data = train, 
                 method = "knn", 
                 preProcess = c("center", "scale"), 
                 trControl = ctrl, 
                 tuneGrid = expand.grid(k = seq(1, 85, 3)))

knn_mpg

## k-Nearest Neighbors 
## 
## 196 samples
##   4 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 176, 177, 176, 176, 176, 176, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.8779532  0.7554422
##    4  0.8999805  0.7995964
##    7  0.9033138  0.8062631
##   10  0.9033138  0.8062631
##   13  0.9033138  0.8062631
##   16  0.9033138  0.8062631
##   19  0.9033138  0.8062631
##   22  0.9033138  0.8062631
##   25  0.9033138  0.8062631
##   28  0.9033138  0.8062631
##   31  0.9033138  0.8062631
##   34  0.9033138  0.8062631
##   37  0.9033138  0.8062631
##   40  0.9033138  0.8062631
##   43  0.9033138  0.8062631
##   46  0.9033138  0.8062631
##   49  0.9033138  0.8062631
##   52  0.9033138  0.8062631
##   55  0.9033138  0.8062631
##   58  0.9033138  0.8062631
##   61  0.9033138  0.8062631
##   64  0.9049805  0.8095964
##   67  0.9049805  0.8095964
##   70  0.9066472  0.8129297
##   73  0.9100682  0.8197621
##   76  0.9083138  0.8162631
##   79  0.9084016  0.8164288
##   82  0.9067349  0.8130955
##   85  0.9034016  0.8065072
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 73.

train (cross-validation) error:

1 - max(knn_mpg$results$Accuracy)

## [1] 0.08993177

test error:

predicted_knn <- predict(knn_mpg, newdata = test, type = "raw")

mean(predicted_knn != test$mpg01)

## [1] 0.1122449

The optimal value chosen was K = 7, although this model performed close to Logistic Regression in CV error, and slightly worse in test error.

There appears to be some quirk with the data where all classifiers with mid-large values for K (from 10 to 73) seem to be getting the same cross-validation accuracy scores, presumably because they are making identical predictions.

I think identifying why exactly this is happening would require some investigation, but I think it is likely due to the fact that I am testing large values of K relative to the sample size, causing all these classifiers to effectively behave identically (as opposed to a mistake in the code/package).

#13. APPLIED: The Boston Dataset Q: Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

First, I’ll overwrite the crime rate variable (crim) with a binary factor variable, which takes the value 1 when crim is above the median, else 0.

The resulting dataset is shown below:

Boston$crim <- factor(ifelse(Boston$crim > median(as.numeric(Boston$crim)), 1, 0))


glimpse(Boston)

## Rows: 506
## Columns: 14
## $ crim    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5)

I start by fitting all 4 model types, using all predictors each time. I won’t use a train/test split, but will instead evaluate performance using cross-validation accuracy (10-fold, repeated 5 times).

Logistic Regression:

set.seed(1)
log_crim <- train(crim ~ ., 
                 data = Boston, 
                 method = "glm", 
                 family = "binomial", 
                 trControl = ctrl)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

log_crim

## Generalized Linear Model 
## 
## 506 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 456, 455, 456, 454, 456, 456, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9028549  0.8057423

LDA:

set.seed(2)
lda_crim <- train(crim ~ ., 
                 data = Boston, 
                 method = "lda", 
                 trControl = ctrl)

lda_crim

## Linear Discriminant Analysis 
## 
## 506 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 456, 455, 456, 456, 455, 454, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8498998  0.6996644

QDA:

set.seed(3)
qda_crim <- train(crim ~ ., 
                 data = Boston, 
                 method = "qda", 
                 trControl = ctrl)

qda_crim

## Quadratic Discriminant Analysis 
## 
## 506 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 456, 456, 455, 455, 455, 456, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8894561  0.7788975

KNN:

set.seed(4)
knn_crim <- train(crim ~ ., 
                 data = Boston, 
                 method = "knn", 
                 preProcess = c("center", "scale"), 
                 trControl = ctrl, 
                 tuneGrid = expand.grid(k = seq(1, 20, 2)))

knn_crim

## k-Nearest Neighbors 
## 
## 506 samples
##  13 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 455, 456, 456, 455, 455, 456, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9145970  0.8292052
##    3  0.9181104  0.8362439
##    5  0.9153345  0.8306748
##    7  0.9047128  0.8094000
##    9  0.8779164  0.7557756
##   11  0.8664404  0.7327928
##   13  0.8664323  0.7327720
##   15  0.8703940  0.7407136
##   17  0.8684169  0.7367859
##   19  0.8648404  0.7296460
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.

KNN actually performed the best, which is surprising since I’ve performed no variable selection, and KNN can be quite negatively impacted by weak predictors.

To combat this and try to improve the model further, I apply the following approach:

Use caret::varImp() to rank order all 13 the predictors from best → worst Train a KNN model on different predictor subsets (1 best, 2 best, 3 best, …) Evaluate what combination of predictors (and corresponding value for k) performs the best with the target metric (5-repeated 10-fold cross-validation accuracy) Below are the indexed variable importance measures provided by caret:

varImp(knn_crim)

## ROC curve variable importance
## 
##         Importance
## nox         100.00
## dis          86.05
## age          82.64
## indus        80.85
## tax          78.00
## rad          73.48
## lstat        59.43
## medv         50.63
## zn           47.35
## ptratio      45.59
## black        42.68
## rm           20.66
## chas          0.00

By ordering the variables in the dataset based on their importance, I can fit, tune & evaluate all models in a single small loop. Below is the output:

knn_best_to_worst <- as.data.frame(varImp(knn_crim)$importance) %>%
  rownames_to_column("var") %>%
  arrange(desc(X0)) %>%
  pull(var)

Boston <- Boston %>%
  dplyr::select(c(knn_best_to_worst, "crim"))

## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(knn_best_to_worst)` instead of `knn_best_to_worst` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

cv_accuracy <- c()
chosen_k <- c()

set.seed(5)

for (i in 1:13) {
  knn_crim_temp <- as.data.frame(Boston[ ,c(1:i, 14)])
  knn_crim_temp <- train(crim ~ ., 
                         data = knn_crim_temp,
                         method = "knn", 
                         preProcess = c("center", "scale"), 
                         trControl = ctrl, 
                         tuneGrid = expand.grid(k = seq(1, 20, 2)))
  
  cv_accuracy[i] <- max(knn_crim_temp$results$Accuracy)
  chosen_k[i] <- as.numeric(knn_crim_temp$bestTune)
}

data.frame(pred_num = 1:13, cv_accuracy = cv_accuracy, chosen_k = chosen_k) %>%
  arrange(desc(cv_accuracy))

##    pred_num cv_accuracy chosen_k
## 1         1   0.9549303        1
## 2        10   0.9407704        5
## 3        11   0.9407192        3
## 4         6   0.9378703        1
## 5         5   0.9356350        5
## 6         4   0.9340434        3
## 7         7   0.9340048        1
## 8         9   0.9324929        1
## 9        12   0.9320217        5
## 10        8   0.9316078        3
## 11       13   0.9170664        5
## 12        2   0.8960546        1
## 13        3   0.8877502        7

Interestingly, the most effective model (by some margin) is also probably the simplest model we could possibly create - a KNN model with p = 1 and K = 1. This model is so simple that implementing it in a train, test framework can be described in 2 lines of text:

“For each new test suburb in which we want to predict the binarized crime rate (crim), look in train and find the suburb closest to it in its nitrogen oxide concentration (nox) and use that suburbs binarized crime rate as the prediction”.

Assignment #3

Sanyogita Apte - jlh562

3/4/2021

10. APPLIED: The Weekly Dataset (Logistic, LDA, QDA, KNN)

(a) Data Summary

(b) Logistic Regression (predict market Direction)

(c) Confusion Matrix

(d) train and test - Logistic Regression

(e) train and test - LDA

(f) train and test - QDA

(g) train and test - KNN (K = 1)

(h) Best Performing Classifier?

i) Experimenting (combined predictors, interactions, transformations, etc.)

11. APPLIED: The Auto Dataset (LDA, QDA, Logistic, KNN)

(d) Predicting mpg01 - LDA

(e) Predicting mpg01 - QDA

(f) Predicting mpg01 - Logistic Regression

(g) Predicting mpg01 - KNN