Overview

This document replicates Sections 4.1 (The Stock Market Data) and 4.2 (Logistic Regression) from Emil Hvitfeldt’s ISLR tidymodels labs.


Load Libraries

library(tidymodels)
library(ISLR)     # For the Smarket data set
library(ISLR2)    # For the Bikeshare data set
library(discrim)
library(poissonreg)
library(corrr)

4.1 The Stock Market Data

We will examine the Smarket data set, which contains a number of numeric variables plus a variable called Direction with labels "Up" and "Down".

Correlation Matrix

We use the corrr package to explore correlations between variables. We remove Direction since it is not numeric.

cor_Smarket <- Smarket %>%
  select(-Direction) %>%
  correlate()

cor_Smarket
## # A tibble: 8 × 9
##   term      Year     Lag1     Lag2     Lag3     Lag4     Lag5  Volume    Today
##   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1 Year   NA       0.0297   0.0306   0.0332   0.0357   0.0298   0.539   0.0301 
## 2 Lag1    0.0297 NA       -0.0263  -0.0108  -0.00299 -0.00567  0.0409 -0.0262 
## 3 Lag2    0.0306 -0.0263  NA       -0.0259  -0.0109  -0.00356 -0.0434 -0.0103 
## 4 Lag3    0.0332 -0.0108  -0.0259  NA       -0.0241  -0.0188  -0.0418 -0.00245
## 5 Lag4    0.0357 -0.00299 -0.0109  -0.0241  NA       -0.0271  -0.0484 -0.00690
## 6 Lag5    0.0298 -0.00567 -0.00356 -0.0188  -0.0271  NA       -0.0220 -0.0349 
## 7 Volume  0.539   0.0409  -0.0434  -0.0418  -0.0484  -0.0220  NA       0.0146 
## 8 Today   0.0301 -0.0262  -0.0103  -0.00245 -0.00690 -0.0349   0.0146 NA

Correlation Plot (rplot)

rplot(cor_Smarket, colours = c("indianred2", "black", "skyblue1"))
Correlation chart. Most values are very close to 0. Year and Volume appear quite correlated.

Correlation chart. Most values are very close to 0. Year and Volume appear quite correlated.

Most variables are nearly uncorrelated with each other. The notable exception is Year and Volume, which show a modest positive correlation.

Heatmap-Style Correlation Chart

cor_Smarket %>%
  stretch() %>%
  ggplot(aes(x, y, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(fashion(r)))) +
  scale_fill_gradient2(low = "skyblue1", mid = "pink", high = "purple2",
                       midpoint = 0, limits = c(-1, 1))
Heatmap correlation chart.

Heatmap correlation chart.

Year vs Volume

If we plot Year against Volume we can see an upward trend in trading volume over time.

ggplot(Smarket, aes(Year, Volume)) +
  geom_jitter(height = 0)
Jittered scatter plot of Year vs Volume showing upward trend.

Jittered scatter plot of Year vs Volume showing upward trend.


4.2 Logistic Regression

Model Specification

We use logistic_reg() from the parsnip package with the glm engine.

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

Fit on Full Data

We model Direction based on the 5 previous days’ percentage returns plus volume of shares traded.

lr_fit <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket
  )

lr_fit
## parsnip model object
## 
## 
## Call:  stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
##   -0.126000    -0.073074    -0.042301     0.011085     0.009359     0.010313  
##      Volume  
##    0.135441  
## 
## Degrees of Freedom: 1249 Total (i.e. Null);  1243 Residual
## Null Deviance:       1731 
## Residual Deviance: 1728  AIC: 1742

GLM Summary

lr_fit %>%
  pluck("fit") %>%
  summary()
## 
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

Tidy Coefficients

tidy(lr_fit)
## # A tibble: 7 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.126      0.241     -0.523   0.601
## 2 Lag1        -0.0731     0.0502    -1.46    0.145
## 3 Lag2        -0.0423     0.0501    -0.845   0.398
## 4 Lag3         0.0111     0.0499     0.222   0.824
## 5 Lag4         0.00936    0.0500     0.187   0.851
## 6 Lag5         0.0103     0.0495     0.208   0.835
## 7 Volume       0.135      0.158      0.855   0.392

Predictions on Training Data

Class Predictions

predict(lr_fit, new_data = Smarket)
## # A tibble: 1,250 × 1
##    .pred_class
##    <fct>      
##  1 Up         
##  2 Down       
##  3 Down       
##  4 Up         
##  5 Up         
##  6 Up         
##  7 Down       
##  8 Up         
##  9 Up         
## 10 Down       
## # ℹ 1,240 more rows

Probability Predictions

predict(lr_fit, new_data = Smarket, type = "prob")
## # A tibble: 1,250 × 2
##    .pred_Down .pred_Up
##         <dbl>    <dbl>
##  1      0.493    0.507
##  2      0.519    0.481
##  3      0.519    0.481
##  4      0.485    0.515
##  5      0.489    0.511
##  6      0.493    0.507
##  7      0.507    0.493
##  8      0.491    0.509
##  9      0.482    0.518
## 10      0.511    0.489
## # ℹ 1,240 more rows

Confusion Matrix (Full Data)

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down  Up
##       Down  145 141
##       Up    457 507

Heatmap Visualization of Confusion Matrix

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class) %>%
  autoplot(type = "heatmap")
Confusion matrix heatmap. The model tends to over-predict 'Up'.

Confusion matrix heatmap. The model tends to over-predict ‘Up’.

Accuracy

augment(lr_fit, new_data = Smarket) %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.522

The in-sample accuracy is only ~52%, which is not impressive.


Train/Test Split by Year

To get a more realistic assessment, we split by year: train on all years except 2005, test on 2005.

Smarket_train <- Smarket %>%
  filter(Year != 2005)

Smarket_test <- Smarket %>%
  filter(Year == 2005)

dim(Smarket_train)
## [1] 998   9
dim(Smarket_test)
## [1] 252   9

Fit on Training Data (All Predictors)

lr_fit2 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket_train
  )

Evaluate on Test Data

augment(lr_fit2, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down Up
##       Down   77 97
##       Up     34 44
augment(lr_fit2, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.480

The out-of-sample accuracy drops to ~48%, worse than random chance — this model is not generalizing well.


Reduced Model: Only Lag1 and Lag2

Since most predictors had large p-values, we drop them and keep only Lag1 and Lag2.

lr_fit3 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2,
    data = Smarket_train
  )

augment(lr_fit3, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down  Up
##       Down   35  35
##       Up     76 106
augment(lr_fit3, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.560

Accuracy improves to ~56% with just two predictors — removing irrelevant variables reduces variance.


Predicting on New Data

We predict Direction for two hypothetical scenarios:

Scenario Lag1 Lag2
1 1.2 1.1
2 1.5 -0.8
Smarket_new <- tibble(
  Lag1 = c(1.2, 1.5),
  Lag2 = c(1.1, -0.8)
)

predict(
  lr_fit3,
  new_data = Smarket_new,
  type = "prob"
)
## # A tibble: 2 × 2
##   .pred_Down .pred_Up
##        <dbl>    <dbl>
## 1      0.521    0.479
## 2      0.504    0.496

Both scenarios predict a slightly higher probability of "Down" — consistent with the pattern observed in the data.


Summary

Section Key Finding
4.1 Year and Volume are the most correlated pair; other variables are nearly uncorrelated
4.2 (full model, in-sample) Accuracy ~52%; model is over-predicting "Up"
4.2 (full model, test) Accuracy drops to ~48% out-of-sample
4.2 (reduced model, test) Accuracy improves to ~56% using only Lag1 + Lag2

Conclusion: For the Smarket data, a simpler logistic regression model with fewer predictors generalizes better. However, predicting stock market direction remains inherently difficult.


Replicated from ISLR tidymodels labs by Emil Hvitfeldt