Overview

This document replicates Sections 4.1 (The Stock Market Data) and 4.2 (Logistic Regression) from the ISLR tidymodels labs – Chapter 4 Classification.

We use the Smarket dataset from the ISLR package, which contains daily percentage returns for the S&P 500 from 2001 to 2005, and try to predict whether the market went Up or Down on a given day.


Load Libraries

library(tidymodels)   # Core modeling framework
library(ISLR)         # Smarket dataset
library(ISLR2)        # Additional datasets
library(discrim)      # LDA / QDA / Naive Bayes models
library(poissonreg)   # Poisson regression
library(corrr)        # Correlation analysis
library(paletteer)    # Color palettes

4.1 The Stock Market Data

Explore the Dataset

glimpse(Smarket)
## Rows: 1,250
## Columns: 9
## $ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume    <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …

The dataset has 1250 rows and 9 columns.
Key variables:

  • Lag1Lag5: Percentage returns for the previous 1–5 days
  • Volume: Number of shares traded (in billions)
  • Today: Percentage return for today
  • Direction: Whether the market went Up or Down (our response variable)
head(Smarket)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up

Correlation Analysis

We remove Direction (not numeric) before computing correlations.

cor_Smarket <- Smarket %>%
  select(-Direction) %>%
  correlate()

cor_Smarket
## # A tibble: 8 × 9
##   term      Year     Lag1     Lag2     Lag3     Lag4     Lag5  Volume    Today
##   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1 Year   NA       0.0297   0.0306   0.0332   0.0357   0.0298   0.539   0.0301 
## 2 Lag1    0.0297 NA       -0.0263  -0.0108  -0.00299 -0.00567  0.0409 -0.0262 
## 3 Lag2    0.0306 -0.0263  NA       -0.0259  -0.0109  -0.00356 -0.0434 -0.0103 
## 4 Lag3    0.0332 -0.0108  -0.0259  NA       -0.0241  -0.0188  -0.0418 -0.00245
## 5 Lag4    0.0357 -0.00299 -0.0109  -0.0241  NA       -0.0271  -0.0484 -0.00690
## 6 Lag5    0.0298 -0.00567 -0.00356 -0.0188  -0.0271  NA       -0.0220 -0.0349 
## 7 Volume  0.539   0.0409  -0.0434  -0.0418  -0.0484  -0.0220  NA       0.0146 
## 8 Today   0.0301 -0.0262  -0.0103  -0.00245 -0.00690 -0.0349   0.0146 NA

Visualisation 1 – rplot()

rplot(cor_Smarket, colours = c("indianred2", "black", "skyblue1"))

Most variables are nearly uncorrelated with each other. The one notable pair is Year and Volume, which show a moderate positive correlation.

Visualisation 2 – Heatmap

A manual heatmap-style correlation chart using paletteer.

cor_Smarket %>%
  stretch() %>%
  ggplot(aes(x, y, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(fashion(r)))) +
  scale_fill_paletteer_c("scico::roma", limits = c(-1, 1), direction = -1) +
  labs(title = "Correlation heatmap – Smarket variables",
       x = NULL, y = NULL, fill = "r")

Volume vs Year

If we plot Year against Volume we can see the upward trend in trading volume over time.

ggplot(Smarket, aes(Year, Volume)) +
  geom_jitter(height = 0, alpha = 0.4, colour = "steelblue") +
  labs(title = "Trading Volume increases over time",
       x = "Year", y = "Volume (billions of shares)")


4.2 Logistic Regression

Model Specification

We use logistic_reg() from parsnip with the glm engine and classification mode (both are the defaults).

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_spec
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

Fit on Full Data

We model Direction as a function of the five lagged returns and Volume.

lr_fit <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket
  )

lr_fit
## parsnip model object
## 
## 
## Call:  stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
##   -0.126000    -0.073074    -0.042301     0.011085     0.009359     0.010313  
##      Volume  
##    0.135441  
## 
## Degrees of Freedom: 1249 Total (i.e. Null);  1243 Residual
## Null Deviance:       1731 
## Residual Deviance: 1728  AIC: 1742

Model Summary

lr_fit %>%
  pluck("fit") %>%
  summary()
## 
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

Tidy Coefficients

tidy(lr_fit)
## # A tibble: 7 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.126      0.241     -0.523   0.601
## 2 Lag1        -0.0731     0.0502    -1.46    0.145
## 3 Lag2        -0.0423     0.0501    -0.845   0.398
## 4 Lag3         0.0111     0.0499     0.222   0.824
## 5 Lag4         0.00936    0.0500     0.187   0.851
## 6 Lag5         0.0103     0.0495     0.208   0.835
## 7 Volume       0.135      0.158      0.855   0.392

None of the p-values are below 0.05, suggesting that none of the predictors have a strong individual relationship with the direction of the market in this model.

Predictions on Full Data

Class Predictions

predict(lr_fit, new_data = Smarket)
## # A tibble: 1,250 × 1
##    .pred_class
##    <fct>      
##  1 Up         
##  2 Down       
##  3 Down       
##  4 Up         
##  5 Up         
##  6 Up         
##  7 Down       
##  8 Up         
##  9 Up         
## 10 Down       
## # ℹ 1,240 more rows

Probability Predictions

predict(lr_fit, new_data = Smarket, type = "prob")
## # A tibble: 1,250 × 2
##    .pred_Down .pred_Up
##         <dbl>    <dbl>
##  1      0.493    0.507
##  2      0.519    0.481
##  3      0.519    0.481
##  4      0.485    0.515
##  5      0.489    0.511
##  6      0.493    0.507
##  7      0.507    0.493
##  8      0.491    0.509
##  9      0.482    0.518
## 10      0.511    0.489
## # ℹ 1,240 more rows

Confusion Matrix

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down  Up
##       Down  145 141
##       Up    457 507
augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(title = "Confusion Matrix – Logistic Regression (Full Data)")

Accuracy

augment(lr_fit, new_data = Smarket) %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.522

The accuracy is ~52.2%, barely above chance. This is expected since we are evaluating on the same data used to train the model.


Train / Test Split by Year

A more realistic evaluation: train on 2001–2004, test on 2005.

Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test  <- Smarket %>% filter(Year == 2005)

cat("Training rows:", nrow(Smarket_train), "\n")
## Training rows: 998
cat("Test rows:    ", nrow(Smarket_test),  "\n")
## Test rows:     252

Fit Model 2 (All Predictors, Training Data)

lr_fit2 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket_train
  )

Evaluate on Test Data

augment(lr_fit2, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down Up
##       Down   77 97
##       Up     34 44
augment(lr_fit2, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.480

Accuracy drops to ~48% on unseen data — the model is worse than random chance when evaluated out-of-sample.


Reduced Model: Lag1 + Lag2 Only

The p-values from the full model suggested Lag1 and Lag2 had the most signal. Removing the other predictors should reduce variance.

lr_fit3 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2,
    data = Smarket_train
  )

Confusion Matrix

augment(lr_fit3, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down  Up
##       Down   35  35
##       Up     76 106
augment(lr_fit3, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(title = "Confusion Matrix – Reduced Logistic Regression (Lag1 + Lag2)")

Accuracy

augment(lr_fit3, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.560

Accuracy improves to ~56% — removing the noisy predictors helped the model generalise better.


Predict for New Scenarios

We predict Direction for two hypothetical days:

Day Lag1 Lag2
1 1.2 1.1
2 1.5 −0.8
Smarket_new <- tibble(
  Lag1 = c(1.2, 1.5),
  Lag2 = c(1.1, -0.8)
)

predict(lr_fit3, new_data = Smarket_new, type = "prob")
## # A tibble: 2 × 2
##   .pred_Down .pred_Up
##        <dbl>    <dbl>
## 1      0.521    0.479
## 2      0.504    0.496

Both scenarios show a slightly higher probability of Down for the market — reflecting the slight negative coefficients for Lag1 and Lag2.


Summary

Model Data Used Accuracy
Logistic Regression (all 6 predictors) Full data (train = test) 52.2%
Logistic Regression (all 6 predictors) Train 2001–2004, Test 2005 48.0%
Logistic Regression (Lag1 + Lag2 only) Train 2001–2004, Test 2005 56.0%

Key takeaways:

  1. The Smarket variables are largely uncorrelated, except Year and Volume.
  2. Fitting and evaluating on the same data is overly optimistic — always hold out a test set.
  3. A simpler model using only Lag1 and Lag2 outperformed the full model on held-out data, demonstrating the bias-variance trade-off in action.
  4. Even the best logistic regression model here achieves only ~56% accuracy, which is modest — predicting stock market direction is genuinely hard!