Overview

This document replicates Section 4.1 (The Stock Market Data) and Section 4.2 (Logistic Regression) from the ISLR Tidymodels Labs — Chapter 4.


Load Libraries

library(tidymodels)
library(ISLR)      # For the Smarket data set
library(ISLR2)     # For the Bikeshare data set
library(discrim)
library(poissonreg)
library(corrr)
library(paletteer)

4.1 The Stock Market Data

We examine the Smarket dataset, which contains percentage returns for the S&P 500 index over 1,250 trading days from 2001 to 2005. The response variable Direction has two labels: "Up" and "Down".

Correlation Matrix

We first look at the correlations between the numeric variables, excluding the categorical Direction column.

cor_Smarket <- Smarket %>%
  select(-Direction) %>%
  correlate()

cor_Smarket
## # A tibble: 8 × 9
##   term      Year     Lag1     Lag2     Lag3     Lag4     Lag5  Volume    Today
##   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1 Year   NA       0.0297   0.0306   0.0332   0.0357   0.0298   0.539   0.0301 
## 2 Lag1    0.0297 NA       -0.0263  -0.0108  -0.00299 -0.00567  0.0409 -0.0262 
## 3 Lag2    0.0306 -0.0263  NA       -0.0259  -0.0109  -0.00356 -0.0434 -0.0103 
## 4 Lag3    0.0332 -0.0108  -0.0259  NA       -0.0241  -0.0188  -0.0418 -0.00245
## 5 Lag4    0.0357 -0.00299 -0.0109  -0.0241  NA       -0.0271  -0.0484 -0.00690
## 6 Lag5    0.0298 -0.00567 -0.00356 -0.0188  -0.0271  NA       -0.0220 -0.0349 
## 7 Volume  0.539   0.0409  -0.0434  -0.0418  -0.0484  -0.0220  NA       0.0146 
## 8 Today   0.0301 -0.0262  -0.0103  -0.00245 -0.00690 -0.0349   0.0146 NA

Visualise with rplot()

rplot(cor_Smarket, colours = c("indianred2", "black", "skyblue1"))
Correlation chart using rplot. Most values are near 0; Year and Volume are somewhat correlated.

Correlation chart using rplot. Most values are near 0; Year and Volume are somewhat correlated.

Heatmap-Style Correlation Chart

cor_Smarket %>%
  stretch() %>%
  ggplot(aes(x, y, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(fashion(r)))) +
  scale_fill_paletteer_c("scico::roma", limits = c(-1, 1), direction = -1)
Heatmap-style correlation chart. Most values are near 0; Year and Volume show the strongest correlation.

Heatmap-style correlation chart. Most values are near 0; Year and Volume show the strongest correlation.

Most variables are nearly uncorrelated with each other. The notable exception is Year and Volume, which show a moderate positive relationship.

Volume Over Time

Plotting Year against Volume confirms an upward trend in trading volume over time.

ggplot(Smarket, aes(Year, Volume)) +
  geom_jitter(height = 0)
Jittered scatter chart showing Volume increasing with Year.

Jittered scatter chart showing Volume increasing with Year.


4.2 Logistic Regression

We now fit a logistic regression model using parsnip. The goal is to predict Direction (whether the market goes Up or Down) based on the five previous days’ percentage returns (Lag1Lag5) and Volume.

Model Specification

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

Fit on Full Data

lr_fit <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket
  )

lr_fit
## parsnip model object
## 
## 
## Call:  stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
##   -0.126000    -0.073074    -0.042301     0.011085     0.009359     0.010313  
##      Volume  
##    0.135441  
## 
## Degrees of Freedom: 1249 Total (i.e. Null);  1243 Residual
## Null Deviance:       1731 
## Residual Deviance: 1728  AIC: 1742

Model Summary

lr_fit %>%
  pluck("fit") %>%
  summary()
## 
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

Tidy Coefficient Table

tidy(lr_fit)
## # A tibble: 7 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.126      0.241     -0.523   0.601
## 2 Lag1        -0.0731     0.0502    -1.46    0.145
## 3 Lag2        -0.0423     0.0501    -0.845   0.398
## 4 Lag3         0.0111     0.0499     0.222   0.824
## 5 Lag4         0.00936    0.0500     0.187   0.851
## 6 Lag5         0.0103     0.0495     0.208   0.835
## 7 Volume       0.135      0.158      0.855   0.392

None of the predictors have p-values below 0.05, suggesting limited predictive power in this full model.

Predictions (Class Labels)

predict(lr_fit, new_data = Smarket)
## # A tibble: 1,250 × 1
##    .pred_class
##    <fct>      
##  1 Up         
##  2 Down       
##  3 Down       
##  4 Up         
##  5 Up         
##  6 Up         
##  7 Down       
##  8 Up         
##  9 Up         
## 10 Down       
## # ℹ 1,240 more rows

Predictions (Probabilities)

predict(lr_fit, new_data = Smarket, type = "prob")
## # A tibble: 1,250 × 2
##    .pred_Down .pred_Up
##         <dbl>    <dbl>
##  1      0.493    0.507
##  2      0.519    0.481
##  3      0.519    0.481
##  4      0.485    0.515
##  5      0.489    0.511
##  6      0.493    0.507
##  7      0.507    0.493
##  8      0.491    0.509
##  9      0.482    0.518
## 10      0.511    0.489
## # ℹ 1,240 more rows

Confusion Matrix (Training Data)

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down  Up
##       Down  145 141
##       Up    457 507

Confusion Matrix Heatmap

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class) %>%
  autoplot(type = "heatmap")
Confusion matrix heatmap. The model tends to over-predict 'Up'.

Confusion matrix heatmap. The model tends to over-predict ‘Up’.

Accuracy (Training Data)

augment(lr_fit, new_data = Smarket) %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.522

The training accuracy is ~52.2%, barely above random chance.


Train / Test Split

Since the data has a time component, we split by year: train on data before 2005, test on 2005.

Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test  <- Smarket %>% filter(Year == 2005)

Fit on Training Data

lr_fit2 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket_train
  )

Evaluate on Test Data

augment(lr_fit2, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down Up
##       Down   77 97
##       Up     34 44
augment(lr_fit2, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.480

The test accuracy drops to ~48%, worse than random. Evaluating on held-out data gives a more honest picture of model performance.


Reduced Model (Lag1 + Lag2 Only)

Since most predictors had high p-values, we try a simpler model using only Lag1 and Lag2.

lr_fit3 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2,
    data = Smarket_train
  )

augment(lr_fit3, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down  Up
##       Down   35  35
##       Up     76 106
augment(lr_fit3, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.560

Accuracy improves to ~56%. Dropping irrelevant predictors reduces variance without increasing bias.


Predicting for New Observations

We predict the market direction for two specific scenarios:

Scenario Lag1 Lag2
1 1.2 1.1
2 1.5 −0.8
Smarket_new <- tibble(
  Lag1 = c(1.2, 1.5),
  Lag2 = c(1.1, -0.8)
)

predict(
  lr_fit3,
  new_data = Smarket_new,
  type = "prob"
)
## # A tibble: 2 × 2
##   .pred_Down .pred_Up
##        <dbl>    <dbl>
## 1      0.521    0.479
## 2      0.504    0.496

Both scenarios show a slightly higher predicted probability for Down than Up, though the probabilities are close to 50/50.


Summary

Model Data Used for Evaluation Accuracy
Full LR Training (in-sample) 52.2%
Full LR Test (2005) 48.0%
Reduced LR (Lag1+Lag2) Test (2005) 56.0%

Key takeaways: - In-sample accuracy is overly optimistic — always evaluate on held-out data. - Removing low-signal predictors can improve out-of-sample performance. - Even the best logistic regression here barely beats random chance, suggesting stock market returns are hard to predict from lag values alone.