1 Introduction

This document replicates Sections 4.1 and 4.2 of the ISLR Tidymodels Labs — Chapter 4 Classification.

We use the tidymodels framework to explore the Smarket data set and fit logistic regression models for predicting stock market direction.

2 Load Packages

library(tidymodels)   # Core modelling framework
library(ISLR)         # Contains the Smarket data set
library(discrim)      # For LDA / QDA / Naive Bayes in later lab sections (not used here)

3 Section 4.1 — The Stock Market Data

3.1 Overview of the Data

The Smarket data set records percentage returns for the S&P 500 stock index over 1,250 trading days from 2001 to 2005. Variables include:

Year — the year of the observation
Lag1 through Lag5 — percentage returns for each of the five previous days
Volume — number of shares traded the previous day (in billions)
Today — the percentage return on the current day
Direction — whether the market went Up or Down that day (response)

glimpse(Smarket)

## Rows: 1,250
## Columns: 9
## $ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume    <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …
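The glimpse output suggests that each row's Lag1 is the previous row's Today (for example, 0.959 appears as Today in row 1 and as Lag1 in row 2). A minimal sketch for checking how often this holds across the data set, assuming the rows are ordered by trading day; the variable names here are illustrative:

```r
library(ISLR)  # provides the Smarket data frame

n <- nrow(Smarket)

# Compare each row's Lag1 with the previous row's Today.
lag1_matches <- Smarket$Lag1[-1] == Smarket$Today[-n]

# Share of rows where the two agree; a value near 1 supports the reading
# that the rows are consecutive trading days.
mean(lag1_matches)
```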

3.2 Summary Statistics

summary(Smarket)

##       Year           Lag1                Lag2                Lag3          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
##  Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
##       Lag4                Lag5              Volume           Today          
##  Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
##  1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
##  Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
##  Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
##  3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
##  Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
##  Direction 
##  Down:602  
##  Up  :648  
##            
##            
##            
##

3.3 Correlation Matrix

The Direction column is a factor, so we exclude it before computing correlations. The lag variables are nearly uncorrelated with each other and with Today: every one of those correlations is below 0.05 in absolute value. The only substantial correlation is between Year and Volume (0.539), suggesting trading volume has increased over time.

Smarket %>%
  select(-Direction) %>%
  cor() %>%
  round(4)

##          Year    Lag1    Lag2    Lag3    Lag4    Lag5  Volume   Today
## Year   1.0000  0.0297  0.0306  0.0332  0.0357  0.0298  0.5390  0.0301
## Lag1   0.0297  1.0000 -0.0263 -0.0108 -0.0030 -0.0057  0.0409 -0.0262
## Lag2   0.0306 -0.0263  1.0000 -0.0259 -0.0109 -0.0036 -0.0434 -0.0103
## Lag3   0.0332 -0.0108 -0.0259  1.0000 -0.0241 -0.0188 -0.0418 -0.0024
## Lag4   0.0357 -0.0030 -0.0109 -0.0241  1.0000 -0.0271 -0.0484 -0.0069
## Lag5   0.0298 -0.0057 -0.0036 -0.0188 -0.0271  1.0000 -0.0220 -0.0349
## Volume 0.5390  0.0409 -0.0434 -0.0418 -0.0484 -0.0220  1.0000  0.0146
## Today  0.0301 -0.0262 -0.0103 -0.0024 -0.0069 -0.0349  0.0146  1.0000

3.4 Volume Over Time

We can see the upward trend in volume by plotting it:

Smarket %>%
  mutate(index = row_number()) %>%
  ggplot(aes(x = index, y = Volume)) +
  geom_line(colour = "steelblue") +
  labs(
    title = "Trading Volume Over Time",
    x     = "Day Index",
    y     = "Volume (billions of shares)"
  ) +
  theme_bw()
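As a numeric complement to the plot, the trend can also be summarised per year. This is a small sketch rather than part of the original lab; the column names n_days and mean_volume are chosen here for illustration:

```r
library(dplyr)
library(ISLR)

# Number of trading days and average volume for each calendar year.
volume_by_year <- Smarket %>%
  group_by(Year) %>%
  summarise(
    n_days      = n(),
    mean_volume = mean(Volume)
  )

volume_by_year
```

If volume rises over time, mean_volume should generally increase from 2001 to 2005, in line with the Year and Volume correlation reported earlier.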

4 Section 4.2 — Logistic Regression

4.1 Model Specification

In tidymodels, we define a logistic regression model using parsnip. The logistic_reg() function with set_engine("glm") uses R’s built-in glm() function under the hood.

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_spec

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

4.2 Fit on the Full Data Set

We first fit the model using all five lag variables and volume on the complete Smarket data set.

lr_fit <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket
  )

lr_fit

## parsnip model object
## 
## 
## Call:  stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
##   -0.126000    -0.073074    -0.042301     0.011085     0.009359     0.010313  
##      Volume  
##    0.135441  
## 
## Degrees of Freedom: 1249 Total (i.e. Null);  1243 Residual
## Null Deviance:       1731 
## Residual Deviance: 1728  AIC: 1742
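The Call line in the output shows that parsnip hands the work to stats::glm() with a binomial family. One way to confirm this, sketched below, is to fit the same formula directly with glm() and compare coefficients with the engine object returned by extract_fit_engine(). The name glm_direct is illustrative:

```r
library(tidymodels)
library(ISLR)

# Same specification and fit as in the text.
lr_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket)

# Direct base-R fit of the same model.
glm_direct <- glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  family = binomial,
  data   = Smarket
)

# extract_fit_engine() returns the underlying glm object from the parsnip fit.
all.equal(coef(extract_fit_engine(lr_fit)), coef(glm_direct))
```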

4.2.1 Tidy Coefficient Table

tidy(lr_fit)

## # A tibble: 7 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.126      0.241     -0.523   0.601
## 2 Lag1        -0.0731     0.0502    -1.46    0.145
## 3 Lag2        -0.0423     0.0501    -0.845   0.398
## 4 Lag3         0.0111     0.0499     0.222   0.824
## 5 Lag4         0.00936    0.0500     0.187   0.851
## 6 Lag5         0.0103     0.0495     0.208   0.835
## 7 Volume       0.135      0.158      0.855   0.392

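Interpretation: None of the predictors has a statistically significant p-value. The smallest, 0.145 for Lag1, is still well above 0.05, so there is no clear evidence that the lag variables or volume are associated with market direction. The negative Lag1 coefficient suggests that a positive return yesterday makes an Up day slightly less likely today, but the effect is too weak to distinguish from noise.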
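Logistic regression coefficients are on the log-odds scale, with Up (the second factor level of Direction) as the modelled event. The sketch below converts the Lag1 estimate into an odds ratio and an approximate 95% Wald interval. The two input values are copied by hand from the coefficient table above, and the variable names are illustrative:

```r
# Values copied from the Lag1 row of the coefficient table.
estimate  <- -0.0731
std_error <-  0.0502

# Odds ratio: multiplicative change in the odds of "Up"
# for a one-unit increase in Lag1.
odds_ratio <- exp(estimate)

# Approximate 95% Wald interval on the log-odds scale,
# then transformed to the odds scale.
ci_log_odds <- estimate + c(-1, 1) * qnorm(0.975) * std_error
ci_odds     <- exp(ci_log_odds)

round(odds_ratio, 3)
round(ci_odds, 3)
```

The odds ratio is about 0.93, and the interval on the log-odds scale spans zero (equivalently, the odds interval spans 1), which matches the non-significant p-value.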

4.2.2 In-Sample Predictions

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class)

##           Truth
## Prediction Down  Up
##       Down  145 141
##       Up    457 507

augment(lr_fit, new_data = Smarket) %>%
  accuracy(truth = Direction, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.522

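The in-sample accuracy is about 52%, barely above random chance (50%). Because the model is evaluated on the same data it was fit to, even this figure is likely optimistic.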
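A useful reference point for the 52% figure is the accuracy of always predicting the majority class. The sketch below recomputes accuracy from the confusion matrix counts above and compares it with that baseline. The counts are copied by hand from the output, and the object names are illustrative:

```r
# In-sample confusion matrix: rows are predictions, columns are the truth.
cm_full <- matrix(
  c(145, 457,   # Truth = Down: predicted Down, predicted Up
    141, 507),  # Truth = Up:   predicted Down, predicted Up
  nrow = 2,
  dimnames = list(Prediction = c("Down", "Up"), Truth = c("Down", "Up"))
)

# Accuracy: correct predictions (the diagonal) over all observations.
accuracy <- sum(diag(cm_full)) / sum(cm_full)

# Baseline: always predict the more common true class.
baseline <- max(colSums(cm_full)) / sum(cm_full)

round(c(accuracy = accuracy, baseline = baseline), 4)
```

The model's accuracy (652 of 1,250) exceeds the always-Up baseline (648 of 1,250) by only four observations.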

4.3 Train / Test Split (2001–2004 vs. 2005)

Rather than a random split, the lab splits by year: observations before 2005 form the training set, and 2005 observations form the test set.

Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test  <- Smarket %>% filter(Year == 2005)

nrow(Smarket_train)  # Training set size

## [1] 998

nrow(Smarket_test)   # Test set size

## [1] 252

4.4 Fit on Training Data

lr_fit2 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket_train
  )

tidy(lr_fit2)

## # A tibble: 7 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  0.191      0.334     0.573    0.567
## 2 Lag1        -0.0542     0.0518   -1.05     0.295
## 3 Lag2        -0.0458     0.0518   -0.884    0.377
## 4 Lag3         0.00720    0.0516    0.139    0.889
## 5 Lag4         0.00644    0.0517    0.125    0.901
## 6 Lag5        -0.00422    0.0511   -0.0826   0.934
## 7 Volume      -0.116      0.240    -0.485    0.628

4.4.1 Evaluate on Test Data (2005)

augment(lr_fit2, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)

##           Truth
## Prediction Down Up
##       Down   77 97
##       Up     34 44

augment(lr_fit2, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.480

The test accuracy is about 48%, slightly worse than the 50% expected from random guessing. This is not surprising: the lag variables show little predictive power, and including predictors with no real relationship to the response increases variance without a corresponding reduction in bias.
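With only 252 test observations, a gap of this size from 50% can easily arise by chance. As a rough check that is not part of the original lab, an exact binomial test can compare the number of correct predictions (77 + 44 = 121, taken from the confusion matrix above) with coin-flip accuracy. Treating the predictions as independent trials is an assumption made here for illustration:

```r
# Correct test-set predictions: the diagonal of the confusion matrix.
n_correct <- 77 + 44
n_test    <- 77 + 97 + 34 + 44

# Exact two-sided binomial test against a success probability of 0.5.
coin_flip_test <- binom.test(n_correct, n_test, p = 0.5)

n_correct / n_test       # observed test accuracy
coin_flip_test$p.value   # a large p-value means no evidence of a difference from 0.5
```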

4.5 Simplified Model: Only Lag1 and Lag2

Since Lag1 and Lag2 had the smallest (most suggestive) p-values in the training-set fit, although neither is close to significant, we refit the model using only those two predictors.

lr_fit3 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2,
    data = Smarket_train
  )

tidy(lr_fit3)

## # A tibble: 3 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   0.0322    0.0634     0.508   0.611
## 2 Lag1         -0.0556    0.0517    -1.08    0.282
## 3 Lag2         -0.0445    0.0517    -0.861   0.389

4.5.1 Confusion Matrix on Test Set

augment(lr_fit3, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)

##           Truth
## Prediction Down  Up
##       Down   35  35
##       Up     76 106

4.5.2 Accuracy on Test Set

augment(lr_fit3, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.560
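To put the 56% in context, it helps to compare it with the always-Up baseline on the 2005 data and to look at how often an Up prediction is correct. The sketch below works from the confusion matrix counts above, copied by hand; the object names are illustrative:

```r
# Test-set confusion matrix for the Lag1 + Lag2 model.
cm_reduced <- matrix(
  c(35,  76,    # Truth = Down: predicted Down, predicted Up
    35, 106),   # Truth = Up:   predicted Down, predicted Up
  nrow = 2,
  dimnames = list(Prediction = c("Down", "Up"), Truth = c("Down", "Up"))
)

# Overall accuracy of the model.
accuracy <- sum(diag(cm_reduced)) / sum(cm_reduced)

# Accuracy of always predicting Up: the share of true Up days.
baseline_up <- sum(cm_reduced[, "Up"]) / sum(cm_reduced)

# Of the days the model predicts Up, the share that actually went Up.
precision_up <- cm_reduced["Up", "Up"] / sum(cm_reduced["Up", ])

round(c(accuracy = accuracy, baseline_up = baseline_up, precision_up = precision_up), 3)
```

Both the model and the always-Up rule get 141 of 252 days right, while about 58% of the model's Up predictions (106 of 182) are correct.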

4.5.3 Predicted Probabilities (first 6 rows)

augment(lr_fit3, new_data = Smarket_test) %>%
  select(Direction, .pred_Down, .pred_Up, .pred_class) %>%
  head(6)

## # A tibble: 6 × 4
##   Direction .pred_Down .pred_Up .pred_class
##   <fct>          <dbl>    <dbl> <fct>      
## 1 Down           0.490    0.510 Up         
## 2 Down           0.479    0.521 Up         
## 3 Down           0.467    0.533 Up         
## 4 Up             0.474    0.526 Up         
## 5 Down           0.493    0.507 Up         
## 6 Up             0.494    0.506 Up
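The .pred_class column follows from the probabilities: an observation is labelled Up when .pred_Up exceeds 0.5. The sketch below reproduces that rule by hand and also applies a stricter cut-off. The value 0.52 is a hypothetical threshold chosen only for illustration, as are the new column names:

```r
library(tidymodels)
library(ISLR)

Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test  <- Smarket %>% filter(Year == 2005)

lr_fit3 <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train)

threshold <- 0.52  # hypothetical cut-off, not from the lab

preds <- augment(lr_fit3, new_data = Smarket_test) %>%
  mutate(
    # Default rule: Up when the predicted probability of Up exceeds 0.5.
    .pred_default = factor(if_else(.pred_Up > 0.5, "Up", "Down"),
                           levels = c("Down", "Up")),
    # Stricter rule: require more confidence before predicting Up.
    .pred_strict  = factor(if_else(.pred_Up > threshold, "Up", "Down"),
                           levels = c("Down", "Up"))
  )

# The hand-made 0.5 rule should agree with parsnip's class predictions.
all(preds$.pred_default == preds$.pred_class)

# How the stricter rule reshuffles the predicted classes.
preds %>% count(.pred_class, .pred_strict)
```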

4.5.4 Predicting for New Observations

We can also predict direction for specific hypothetical Lag1 and Lag2 values:

new_obs <- tibble(
  Lag1 = c(1.2, 1.5),
  Lag2 = c(1.1, -0.8)
)

predict(lr_fit3, new_data = new_obs, type = "prob")

## # A tibble: 2 × 2
##   .pred_Down .pred_Up
##        <dbl>    <dbl>
## 1      0.521    0.479
## 2      0.504    0.496
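These probabilities can be reproduced by hand from the fitted coefficients, which makes the logistic model explicit: the linear predictor is passed through the inverse logit, plogis() in base R. The coefficients below are the rounded values from the tidy(lr_fit3) table, so the results match the output only to about three decimals:

```r
# Rounded coefficients from the Lag1 + Lag2 model.
b0 <-  0.0322
b1 <- -0.0556
b2 <- -0.0445

# The two hypothetical observations used above.
lag1 <- c(1.2, 1.5)
lag2 <- c(1.1, -0.8)

# Linear predictor (log-odds of Up), then inverse logit to get P(Up).
log_odds <- b0 + b1 * lag1 + b2 * lag2
p_up     <- plogis(log_odds)

round(data.frame(.pred_Down = 1 - p_up, .pred_Up = p_up), 3)
```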

5 Summary

Model	Data Used	Accuracy
Full model (all lags + Volume)	Full Smarket (in-sample)	~52%
Full model (all lags + Volume)	Train 2001–2004, Test 2005	~48%
Reduced model (Lag1 + Lag2)	Train 2001–2004, Test 2005	~56%

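The reduced logistic regression model using only Lag1 and Lag2 performs better on unseen 2005 data than the full model (about 56% versus 48% accuracy). However, 56% is also the share of Up days in 2005 (141 of 252), so the reduced model does no better than always predicting Up. Predicting daily stock market direction from recent returns is inherently difficult.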

6 Session Info

sessionInfo()

## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;  LAPACK version 3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] discrim_1.1.0      ISLR_1.4           yardstick_1.3.2    workflowsets_1.1.1
##  [5] workflows_1.3.0    tune_2.0.1         tidyr_1.3.2        tailor_0.1.0      
##  [9] rsample_1.3.2      recipes_1.3.1      purrr_1.2.1        parsnip_1.4.1     
## [13] modeldata_1.5.1    infer_1.1.0        ggplot2_4.0.2      dplyr_1.2.0       
## [17] dials_1.4.2        scales_1.4.0       broom_1.0.12       tidymodels_1.4.1  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1    timeDate_4052.112   farver_2.1.2       
##  [4] S7_0.2.1            fastmap_1.2.0       digest_0.6.39      
##  [7] rpart_4.1.24        timechange_0.4.0    lifecycle_1.0.5    
## [10] survival_3.8-3      magrittr_2.0.4      compiler_4.5.3     
## [13] rlang_1.1.7         sass_0.4.10         tools_4.5.3        
## [16] utf8_1.2.6          yaml_2.3.12         data.table_1.18.2.1
## [19] knitr_1.51          labeling_0.4.3      DiceDesign_1.10    
## [22] RColorBrewer_1.1-3  withr_3.0.2         nnet_7.3-20        
## [25] grid_4.5.3          sparsevctrs_0.3.6   future_1.70.0      
## [28] globals_0.19.1      MASS_7.3-65         cli_3.6.5          
## [31] rmarkdown_2.30      generics_0.1.4      rstudioapi_0.18.0  
## [34] future.apply_1.20.2 cachem_1.1.0        splines_4.5.3      
## [37] parallel_4.5.3      vctrs_0.7.1         hardhat_1.4.2      
## [40] Matrix_1.7-4        jsonlite_2.0.0      listenv_0.10.1     
## [43] gower_1.0.2         jquerylib_0.1.4     glue_1.8.0         
## [46] parallelly_1.46.1   codetools_0.2-20    lubridate_1.9.5    
## [49] gtable_0.3.6        GPfit_1.0-9         tibble_3.3.1       
## [52] pillar_1.11.1       furrr_0.3.1         htmltools_0.5.9    
## [55] ipred_0.9-15        lava_1.8.2          R6_2.6.1           
## [58] lhs_1.2.1           evaluate_1.0.5      lattice_0.22-7     
## [61] backports_1.5.0     bslib_0.10.0        class_7.3-23       
## [64] Rcpp_1.1.1          prodlim_2026.03.11  xfun_0.56          
## [67] pkgconfig_2.0.3

Chapter 4: Classification — Sections 4.1 & 4.2

ISLR Tidymodels Labs Replication

Minjin

2026-03-28