This document replicates Sections 4.1 and 4.2 of the ISLR Tidymodels Labs — Chapter 4 Classification.
We use the tidymodels framework to explore the
Smarket data set and fit logistic regression models for
predicting stock market direction.

library(tidymodels) # Core modelling framework
library(ISLR) # Contains the Smarket data set
library(discrim) # For LDA / QDA / Naive Bayes (needed later)

The Smarket data set records percentage
returns for the S&P 500 stock index over 1,250 trading days
from 2001 to 2005. Variables include:
- Year: the year of the observation
- Lag1 through Lag5: percentage returns for each of the five previous days
- Volume: number of shares traded the previous day (in billions)
- Today: the percentage return on the current day
- Direction: whether the market went Up or Down that day (response)

## Rows: 1,250
## Columns: 9
## $ Year <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1 <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2 <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3 <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4 <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5 <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
## Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
## Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
## Lag4 Lag5 Volume Today
## Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
## 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
## Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
## Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
## 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
## Direction
## Down:602
## Up :648
##
##
##
##
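The two printouts above have the shape of `glimpse()` and `summary()` output. A minimal sketch that should reproduce them, assuming those are the calls that were used (the calls themselves are not shown in this document):

```r
library(tidymodels)  # attaches dplyr, which provides glimpse()
library(ISLR)        # provides the Smarket data frame

# Compact column-by-column overview: name, type, first few values
glimpse(Smarket)

# Five-number summaries for numeric columns, counts for the factor
summary(Smarket)
```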
The Direction column is a factor, so we exclude it
before computing correlations. Notice that most lag variables are nearly
uncorrelated with each other. The one notable correlation is between
Year and Volume, suggesting trading volume has
increased over time.
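One way to compute such a correlation matrix is sketched here with base R's `cor()` after dropping the factor column. This is an assumption about the approach, not necessarily the call used to produce the output in this document, and the name `Smarket_cor` is introduced only for illustration.

```r
library(tidymodels)
library(ISLR)

# Drop the non-numeric response, then compute pairwise Pearson correlations
Smarket_cor <- Smarket %>%
  dplyr::select(-Direction) %>%
  cor()

# Round for readability
round(Smarket_cor, 4)
```

The only off-diagonal entry of any size should be the one between Year and Volume.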
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## Year 1.0000 0.0297 0.0306 0.0332 0.0357 0.0298 0.5390 0.0301
## Lag1 0.0297 1.0000 -0.0263 -0.0108 -0.0030 -0.0057 0.0409 -0.0262
## Lag2 0.0306 -0.0263 1.0000 -0.0259 -0.0109 -0.0036 -0.0434 -0.0103
## Lag3 0.0332 -0.0108 -0.0259 1.0000 -0.0241 -0.0188 -0.0418 -0.0024
## Lag4 0.0357 -0.0030 -0.0109 -0.0241 1.0000 -0.0271 -0.0484 -0.0069
## Lag5 0.0298 -0.0057 -0.0036 -0.0188 -0.0271 1.0000 -0.0220 -0.0349
## Volume 0.5390 0.0409 -0.0434 -0.0418 -0.0484 -0.0220 1.0000 0.0146
## Today 0.0301 -0.0262 -0.0103 -0.0024 -0.0069 -0.0349 0.0146 1.0000
In tidymodels, we define a logistic regression model
using parsnip. The logistic_reg() function
with set_engine("glm") uses R’s built-in glm()
function under the hood.
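A minimal sketch of such a specification follows. The name `lr_spec` matches the object used in the fitting code later in this document; setting the mode explicitly is optional here, since classification is the only mode `logistic_reg()` supports.

```r
library(tidymodels)

# Model specification only: no data is involved until fit() is called
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_spec
```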
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
We first fit the model using all five lag variables and
volume on the complete Smarket data set.
lr_fit <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket
  )
lr_fit
## parsnip model object
##
##
## Call: stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000 -0.073074 -0.042301 0.011085 0.009359 0.010313
## Volume
## 0.135441
##
## Degrees of Freedom: 1249 Total (i.e. Null); 1243 Residual
## Null Deviance: 1731
## Residual Deviance: 1728 AIC: 1742
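The coefficient table that follows has the shape of broom's `tidy()` output for a fitted model. A sketch of how it could be produced, assuming `lr_spec` is the glm-based specification described earlier; the name `lr_coefs` is introduced here only for illustration:

```r
library(tidymodels)
library(ISLR)

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_fit <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket
  )

# One row per term: estimate, standard error, z statistic, p-value
lr_coefs <- tidy(lr_fit)
lr_coefs
```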
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.126 0.241 -0.523 0.601
## 2 Lag1 -0.0731 0.0502 -1.46 0.145
## 3 Lag2 -0.0423 0.0501 -0.845 0.398
## 4 Lag3 0.0111 0.0499 0.222 0.824
## 5 Lag4 0.00936 0.0500 0.187 0.851
## 6 Lag5 0.0103 0.0495 0.208 0.835
## 7 Volume 0.135 0.158 0.855 0.392
Interpretation: None of the predictors has a statistically significant p-value (the smallest, for Lag1, is 0.145, and all are well above 0.05), so there is no clear evidence that the lag variables or volume are associated with market direction.
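A confusion matrix and accuracy on the training data can be obtained by augmenting the data with predictions and passing the result to yardstick. The sketch below assumes the same full-data fit as above; `lr_preds` and `lr_acc` are illustrative names, not objects from the original lab.

```r
library(tidymodels)
library(ISLR)

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_fit <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket
  )

# augment() appends .pred_class, .pred_Down and .pred_Up to the data
lr_preds <- augment(lr_fit, new_data = Smarket)

# Cross-tabulate predicted class against the true class
conf_mat(lr_preds, truth = Direction, estimate = .pred_class)

# Share of observations classified correctly
lr_acc <- accuracy(lr_preds, truth = Direction, estimate = .pred_class)
lr_acc
```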
## Truth
## Prediction Down Up
## Down 145 141
## Up 457 507
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.522
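The in-sample accuracy is about 52%, barely above random chance (50%) and about the same as the 51.8% obtained by always predicting Up (648 of 1,250 days). Because the model is evaluated on the same data it was fit to, this figure is also likely to be optimistic.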
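To put an accuracy figure in context, it can help to compare it with a classifier that always predicts the more common class. The following is a small sketch of that baseline, using only the class counts from the summary output; `class_share` and `baseline_acc` are names chosen here for illustration.

```r
library(ISLR)

# Share of days in each class
class_share <- prop.table(table(Smarket$Direction))
class_share

# Accuracy of always predicting the majority class
baseline_acc <- max(class_share)
baseline_acc
```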
Rather than a random split, the lab splits by year: observations before 2005 form the training set, and 2005 observations form the test set.
Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test <- Smarket %>% filter(Year == 2005)
nrow(Smarket_train) # Training set size
## [1] 998
nrow(Smarket_test) # Test set size
## [1] 252
lr_fit2 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket_train
  )
tidy(lr_fit2)
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.191 0.334 0.573 0.567
## 2 Lag1 -0.0542 0.0518 -1.05 0.295
## 3 Lag2 -0.0458 0.0518 -0.884 0.377
## 4 Lag3 0.00720 0.0516 0.139 0.889
## 5 Lag4 0.00644 0.0517 0.125 0.901
## 6 Lag5 -0.00422 0.0511 -0.0826 0.934
## 7 Volume -0.116 0.240 -0.485 0.628
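The held-out evaluation can be sketched the same way as the in-sample one, except that predictions are made on the 2005 observations the model never saw. The names `lr_test_preds` and `lr_test_acc` are illustrative; the split and the fit follow the code shown earlier.

```r
library(tidymodels)
library(ISLR)

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Time-based split: fit on 2001-2004, evaluate on 2005
Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test <- Smarket %>% filter(Year == 2005)

lr_fit2 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket_train
  )

# Predict on the held-out year only
lr_test_preds <- augment(lr_fit2, new_data = Smarket_test)

conf_mat(lr_test_preds, truth = Direction, estimate = .pred_class)

lr_test_acc <- accuracy(lr_test_preds, truth = Direction, estimate = .pred_class)
lr_test_acc
```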
## Truth
## Prediction Down Up
## Down 77 97
## Up 34 44
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.480
The test accuracy drops to 48%, which is worse than random guessing. This is not surprising: none of the predictors is significant in the training fit, and including predictors with no real relationship to the response adds variance to the fit without a corresponding reduction in bias.
Since Lag1 and Lag2 had the smallest (most
suggestive) p-values, we refit the model using only those two
predictors.
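A sketch of the reduced fit and its evaluation on the 2005 test set follows. The object name `lr_fit3` matches the one used in the prediction code later in this document; `lr_fit3_preds` and `lr_fit3_acc` are illustrative names introduced here.

```r
library(tidymodels)
library(ISLR)

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test <- Smarket %>% filter(Year == 2005)

# Refit on the training years with only the two most recent lags
lr_fit3 <- lr_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train)

tidy(lr_fit3)

# Evaluate on the held-out 2005 observations
lr_fit3_preds <- augment(lr_fit3, new_data = Smarket_test)

conf_mat(lr_fit3_preds, truth = Direction, estimate = .pred_class)

lr_fit3_acc <- accuracy(lr_fit3_preds, truth = Direction, estimate = .pred_class)
lr_fit3_acc
```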
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0322 0.0634 0.508 0.611
## 2 Lag1 -0.0556 0.0517 -1.08 0.282
## 3 Lag2 -0.0445 0.0517 -0.861 0.389
## Truth
## Prediction Down Up
## Down 35 35
## Up 76 106
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.560
augment(lr_fit3, new_data = Smarket_test) %>%
  select(Direction, .pred_Down, .pred_Up, .pred_class) %>%
  head(6)
## # A tibble: 6 × 4
## Direction .pred_Down .pred_Up .pred_class
## <fct> <dbl> <dbl> <fct>
## 1 Down 0.490 0.510 Up
## 2 Down 0.479 0.521 Up
## 3 Down 0.467 0.533 Up
## 4 Up 0.474 0.526 Up
## 5 Down 0.493 0.507 Up
## 6 Up 0.494 0.506 Up
We can also predict direction for specific hypothetical Lag1 and Lag2 values:
new_obs <- tibble(
  Lag1 = c(1.2, 1.5),
  Lag2 = c(1.1, -0.8)
)
predict(lr_fit3, new_data = new_obs, type = "prob")
## # A tibble: 2 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.521 0.479
## 2 0.504 0.496
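These probabilities can be checked by hand from the reduced model's coefficient table: the linear predictor is the intercept plus each coefficient times its lag value, and the inverse logit of that gives the probability of Up. The sketch below uses the rounded estimates printed earlier (0.0322, -0.0556, -0.0445), so it agrees with `predict()` only to about three decimal places; `b`, `eta`, and `p_up` are illustrative names.

```r
# Rounded coefficients from the reduced model's tidy() output
b <- c(intercept = 0.0322, Lag1 = -0.0556, Lag2 = -0.0445)

new_obs <- data.frame(
  Lag1 = c(1.2, 1.5),
  Lag2 = c(1.1, -0.8)
)

# Linear predictor on the log-odds scale
eta <- b[["intercept"]] + b[["Lag1"]] * new_obs$Lag1 + b[["Lag2"]] * new_obs$Lag2

# Inverse logit: P(Direction = Up)
p_up <- plogis(eta)
round(p_up, 3)
# → 0.479 0.496
```

Both values are just under 0.5, which is why both hypothetical days would be classified as Down.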
| Model | Data Used | Accuracy |
|---|---|---|
| Full model (all lags + Volume) | Full Smarket | ~52% (in-sample) |
| Full model (all lags + Volume) | Train 2001–2004, Test 2005 | ~48% |
| Reduced model (Lag1 + Lag2) | Train 2001–2004, Test 2005 | ~56% |
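The reduced logistic regression model using only Lag1
and Lag2 performs better on unseen 2005 data (about 56% accuracy
versus 48% for the full model). However, performance is modest overall:
141 of the 252 test days were Up, so always predicting Up would also
score about 56%. Predicting daily stock market direction from recent
returns is inherently difficult.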
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] discrim_1.1.0 ISLR_1.4 yardstick_1.3.2 workflowsets_1.1.1
## [5] workflows_1.3.0 tune_2.0.1 tidyr_1.3.2 tailor_0.1.0
## [9] rsample_1.3.2 recipes_1.3.1 purrr_1.2.1 parsnip_1.4.1
## [13] modeldata_1.5.1 infer_1.1.0 ggplot2_4.0.2 dplyr_1.2.0
## [17] dials_1.4.2 scales_1.4.0 broom_1.0.12 tidymodels_1.4.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4052.112 farver_2.1.2
## [4] S7_0.2.1 fastmap_1.2.0 digest_0.6.39
## [7] rpart_4.1.24 timechange_0.4.0 lifecycle_1.0.5
## [10] survival_3.8-3 magrittr_2.0.4 compiler_4.5.3
## [13] rlang_1.1.7 sass_0.4.10 tools_4.5.3
## [16] utf8_1.2.6 yaml_2.3.12 data.table_1.18.2.1
## [19] knitr_1.51 labeling_0.4.3 DiceDesign_1.10
## [22] RColorBrewer_1.1-3 withr_3.0.2 nnet_7.3-20
## [25] grid_4.5.3 sparsevctrs_0.3.6 future_1.70.0
## [28] globals_0.19.1 MASS_7.3-65 cli_3.6.5
## [31] rmarkdown_2.30 generics_0.1.4 rstudioapi_0.18.0
## [34] future.apply_1.20.2 cachem_1.1.0 splines_4.5.3
## [37] parallel_4.5.3 vctrs_0.7.1 hardhat_1.4.2
## [40] Matrix_1.7-4 jsonlite_2.0.0 listenv_0.10.1
## [43] gower_1.0.2 jquerylib_0.1.4 glue_1.8.0
## [46] parallelly_1.46.1 codetools_0.2-20 lubridate_1.9.5
## [49] gtable_0.3.6 GPfit_1.0-9 tibble_3.3.1
## [52] pillar_1.11.1 furrr_0.3.1 htmltools_0.5.9
## [55] ipred_0.9-15 lava_1.8.2 R6_2.6.1
## [58] lhs_1.2.1 evaluate_1.0.5 lattice_0.22-7
## [61] backports_1.5.0 bslib_0.10.0 class_7.3-23
## [64] Rcpp_1.1.1 prodlim_2026.03.11 xfun_0.56
## [67] pkgconfig_2.0.3