Load required packages. Install any that are missing with
install.packages() or via CRAN.
library(tidymodels) # Core modelling framework
library(ISLR) # Provides the Smarket dataset
library(ISLR2) # Provides the Bikeshare dataset (used later)
library(discrim) # LDA / QDA / Naive Bayes via parsnip
library(poissonreg) # Poisson regression via parsnip
library(corrr) # Tidy correlation matrices
library(paletteer) # Extra colour palettesWe examine the Smarket dataset. It contains percentage
returns for the S&P 500 over 1,250 trading days (2001–2005),
together with a factor variable Direction that records
whether the market went Up or Down on
each day.
Because Direction is a factor we drop it before
computing correlations.
Most variables are nearly uncorrelated with each other. The clearest
exception is Year and Volume, which share a
noticeable positive association.
cor_Smarket %>%
stretch() %>%
ggplot(aes(x, y, fill = r)) +
geom_tile() +
geom_text(aes(label = as.character(fashion(r)))) +
scale_fill_gradient2(low = "indianred2", mid = "white", high = "skyblue1",
midpoint = 0, limits = c(-1, 1)) +
labs(
title = "Correlation heatmap — Smarket (numeric variables)",
x = NULL, y = NULL, fill = "r"
) +
theme_minimal()Plotting Volume over Year confirms the
upward trend visible in the correlation matrix.
ggplot(Smarket, aes(Year, Volume)) +
geom_jitter(height = 0, alpha = 0.4, colour = "steelblue") +
labs(
title = "Trading volume increases with time",
x = "Year",
y = "Volume (billions of shares traded)"
) +
theme_bw()We use logistic_reg() from parsnip. The
default engine is "glm" and the default mode is
"classification", so the two set_*() calls
below are shown for explicitness.
We predict Direction from the five lagged returns plus
Volume. Parsnip requires the response to be a
factor, which Smarket$Direction already
is.
lr_fit <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket
)
lr_fit## parsnip model object
##
##
## Call: stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000 -0.073074 -0.042301 0.011085 0.009359 0.010313
## Volume
## 0.135441
##
## Degrees of Freedom: 1249 Total (i.e. Null); 1243 Residual
## Null Deviance: 1731
## Residual Deviance: 1728 AIC: 1742
##
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
None of the predictors are statistically significant at the 5 % level, which suggests this model has limited predictive power.
Because these are time-series data, we split by year rather than randomly. We train on years 2001–2004 and evaluate on 2005.
Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test <- Smarket %>% filter(Year == 2005)
cat("Training rows:", nrow(Smarket_train), "\n")## Training rows: 998
## Test rows : 252
## Truth
## Prediction Down Up
## Down 77 97
## Up 34 44
Test accuracy drops to ~48 %, worse than the in-sample figure — a sign that the model is not generalising well.
Because Lag1 and Lag2 had the smallest
(least insignificant) p-values in the full model, we try a reduced
specification.
lr_fit3 <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2,
data = Smarket_train
)
augment(lr_fit3, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)## Truth
## Prediction Down Up
## Down 35 35
## Up 76 106
Accuracy improves to ~56 %. Removing the noisy predictors reduced variance enough to outweigh any loss in bias.
Predict the probability that the market goes Up or Down on two hypothetical days.
Smarket_new <- tibble(
Lag1 = c(1.2, 1.5),
Lag2 = c(1.1, -0.8)
)
predict(
lr_fit3,
new_data = Smarket_new,
type = "prob"
)On both days the model assigns a higher probability to Down than Up, suggesting a predicted decline when recent returns have been positive.
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Asia/Irkutsk
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] paletteer_1.7.0 corrr_0.4.5 poissonreg_1.0.1 discrim_1.1.0
## [5] ISLR2_1.3-2 ISLR_1.4 yardstick_1.3.2 workflowsets_1.1.1
## [9] workflows_1.3.0 tune_2.0.1 tidyr_1.3.1 tailor_0.1.0
## [13] rsample_1.3.1 recipes_1.3.1 purrr_1.1.0 parsnip_1.4.1
## [17] modeldata_1.5.1 infer_1.1.0 ggplot2_4.0.0 dplyr_1.1.4
## [21] dials_1.4.2 scales_1.4.0 broom_1.0.10 tidymodels_1.4.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4051.111 farver_2.1.2
## [4] S7_0.2.0 fastmap_1.2.0 digest_0.6.37
## [7] rpart_4.1.24 timechange_0.3.0 lifecycle_1.0.4
## [10] survival_3.8-3 magrittr_2.0.3 compiler_4.5.1
## [13] rlang_1.1.6 sass_0.4.10 tools_4.5.1
## [16] yaml_2.3.10 data.table_1.17.8 knitr_1.50
## [19] labeling_0.4.3 DiceDesign_1.10 RColorBrewer_1.1-3
## [22] withr_3.0.2 nnet_7.3-20 grid_4.5.1
## [25] sparsevctrs_0.3.4 future_1.67.0 globals_0.18.0
## [28] MASS_7.3-65 cli_3.6.5 rmarkdown_2.29
## [31] generics_0.1.4 rstudioapi_0.17.1 future.apply_1.20.0
## [34] cachem_1.1.0 splines_4.5.1 parallel_4.5.1
## [37] vctrs_0.6.5 hardhat_1.4.2 Matrix_1.7-3
## [40] jsonlite_2.0.0 listenv_0.9.1 gower_1.0.2
## [43] jquerylib_0.1.4 glue_1.8.0 parallelly_1.45.1
## [46] rematch2_2.1.2 codetools_0.2-20 lubridate_1.9.4
## [49] gtable_0.3.6 GPfit_1.0-9 tibble_3.3.0
## [52] pillar_1.11.1 furrr_0.3.1 htmltools_0.5.8.1
## [55] ipred_0.9-15 lava_1.8.2 R6_2.6.1
## [58] lhs_1.2.1 evaluate_1.0.5 lattice_0.22-7
## [61] backports_1.5.0 bslib_0.9.0 class_7.3-23
## [64] Rcpp_1.1.0 prodlim_2025.04.28 xfun_0.52
## [67] pkgconfig_2.0.3