Setup

Load required packages. Install any that are missing with install.packages() or via CRAN.

library(tidymodels)   # Core modelling framework
library(ISLR)         # Provides the Smarket dataset
library(ISLR2)        # Provides the Bikeshare dataset (used later)
library(discrim)      # LDA / QDA / Naive Bayes via parsnip
library(poissonreg)   # Poisson regression via parsnip
library(corrr)        # Tidy correlation matrices
library(paletteer)    # Extra colour palettes

4.1 The Stock Market Data

We examine the Smarket dataset. It contains percentage returns for the S&P 500 over 1,250 trading days (2001–2005), together with a factor variable Direction that records whether the market went Up or Down on each day.

Correlation matrix (numeric variables only)

Because Direction is a factor we drop it before computing correlations.

cor_Smarket <- Smarket %>%
  select(-Direction) %>%
  correlate()

cor_Smarket

Quick correlation plot

rplot(cor_Smarket, colours = c("indianred2", "black", "skyblue1"))

Most variables are nearly uncorrelated with each other. The clearest exception is Year and Volume, which share a noticeable positive association.

Heatmap-style correlation chart

cor_Smarket %>%
  stretch() %>%
  ggplot(aes(x, y, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(fashion(r)))) +
  scale_fill_gradient2(low = "indianred2", mid = "white", high = "skyblue1",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(
    title = "Correlation heatmap — Smarket (numeric variables)",
    x = NULL, y = NULL, fill = "r"
  ) +
  theme_minimal()

Volume vs. Year

Plotting Volume over Year confirms the upward trend visible in the correlation matrix.

ggplot(Smarket, aes(Year, Volume)) +
  geom_jitter(height = 0, alpha = 0.4, colour = "steelblue") +
  labs(
    title = "Trading volume increases with time",
    x = "Year",
    y = "Volume (billions of shares traded)"
  ) +
  theme_bw()


4.2 Logistic Regression

Model specification

We use logistic_reg() from parsnip. The default engine is "glm" and the default mode is "classification", so the two set_*() calls below are shown for explicitness.

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

Fit on the full dataset (training = test — for illustration only)

We predict Direction from the five lagged returns plus Volume. Parsnip requires the response to be a factor, which Smarket$Direction already is.

lr_fit <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket
  )

lr_fit
## parsnip model object
## 
## 
## Call:  stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
##   -0.126000    -0.073074    -0.042301     0.011085     0.009359     0.010313  
##      Volume  
##    0.135441  
## 
## Degrees of Freedom: 1249 Total (i.e. Null);  1243 Residual
## Null Deviance:       1731 
## Residual Deviance: 1728  AIC: 1742

Full GLM summary

lr_fit %>%
  pluck("fit") %>%
  summary()
## 
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

Tidy coefficient table

tidy(lr_fit)

None of the predictors are statistically significant at the 5 % level, which suggests this model has limited predictive power.

Predictions (class labels)

predict(lr_fit, new_data = Smarket)

Predictions (probabilities)

predict(lr_fit, new_data = Smarket, type = "prob")

Confusion matrix (in-sample)

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down  Up
##       Down  145 141
##       Up    457 507
augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(title = "Confusion matrix — full Smarket (in-sample)")

Accuracy (in-sample)

augment(lr_fit, new_data = Smarket) %>%
  accuracy(truth = Direction, estimate = .pred_class)

Accuracy is only ~52 %, barely above chance.


Proper train / test split (temporal)

Because these are time-series data, we split by year rather than randomly. We train on years 2001–2004 and evaluate on 2005.

Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test  <- Smarket %>% filter(Year == 2005)

cat("Training rows:", nrow(Smarket_train), "\n")
## Training rows: 998
cat("Test rows    :", nrow(Smarket_test),  "\n")
## Test rows    : 252

Fit on training data (all six predictors)

lr_fit2 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket_train
  )

Evaluate on test data

augment(lr_fit2, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down Up
##       Down   77 97
##       Up     34 44
augment(lr_fit2, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)

Test accuracy drops to ~48 %, worse than the in-sample figure — a sign that the model is not generalising well.


Reduced model: only Lag1 and Lag2

Because Lag1 and Lag2 had the smallest (least insignificant) p-values in the full model, we try a reduced specification.

lr_fit3 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2,
    data = Smarket_train
  )

augment(lr_fit3, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down  Up
##       Down   35  35
##       Up     76 106
augment(lr_fit3, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)

Accuracy improves to ~56 %. Removing the noisy predictors reduced variance enough to outweigh any loss in bias.


Predicting for new observations

Predict the probability that the market goes Up or Down on two hypothetical days.

Smarket_new <- tibble(
  Lag1 = c(1.2, 1.5),
  Lag2 = c(1.1, -0.8)
)

predict(
  lr_fit3,
  new_data = Smarket_new,
  type     = "prob"
)

On both days the model assigns a higher probability to Down than Up, suggesting a predicted decline when recent returns have been positive.


Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Asia/Irkutsk
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] paletteer_1.7.0    corrr_0.4.5        poissonreg_1.0.1   discrim_1.1.0     
##  [5] ISLR2_1.3-2        ISLR_1.4           yardstick_1.3.2    workflowsets_1.1.1
##  [9] workflows_1.3.0    tune_2.0.1         tidyr_1.3.1        tailor_0.1.0      
## [13] rsample_1.3.1      recipes_1.3.1      purrr_1.1.0        parsnip_1.4.1     
## [17] modeldata_1.5.1    infer_1.1.0        ggplot2_4.0.0      dplyr_1.1.4       
## [21] dials_1.4.2        scales_1.4.0       broom_1.0.10       tidymodels_1.4.1  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1    timeDate_4051.111   farver_2.1.2       
##  [4] S7_0.2.0            fastmap_1.2.0       digest_0.6.37      
##  [7] rpart_4.1.24        timechange_0.3.0    lifecycle_1.0.4    
## [10] survival_3.8-3      magrittr_2.0.3      compiler_4.5.1     
## [13] rlang_1.1.6         sass_0.4.10         tools_4.5.1        
## [16] yaml_2.3.10         data.table_1.17.8   knitr_1.50         
## [19] labeling_0.4.3      DiceDesign_1.10     RColorBrewer_1.1-3 
## [22] withr_3.0.2         nnet_7.3-20         grid_4.5.1         
## [25] sparsevctrs_0.3.4   future_1.67.0       globals_0.18.0     
## [28] MASS_7.3-65         cli_3.6.5           rmarkdown_2.29     
## [31] generics_0.1.4      rstudioapi_0.17.1   future.apply_1.20.0
## [34] cachem_1.1.0        splines_4.5.1       parallel_4.5.1     
## [37] vctrs_0.6.5         hardhat_1.4.2       Matrix_1.7-3       
## [40] jsonlite_2.0.0      listenv_0.9.1       gower_1.0.2        
## [43] jquerylib_0.1.4     glue_1.8.0          parallelly_1.45.1  
## [46] rematch2_2.1.2      codetools_0.2-20    lubridate_1.9.4    
## [49] gtable_0.3.6        GPfit_1.0-9         tibble_3.3.0       
## [52] pillar_1.11.1       furrr_0.3.1         htmltools_0.5.8.1  
## [55] ipred_0.9-15        lava_1.8.2          R6_2.6.1           
## [58] lhs_1.2.1           evaluate_1.0.5      lattice_0.22-7     
## [61] backports_1.5.0     bslib_0.9.0         class_7.3-23       
## [64] Rcpp_1.1.0          prodlim_2025.04.28  xfun_0.52          
## [67] pkgconfig_2.0.3