This document replicates Sections 4.1 and 4.2 of the ISLR tidymodels labs by Emil Hvitfeldt.
Section 4.1 explores the Smarket stock-market
dataset through exploratory analysis and correlation visualisation.
Section 4.2 fits logistic regression to predict market direction (Up / Down), first on
the full dataset and then with a proper train/test split by year.

library(tidymodels) # modelling framework
library(ISLR) # Smarket dataset
library(corrr) # correlation helpers
library(paletteer) # colour palettes for heatmap

The Smarket dataset contains daily percentage returns
for the S&P 500 over 2001-2005, together with Volume
(number of shares traded, in billions) and Direction (whether the market moved
Up or Down that day).
## Rows: 1,250
## Columns: 9
## $ Year <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1 <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2 <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3 <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4 <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5 <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
## 2 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
## 3 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
## 4 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
## 5 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
## 6 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
The response variable Direction is already a factor with
two levels, which is what parsnip classification engines require.
## Direction n
## 1 Down 602
## 2 Up 648
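The class counts above also fix a useful reference point: the accuracy of a rule that always predicts the more common class. The sketch below works it out in base R from the two counts in the table; the variable names are illustrative and not part of the original lab.

```r
# Class counts copied from the table above
direction_counts <- c(Down = 602, Up = 648)

# Share of each class among the 1,250 trading days
class_share <- direction_counts / sum(direction_counts)

# Accuracy of a rule that always predicts the majority class
baseline_accuracy <- max(class_share)
round(baseline_accuracy, 3)
```

Always predicting Up would be right on roughly 51.8% of days, so any model accuracy reported later is best read against that figure rather than against 50%.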
We remove Direction (categorical) before computing the
Pearson correlation matrix with corrr::correlate().
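The chunk that produces the matrix is not shown here. A minimal sketch of what it might look like, assuming the result is stored as `cor_Smarket` (the name used by the heatmap code later on):

```r
library(ISLR)   # Smarket
library(dplyr)  # select() and the pipe
library(corrr)  # correlate()

cor_Smarket <- Smarket %>%
  select(-Direction) %>%  # drop the only non-numeric column
  correlate()             # Pearson by default; the diagonal is set to NA

cor_Smarket
```

`correlate()` returns a tibble with one row per variable and a `term` column, which is why the output has 8 rows and 9 columns.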
## # A tibble: 8 × 9
## term Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Year NA 0.0297 0.0306 0.0332 0.0357 0.0298 0.539 0.0301
## 2 Lag1 0.0297 NA -0.0263 -0.0108 -0.00299 -0.00567 0.0409 -0.0262
## 3 Lag2 0.0306 -0.0263 NA -0.0259 -0.0109 -0.00356 -0.0434 -0.0103
## 4 Lag3 0.0332 -0.0108 -0.0259 NA -0.0241 -0.0188 -0.0418 -0.00245
## 5 Lag4 0.0357 -0.00299 -0.0109 -0.0241 NA -0.0271 -0.0484 -0.00690
## 6 Lag5 0.0298 -0.00567 -0.00356 -0.0188 -0.0271 NA -0.0220 -0.0349
## 7 Volume 0.539 0.0409 -0.0434 -0.0418 -0.0484 -0.0220 NA 0.0146
## 8 Today 0.0301 -0.0262 -0.0103 -0.00245 -0.00690 -0.0349 0.0146 NA
rplot() gives a quick visual summary. Colours run from
red (negative) through black (zero) to blue (positive).
Almost all pairs are close to zero – the variables are largely uncorrelated. The notable exception is Year x Volume.
cor_Smarket %>%
  stretch() %>%
  ggplot(aes(x, y, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(fashion(r)))) +
  scale_fill_paletteer_c("scico::roma", limits = c(-1, 1), direction = -1) +
  labs(
    title = "Correlation heatmap - Smarket variables",
    x = NULL, y = NULL, fill = "r"
  )

Plotting Volume over Year reveals an upward
trend – trading volume increased across the five years in the
dataset.
ggplot(Smarket, aes(Year, Volume)) +
  geom_jitter(height = 0, alpha = 0.4, colour = "steelblue") +
  labs(
    title = "Volume has increased over time",
    x = "Year",
    y = "Volume (billions of shares traded)"
  )

With parsnip, model building separates specification from
fitting. logistic_reg() defaults to the
"glm" engine and "classification" mode; the
set_*() calls below are explicit but redundant – included
for clarity.
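The specification chunk itself is not shown here. A minimal sketch of what it might look like, assuming the object is named `lr_spec` as in the fitting code further down:

```r
library(parsnip)  # attached automatically by library(tidymodels)

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%        # already the default engine
  set_mode("classification")   # the only mode logistic_reg() supports

lr_spec
```

Nothing is estimated at this point; the specification only records which model and engine to use once `fit()` is called.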
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
We predict Direction from the five lagged returns and
Volume.
lr_fit <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket
  )
lr_fit

## parsnip model object
##
##
## Call: stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000 -0.073074 -0.042301 0.011085 0.009359 0.010313
## Volume
## 0.135441
##
## Degrees of Freedom: 1249 Total (i.e. Null); 1243 Residual
## Null Deviance: 1731
## Residual Deviance: 1728 AIC: 1742
##
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
tidy() extracts the coefficient table as a tibble.
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.126 0.241 -0.523 0.601
## 2 Lag1 -0.0731 0.0502 -1.46 0.145
## 3 Lag2 -0.0423 0.0501 -0.845 0.398
## 4 Lag3 0.0111 0.0499 0.222 0.824
## 5 Lag4 0.00936 0.0500 0.187 0.851
## 6 Lag5 0.0103 0.0495 0.208 0.835
## 7 Volume 0.135 0.158 0.855 0.392
None of the p-values are near conventional significance levels – consistent with a market that is very hard to predict from its own recent returns.
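To make the table concrete, the sketch below recomputes the Lag1 row by hand in base R: the z statistic is the estimate divided by its standard error, and the p-value is the two-sided tail of a standard normal. The odds ratio is an extra interpretive step that is not in the original output.

```r
# Estimate and standard error for Lag1, copied from the summary output
estimate  <- -0.073074
std_error <-  0.050167

# Wald z statistic and its two-sided p-value under a standard normal
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))

# Multiplicative change in the odds of "Up" per one-unit increase in Lag1
odds_ratio <- exp(estimate)

round(c(z = z_value, p = p_value, odds_ratio = odds_ratio), 3)
```

The values reproduce the -1.457 and 0.145 shown above. The odds ratio of about 0.93 says that a one-point higher return yesterday lowers the estimated odds of an Up day by roughly 7%, an effect too small relative to its standard error to distinguish from zero.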
## # A tibble: 1,250 × 1
## .pred_class
## <fct>
## 1 Up
## 2 Down
## 3 Down
## 4 Up
## 5 Up
## 6 Up
## 7 Down
## 8 Up
## 9 Up
## 10 Down
## # ℹ 1,240 more rows
## # A tibble: 1,250 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.493 0.507
## 2 0.519 0.481
## 3 0.519 0.481
## 4 0.485 0.515
## 5 0.489 0.511
## 6 0.493 0.507
## 7 0.507 0.493
## 8 0.491 0.509
## 9 0.482 0.518
## 10 0.511 0.489
## # ℹ 1,240 more rows
We get one column per class (.pred_Down,
.pred_Up). This becomes especially useful in multi-class
settings.
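The link between the fitted coefficients and these probabilities can be checked by hand. The base R sketch below takes the coefficients and the first row of Smarket from the output above, applies the inverse logit, and then derives a class label. The 0.5 cut-off is assumed here; it is consistent with the ten predictions shown, but it is not stated in the original.

```r
# Coefficients of the full model, copied from the printed fit
beta <- c(intercept = -0.126000, Lag1 = -0.073074, Lag2 = -0.042301,
          Lag3 = 0.011085, Lag4 = 0.009359, Lag5 = 0.010313,
          Volume = 0.135441)

# First row of Smarket, preceded by 1 for the intercept
x_first <- c(1, 0.381, -0.192, -2.624, -1.055, 5.010, 1.1913)

# Linear predictor (log-odds of "Up"), then the inverse logit
log_odds  <- sum(beta * x_first)
pred_up   <- plogis(log_odds)
pred_down <- 1 - pred_up

# Hard class prediction, assuming a 0.5 threshold on the "Up" probability
pred_class <- ifelse(pred_up > 0.5, "Up", "Down")

round(c(.pred_Down = pred_down, .pred_Up = pred_up), 3)
pred_class
```

This gives 0.493 and 0.507 with class Up, matching the first row of both prediction tables.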
## Truth
## Prediction Down Up
## Down 145 141
## Up 457 507
augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(title = "Confusion matrix - full dataset (training accuracy)")

## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.522
Accuracy of ~52% – barely better than random, and evaluated on the same data used for training (optimistic bias).
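The accuracy figure follows directly from the confusion matrix: correct predictions sit on the diagonal. A short base R sketch using the four counts printed above (the object names are illustrative):

```r
# Confusion matrix from the output above: rows = prediction, columns = truth
conf_full <- matrix(
  c(145, 457,   # truth = Down: predicted Down, predicted Up
    141, 507),  # truth = Up:   predicted Down, predicted Up
  nrow = 2,
  dimnames = list(Prediction = c("Down", "Up"), Truth = c("Down", "Up"))
)

# Accuracy = correct predictions (diagonal) / all predictions
accuracy_full <- sum(diag(conf_full)) / sum(conf_full)
round(accuracy_full, 3)
```

That is 652 correct out of 1,250, or 0.522, the value reported by the accuracy metric.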
A more realistic evaluation trains on past data and tests on future data, mimicking real deployment.
Smarket_train <- Smarket %>% filter(Year != 2005) # 2001-2004
Smarket_test <- Smarket %>% filter(Year == 2005) # 2005
cat("Training rows:", nrow(Smarket_train), "\n")
cat("Test rows :", nrow(Smarket_test), "\n")

## Training rows: 998
## Test rows : 252
lr_fit2 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket_train
  )

## Truth
## Prediction Down Up
## Down 77 97
## Up 34 44
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.480
Accuracy drops to ~48% on new data – worse than random! The model picked up noise rather than signal.
Lag3-Lag5 and Volume had high p-values. Removing them reduces variance without much increase in bias.
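The chunk that fits and evaluates the reduced model is not shown here. A self-contained sketch of what it might look like, assuming the fit is stored as `lr_fit3` (the name used in the prediction code later); `test_results` is an illustrative name:

```r
library(tidymodels)
library(ISLR)

# Same year-based split as above
Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test  <- Smarket %>% filter(Year == 2005)

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Reduced model: only the two most recent lagged returns
lr_fit3 <- lr_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train)

# Attach predictions to the held-out 2005 data, then summarise
test_results <- augment(lr_fit3, new_data = Smarket_test)

test_results %>% conf_mat(truth = Direction, estimate = .pred_class)
test_results %>% accuracy(truth = Direction, estimate = .pred_class)
```

Reusing `lr_spec` with a different formula is the point of separating specification from fitting: only the `fit()` call changes.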
## Truth
## Prediction Down Up
## Down 35 35
## Up 76 106
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.560
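Accuracy rises to 56%, up from 48% with all six predictors, so dropping the noisy predictors helps. The gain is modest, though: 141 of the 252 test days (56%) were Up days, so always predicting Up would score the same.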
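Overall accuracy hides how the model does on each kind of prediction. The base R sketch below splits the two-predictor confusion matrix by predicted class, using only the four counts printed above; the object names are illustrative.

```r
# Two-predictor model on the 2005 test set: rows = prediction, columns = truth
conf_two <- matrix(
  c(35, 76,    # truth = Down: predicted Down, predicted Up
    35, 106),  # truth = Up:   predicted Down, predicted Up
  nrow = 2,
  dimnames = list(Prediction = c("Down", "Up"), Truth = c("Down", "Up"))
)

# Overall accuracy on the test set
accuracy_two <- sum(diag(conf_two)) / sum(conf_two)

# Share of correct calls within each predicted class (row-wise)
correct_when_pred <- diag(conf_two) / rowSums(conf_two)

round(accuracy_two, 3)
round(correct_when_pred, 3)
```

On days when the model predicts Up it is right about 58% of the time (106 of 182), while its Down calls are right exactly half the time (35 of 70).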
What does the model predict for two hypothetical trading days?
| Scenario | Lag1 | Lag2 |
|---|---|---|
| A | 1.2 | 1.1 |
| B | 1.5 | -0.8 |
Smarket_new <- tibble(
  Lag1 = c(1.2, 1.5),
  Lag2 = c(1.1, -0.8)
)
predict(lr_fit3, new_data = Smarket_new, type = "prob")

## # A tibble: 2 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.521 0.479
## 2 0.504 0.496
Both scenarios give a slightly higher probability of
Down (52.1% for A, 50.4% for B), reflecting the difficulty of predicting market
direction from lagged returns alone.
| Section | Key finding |
|---|---|
| 4.1 | Most Smarket variables are uncorrelated; Volume shows a clear upward trend over time. |
| 4.2 (full data) | Logistic regression with all six predictors: ~52% training accuracy – essentially random. |
| 4.2 (train/test, 6 preds) | Evaluated on held-out 2005 data: accuracy falls to ~48%, suggesting the model learned noise. |
| 4.2 (train/test, 2 preds) | Dropping Lag3-Lag5 and Volume improves test accuracy to 56% by reducing variance. |
The core lesson: markets are hard to predict, and including irrelevant predictors can hurt out-of-sample performance.
Replicated from ISLR tidymodels labs – Chapter 4