Chapter 4: Classification - Sections 4.1 and 4.2

Overview

This document replicates Sections 4.1 and 4.2 of the ISLR tidymodels labs by Emil Hvitfeldt.

4.1 explores the Smarket stock-market dataset through exploratory analysis and correlation visualisation.
4.2 fits logistic regression models to predict daily market direction (Up / Down), first on the full dataset and then with a proper train/test split by year.

Load Packages

library(tidymodels)   # modelling framework
library(ISLR)         # Smarket dataset
library(corrr)        # correlation helpers
library(paletteer)    # colour palettes for heatmap

4.1 The Stock Market Data

Dataset Overview

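The Smarket dataset contains daily percentage returns for the S&P 500 over 2001-2005. Today is the return on the day in question and Lag1 to Lag5 are the returns on the five preceding trading days. The dataset also records Volume (shares traded on the previous day, in billions) and Direction (whether the market moved Up or Down that day).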

glimpse(Smarket)

## Rows: 1,250
## Columns: 9
## $ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume    <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …

head(Smarket)

##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up

The response variable Direction is already a factor with two levels, which is what parsnip classification engines require. The levels are ordered Down, Up, so the glm engine models the probability of the second level, Up.

Smarket %>% count(Direction)

##   Direction   n
## 1      Down 602
## 2        Up 648
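The class counts also fix a baseline that any classifier has to beat. The following is a minimal sketch in base R, with the two counts copied by hand from the output above; the variable names are illustrative and not part of the lab.

```r
# Class counts copied from the count(Direction) output above
direction_counts <- c(Down = 602, Up = 648)

# Share of each class across all 1,250 trading days
class_share <- direction_counts / sum(direction_counts)
round(class_share, 3)

# Accuracy of a naive rule that always predicts the majority class
majority_class    <- names(which.max(class_share))
baseline_accuracy <- max(class_share)
majority_class
baseline_accuracy
```

The majority class is Up at roughly 51.8% of days, so accuracy figures later in the document are best read against that number rather than against 50%.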

Correlation Analysis

We remove Direction (categorical) before computing the Pearson correlation matrix with corrr::correlate().

cor_Smarket <- Smarket %>%
  select(-Direction) %>%
  correlate()

cor_Smarket

## # A tibble: 8 × 9
##   term      Year     Lag1     Lag2     Lag3     Lag4     Lag5  Volume    Today
##   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1 Year   NA       0.0297   0.0306   0.0332   0.0357   0.0298   0.539   0.0301 
## 2 Lag1    0.0297 NA       -0.0263  -0.0108  -0.00299 -0.00567  0.0409 -0.0262 
## 3 Lag2    0.0306 -0.0263  NA       -0.0259  -0.0109  -0.00356 -0.0434 -0.0103 
## 4 Lag3    0.0332 -0.0108  -0.0259  NA       -0.0241  -0.0188  -0.0418 -0.00245
## 5 Lag4    0.0357 -0.00299 -0.0109  -0.0241  NA       -0.0271  -0.0484 -0.00690
## 6 Lag5    0.0298 -0.00567 -0.00356 -0.0188  -0.0271  NA       -0.0220 -0.0349 
## 7 Volume  0.539   0.0409  -0.0434  -0.0418  -0.0484  -0.0220  NA       0.0146 
## 8 Today   0.0301 -0.0262  -0.0103  -0.00245 -0.00690 -0.0349   0.0146 NA
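For readers who want to see what each cell of this matrix represents, here is a small sketch of the Pearson correlation computed by hand and compared with base R's cor(). It uses only the six Lag1 and Lag2 values shown by head(Smarket), so the result is not expected to match the full-data value in the table; pearson_r is a hypothetical helper written for this illustration.

```r
# First six rows of Lag1 and Lag2, copied from the head(Smarket) output
lag1 <- c(0.381, 0.959, 1.032, -0.623, 0.614, 0.213)
lag2 <- c(-0.192, 0.381, 0.959, 1.032, -0.623, 0.614)

# Pearson correlation: covariance of the centred values divided by the
# product of their standard deviations (the n - 1 factors cancel)
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(lag1, lag2)
r_builtin <- cor(lag1, lag2)   # method = "pearson" is the default

all.equal(r_manual, r_builtin)
```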

Correlation Plot

rplot() gives a quick visual summary. Colours run from red (negative) through black (zero) to blue (positive).

rplot(cor_Smarket, colours = c("indianred2", "black", "skyblue1"))

Almost all pairwise correlations are close to zero, so the variables are largely uncorrelated. The notable exception is Year and Volume (r = 0.539).

Heatmap Version

cor_Smarket %>%
  stretch() %>%
  ggplot(aes(x, y, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(fashion(r)))) +
  scale_fill_paletteer_c("scico::roma", limits = c(-1, 1), direction = -1) +
  labs(
    title = "Correlation heatmap - Smarket variables",
    x = NULL, y = NULL, fill = "r"
  )

Year vs Volume

Plotting Volume over Year reveals an upward trend – trading volume increased across the five years in the dataset.

ggplot(Smarket, aes(Year, Volume)) +
  geom_jitter(height = 0, alpha = 0.4, colour = "steelblue") +
  labs(
    title = "Volume has increased over time",
    x = "Year",
    y = "Volume (billions of shares)"
  )

4.2 Logistic Regression

Model Specification

With parsnip, model building separates specification from fitting. logistic_reg() defaults to the "glm" engine and "classification" mode; the set_*() calls below are explicit but redundant – included for clarity.

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_spec

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

Fit on the Full Dataset

We predict Direction from the five lagged returns and Volume.

lr_fit <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket
  )

lr_fit

## parsnip model object
## 
## 
## Call:  stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
##   -0.126000    -0.073074    -0.042301     0.011085     0.009359     0.010313  
##      Volume  
##    0.135441  
## 
## Degrees of Freedom: 1249 Total (i.e. Null);  1243 Residual
## Null Deviance:       1731 
## Residual Deviance: 1728  AIC: 1742

GLM Summary

lr_fit %>%
  pluck("fit") %>%
  summary()

## 
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

Tidy Coefficients

tidy() extracts the coefficient table as a tibble.

tidy(lr_fit)

## # A tibble: 7 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.126      0.241     -0.523   0.601
## 2 Lag1        -0.0731     0.0502    -1.46    0.145
## 3 Lag2        -0.0423     0.0501    -0.845   0.398
## 4 Lag3         0.0111     0.0499     0.222   0.824
## 5 Lag4         0.00936    0.0500     0.187   0.851
## 6 Lag5         0.0103     0.0495     0.208   0.835
## 7 Volume       0.135      0.158      0.855   0.392

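The smallest p-value is 0.145 (Lag1), well above conventional significance levels such as 0.05. The negative Lag1 coefficient hints that a positive return yesterday makes a rise today slightly less likely, but the evidence is weak, which is consistent with a market that is very hard to predict from its own recent returns.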
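The statistic and p.value columns can be reproduced from the estimate and std.error columns alone. The sketch below does this in base R for two of the terms, using values copied from the summary output above, and also converts each coefficient to an odds ratio; the variable names are illustrative.

```r
# Estimates and standard errors copied from the glm summary above
estimate  <- c(Lag1 = -0.073074, Volume = 0.135441)
std_error <- c(Lag1 =  0.050167, Volume = 0.158360)

# Wald z statistic and its two-sided p-value under a standard normal
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))

# Coefficients are on the log-odds scale; exponentiate for odds ratios
odds_ratio <- exp(estimate)

round(cbind(z_value, p_value, odds_ratio), 3)
```

The z and p values should agree with the table to three decimals. The Lag1 odds ratio of about 0.93 means that a one-point higher return yesterday multiplies the estimated odds of Up by roughly 0.93, with the other predictors held fixed.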

Predictions

Class Predictions

predict(lr_fit, new_data = Smarket)

## # A tibble: 1,250 × 1
##    .pred_class
##    <fct>      
##  1 Up         
##  2 Down       
##  3 Down       
##  4 Up         
##  5 Up         
##  6 Up         
##  7 Down       
##  8 Up         
##  9 Up         
## 10 Down       
## # ℹ 1,240 more rows

Probability Predictions

predict(lr_fit, new_data = Smarket, type = "prob")

## # A tibble: 1,250 × 2
##    .pred_Down .pred_Up
##         <dbl>    <dbl>
##  1      0.493    0.507
##  2      0.519    0.481
##  3      0.519    0.481
##  4      0.485    0.515
##  5      0.489    0.511
##  6      0.493    0.507
##  7      0.507    0.493
##  8      0.491    0.509
##  9      0.482    0.518
## 10      0.511    0.489
## # ℹ 1,240 more rows

We get one column per class (.pred_Down, .pred_Up). This becomes especially useful in multi-class settings.
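To make the link between coefficients, probabilities and class predictions concrete, the sketch below recomputes the first row's prediction by hand in base R. The coefficients are copied from the fitted model output and the predictor values from the first row of head(Smarket). The 0.5 cut-off is assumed here as the usual rule for turning a probability into a class.

```r
# Coefficients copied from the lr_fit output (log-odds scale)
coefs <- c(Intercept = -0.126000, Lag1 = -0.073074, Lag2 = -0.042301,
           Lag3 = 0.011085, Lag4 = 0.009359, Lag5 = 0.010313,
           Volume = 0.135441)

# First row of Smarket, with a leading 1 for the intercept
row1 <- c(1, 0.381, -0.192, -2.624, -1.055, 5.010, 1.1913)

# Linear predictor, then the inverse logit to get P(Direction = Up)
log_odds <- sum(coefs * row1)
p_up     <- plogis(log_odds)
p_down   <- 1 - p_up

# Assumed decision rule: predict Up when P(Up) exceeds 0.5
pred_class <- if (p_up > 0.5) "Up" else "Down"

round(c(.pred_Down = p_down, .pred_Up = p_up), 3)
pred_class
```

The result should match the first row of both prediction tables above (0.493 / 0.507 and Up), up to rounding in the copied coefficients.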

Confusion Matrix and Accuracy (Full Data)

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class)

##           Truth
## Prediction Down  Up
##       Down  145 141
##       Up    457 507

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(title = "Confusion matrix - full dataset (training accuracy)")

augment(lr_fit, new_data = Smarket) %>%
  accuracy(truth = Direction, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.522

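Accuracy is about 52% (652 of 1,250 days correct), barely better than a coin flip and only marginally above the 51.8% obtained by always predicting Up. It is also measured on the same data used for training, so it is optimistically biased.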
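The accuracy figure follows directly from the confusion matrix: correct predictions sit on the diagonal. A minimal base R sketch, with the four counts copied from the conf_mat() output above, also shows how unevenly the two classes are handled.

```r
# Confusion matrix copied from conf_mat() above (rows = prediction,
# columns = truth); matrix() fills column by column
cm <- matrix(c(145, 457, 141, 507), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Truth      = c("Down", "Up")))

# Accuracy: correctly classified days over all days
accuracy <- sum(diag(cm)) / sum(cm)

# Share of each true class that the model classifies correctly
correct_by_class <- diag(cm) / colSums(cm)

# How often the model predicts Up at all
share_predicted_up <- sum(cm["Up", ]) / sum(cm)

accuracy
round(correct_by_class, 3)
share_predicted_up
```

The model predicts Up on about 77% of days, so it recovers roughly 78% of true Up days but only about 24% of true Down days.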

Train / Test Split by Year

A more realistic evaluation trains on past data and tests on future data, mimicking real deployment.

Smarket_train <- Smarket %>% filter(Year != 2005)   # 2001-2004
Smarket_test  <- Smarket %>% filter(Year == 2005)   # 2005

cat("Training rows:", nrow(Smarket_train), "\n")

## Training rows: 998

cat("Test rows    :", nrow(Smarket_test),  "\n")

## Test rows    : 252

Model 2 – All Six Predictors (train on 2001-2004, test on 2005)

lr_fit2 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket_train
  )

augment(lr_fit2, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)

##           Truth
## Prediction Down Up
##       Down   77 97
##       Up     34 44

augment(lr_fit2, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.480

Accuracy drops to about 48% (121 of 252 days) on the held-out 2005 data, which is worse than a coin flip. This suggests the model fitted noise rather than signal.

Model 3 – Reduced to Lag1 + Lag2 Only

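No predictor was significant in the full-data fit, and Lag3-Lag5 were the weakest (p > 0.8). The lab keeps only Lag1 and Lag2 and also drops Volume. Removing predictors that have little relationship with the response reduces variance without much increase in bias.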

lr_fit3 <- lr_spec %>%
  fit(
    Direction ~ Lag1 + Lag2,
    data = Smarket_train
  )

augment(lr_fit3, new_data = Smarket_test) %>%
  conf_mat(truth = Direction, estimate = .pred_class)

##           Truth
## Prediction Down  Up
##       Down   35  35
##       Up     76 106

augment(lr_fit3, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.560

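Accuracy rises to about 56% (141 of 252 days), up from 48% with all six predictors. The gain is modest in context: the market rose on 141 of the 252 test days, so always predicting Up would also score 56%. On the 182 days where the model predicts Up, it is right 106 times (about 58%).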
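The 2005 confusion matrix for the two-predictor model also allows a comparison against a naive rule that always predicts Up. The base R sketch below uses the four counts copied from the output above; the variable names are illustrative.

```r
# Model 3 confusion matrix on the 2005 test set (rows = prediction,
# columns = truth), copied from the conf_mat() output above
cm3 <- matrix(c(35, 76, 35, 106), nrow = 2,
              dimnames = list(Prediction = c("Down", "Up"),
                              Truth      = c("Down", "Up")))

# Test accuracy of the fitted model
model_accuracy <- sum(diag(cm3)) / sum(cm3)

# Accuracy of always predicting Up = share of true Up days in 2005
always_up_accuracy <- sum(cm3[, "Up"]) / sum(cm3)

# Accuracy restricted to the days on which the model predicts Up
accuracy_when_up <- cm3["Up", "Up"] / sum(cm3["Up", ])

round(c(model = model_accuracy,
        always_up = always_up_accuracy,
        when_predicting_up = accuracy_when_up), 3)
```

On this test set the model and the always-Up rule happen to get the same number of days right, so the more informative figure is the accuracy on days where the model predicts Up.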

Predicting for New Scenarios

What does the model predict for two hypothetical trading days?

Scenario	Lag1	Lag2
A	1.2	1.1
B	1.5	-0.8

Smarket_new <- tibble(
  Lag1 = c(1.2, 1.5),
  Lag2 = c(1.1, -0.8)
)

predict(lr_fit3, new_data = Smarket_new, type = "prob")

## # A tibble: 2 × 2
##   .pred_Down .pred_Up
##        <dbl>    <dbl>
## 1      0.521    0.479
## 2      0.504    0.496

Both scenarios give a slightly higher probability of Down (52.1% for A, 50.4% for B). Neither is far from 50%, reflecting the difficulty of predicting market direction from lagged returns alone.

Summary

Section	Key finding
4.1	Most Smarket variables are uncorrelated; Volume shows a clear upward trend over time.
4.2 (full data)	Logistic regression with all six predictors: ~52% training accuracy – essentially random.
4.2 (train/test, 6 preds)	Evaluated on held-out 2005 data: accuracy falls to ~48%, suggesting the model learned noise.
4.2 (train/test, 2 preds)	Dropping Lag3-Lag5 and Volume raises test accuracy to 56%, which only matches the share of Up days in 2005.

The core lesson: markets are hard to predict, and including irrelevant predictors can hurt out-of-sample performance.

Replicated from ISLR tidymodels labs – Chapter 4