Chapter 4: Classification

Overview

This lab introduces classification models, where the response variable is a qualitative (categorical) variable rather than a continuous one. We use:

parsnip for model fitting
recipes and workflows for data transformations
tidymodels as the overarching framework

4.1 The Stock Market Data

Load Libraries

library(tidymodels)
library(ISLR)       # For the Smarket data set
library(ISLR2)      # For the Bikeshare data set
library(discrim)    # LDA, QDA, and Naive Bayes models
library(poissonreg)

We will examine the Smarket dataset. It contains several numeric predictor variables (percentage returns from the previous 5 days and volume of shares traded) plus a qualitative response variable Direction, with labels "Up" and "Down".

Correlation Analysis

We use the corrr package to compute and visualize correlations among the numeric variables. Since Direction is categorical, we remove it before computing the correlation matrix.

library(corrr)

smarket_cor <- Smarket %>%
  select(-Direction) %>%
  correlate()

smarket_cor

## # A tibble: 8 × 9
##   term      Year     Lag1     Lag2     Lag3     Lag4     Lag5  Volume    Today
##   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1 Year   NA       0.0297   0.0306   0.0332   0.0357   0.0298   0.539   0.0301 
## 2 Lag1    0.0297 NA       -0.0263  -0.0108  -0.00299 -0.00567  0.0409 -0.0262 
## 3 Lag2    0.0306 -0.0263  NA       -0.0259  -0.0109  -0.00356 -0.0434 -0.0103 
## 4 Lag3    0.0332 -0.0108  -0.0259  NA       -0.0241  -0.0188  -0.0418 -0.00245
## 5 Lag4    0.0357 -0.00299 -0.0109  -0.0241  NA       -0.0271  -0.0484 -0.00690
## 6 Lag5    0.0298 -0.00567 -0.00356 -0.0188  -0.0271  NA       -0.0220 -0.0349 
## 7 Volume  0.539   0.0409  -0.0434  -0.0418  -0.0484  -0.0220  NA       0.0146 
## 8 Today   0.0301 -0.0262  -0.0103  -0.00245 -0.00690 -0.0349   0.0146 NA

rplot(smarket_cor, colours = c("indianred2", "black", "skyblue1"))

Most variables are nearly uncorrelated with each other. The one notable pair is Year and Volume, which show a moderate positive correlation.

Heatmap-Style Correlation Chart

For a more polished visualization, we can build a heatmap manually using stretch() and ggplot2:

smarket_cor %>%
  stretch() %>%
  ggplot(aes(x = x, y = y, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(fashion(r)))) +
  scale_fill_gradient2(
    low      = "indianred2",
    mid      = "white",
    high     = "skyblue1",
    limits   = c(-1, 1),
    midpoint = 0
  ) +
  labs(
    title = "Correlation Heatmap — Smarket Data",
    x     = NULL,
    y     = NULL,
    fill  = "r"
  ) +
  theme_minimal()

Year vs. Volume

If we plot Year against Volume, we can see the upward trend in trading volume over time:

ggplot(Smarket, aes(x = Year, y = Volume)) +
  geom_jitter(height = 0, alpha = 0.4, color = "steelblue") +
  labs(
    title = "Volume of Shares Traded Over Time",
    x     = "Year",
    y     = "Volume"
  ) +
  theme_minimal()

4.2 Logistic Regression

Model Specification

We use logistic_reg() from parsnip to create a logistic regression model specification. The engine "glm" and mode "classification" are the defaults, but we set them explicitly for clarity.

log_reg_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

log_reg_spec

## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

Fit the Model (Full Training Set)

We model Direction as a function of the five lagged percentage returns (Lag1–Lag5) and Volume. The response variable must be a factor — Smarket$Direction already satisfies this requirement.

full_formula <- Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume

log_reg_fit <- log_reg_spec %>%
  fit(full_formula, data = Smarket)

log_reg_fit

## parsnip model object
## 
## 
## Call:  stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
##   -0.126000    -0.073074    -0.042301     0.011085     0.009359     0.010313  
##      Volume  
##    0.135441  
## 
## Degrees of Freedom: 1249 Total (i.e. Null);  1243 Residual
## Null Deviance:       1731 
## Residual Deviance: 1728  AIC: 1742

Model Summary

The underlying glm fit provides detailed output via summary():

log_reg_fit %>%
  pluck("fit") %>%
  summary()

## 
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

We can also extract a tidy tibble of coefficients using tidy():

tidy(log_reg_fit)

## # A tibble: 7 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.126      0.241     -0.523   0.601
## 2 Lag1        -0.0731     0.0502    -1.46    0.145
## 3 Lag2        -0.0423     0.0501    -0.845   0.398
## 4 Lag3         0.0111     0.0499     0.222   0.824
## 5 Lag4         0.00936    0.0500     0.187   0.851
## 6 Lag5         0.0103     0.0495     0.208   0.835
## 7 Volume       0.135      0.158      0.855   0.392

None of the predictors have small p-values, suggesting limited predictive signal in this simple model.

Predictions

Class Predictions

predict(log_reg_fit, new_data = Smarket)

## # A tibble: 1,250 × 1
##    .pred_class
##    <fct>      
##  1 Up         
##  2 Down       
##  3 Down       
##  4 Up         
##  5 Up         
##  6 Up         
##  7 Down       
##  8 Up         
##  9 Up         
## 10 Down       
## # ℹ 1,240 more rows

The result is a tibble with a .pred_class column — a factor with the same labels as the training data.

Probability Predictions

predict(log_reg_fit, new_data = Smarket, type = "prob")

## # A tibble: 1,250 × 2
##    .pred_Down .pred_Up
##         <dbl>    <dbl>
##  1      0.493    0.507
##  2      0.519    0.481
##  3      0.519    0.481
##  4      0.485    0.515
##  5      0.489    0.511
##  6      0.493    0.507
##  7      0.507    0.493
##  8      0.491    0.509
##  9      0.482    0.518
## 10      0.511    0.489
## # ℹ 1,240 more rows

We get one probability column per class. This becomes especially useful in multi-class settings.

Confusion Matrix

Using augment() we attach predictions to the data and then compute the confusion matrix:

augment(log_reg_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class)

##           Truth
## Prediction Down  Up
##       Down  145 141
##       Up    457 507

augment(log_reg_fit, new_data = Smarket) %>%
  conf_mat(truth    = Direction,
           estimate = .pred_class) %>%
  autoplot(type = "heatmap")

A well-performing model would have large numbers along the main diagonal and small numbers off-diagonal. Here, the model tends to over-predict "Up".

Accuracy

augment(log_reg_fit, new_data = Smarket) %>%
  accuracy(truth = Direction, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.522

An accuracy of ~52% is barely better than random chance, which is consistent with what the confusion matrix showed.

Train/Test Split

Evaluating on the same data used for training is optimistic. Since the data has a time component, we split it by year: train on all years except 2005 and test on 2005.

train_data <- Smarket %>% filter(Year != 2005)
test_data  <- Smarket %>% filter(Year == 2005)

nrow(train_data)

## [1] 998

nrow(test_data)

## [1] 252

Refit on Training Data

log_reg_fit2 <- log_reg_spec %>%
  fit(full_formula, data = train_data)

Evaluate on Test Data

augment(log_reg_fit2, new_data = test_data) %>%
  conf_mat(truth = Direction, estimate = .pred_class)

##           Truth
## Prediction Down Up
##       Down   77 97
##       Up     34 44

augment(log_reg_fit2, new_data = test_data) %>%
  accuracy(truth = Direction, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.480

Performance drops to ~48% on the held-out test set — slightly worse than random chance. This is expected when evaluating on genuinely new data.

Reduced Model: Lag1 + Lag2 Only

Since most predictors had large p-values, we try a simpler model using only Lag1 and Lag2. Removing uninformative predictors can reduce variance without increasing bias.

reduced_formula <- Direction ~ Lag1 + Lag2

log_reg_fit3 <- log_reg_spec %>%
  fit(reduced_formula, data = train_data)

augment(log_reg_fit3, new_data = test_data) %>%
  conf_mat(truth = Direction, estimate = .pred_class)

##           Truth
## Prediction Down  Up
##       Down   35  35
##       Up     76 106

augment(log_reg_fit3, new_data = test_data) %>%
  accuracy(truth = Direction, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.560

The reduced model achieves ~56% accuracy on the test set — a clear improvement over the full model.

Predicting for New Observations

Suppose we want to predict Direction for two specific scenarios:

Scenario	Lag1	Lag2
1	1.2	1.1
2	1.5	−0.8

new_obs <- tibble(
  Lag1 = c(1.2, 1.5),
  Lag2 = c(1.1, -0.8)
)

predict(log_reg_fit3, new_data = new_obs, type = "prob")

## # A tibble: 2 × 2
##   .pred_Down .pred_Up
##        <dbl>    <dbl>
## 1      0.521    0.479
## 2      0.504    0.496

Both scenarios show a slightly higher predicted probability for "Down" than "Up", though the probabilities are close to 50/50.

Summary

Model	Data Split	Accuracy
Full model (all predictors)	Train = Test	52.2%
Full model (all predictors)	Train ≠ Test	48.0%
Reduced model (Lag1 + Lag2)	Train ≠ Test	56.0%

Key takeaways:

Evaluating on training data gives an overly optimistic view of performance.
Removing low-signal predictors can improve generalization.
Even the best model here is close to random — stock market prediction is genuinely difficult!