This lab introduces classification models, where the response variable is a qualitative (categorical) variable rather than a continuous one. We use:
parsnip for model fittingrecipes and
workflows for data transformationstidymodels as the overarching
frameworklibrary(tidymodels)
library(ISLR) # For the Smarket data set
library(ISLR2) # For the Bikeshare data set
library(discrim) # LDA, QDA, and Naive Bayes models
library(poissonreg)We will examine the Smarket dataset. It
contains several numeric predictor variables (percentage returns from
the previous 5 days and volume of shares traded) plus a qualitative
response variable Direction, with labels "Up"
and "Down".
We use the corrr package to compute and
visualize correlations among the numeric variables. Since
Direction is categorical, we remove it before computing the
correlation matrix.
## # A tibble: 8 × 9
## term Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Year NA 0.0297 0.0306 0.0332 0.0357 0.0298 0.539 0.0301
## 2 Lag1 0.0297 NA -0.0263 -0.0108 -0.00299 -0.00567 0.0409 -0.0262
## 3 Lag2 0.0306 -0.0263 NA -0.0259 -0.0109 -0.00356 -0.0434 -0.0103
## 4 Lag3 0.0332 -0.0108 -0.0259 NA -0.0241 -0.0188 -0.0418 -0.00245
## 5 Lag4 0.0357 -0.00299 -0.0109 -0.0241 NA -0.0271 -0.0484 -0.00690
## 6 Lag5 0.0298 -0.00567 -0.00356 -0.0188 -0.0271 NA -0.0220 -0.0349
## 7 Volume 0.539 0.0409 -0.0434 -0.0418 -0.0484 -0.0220 NA 0.0146
## 8 Today 0.0301 -0.0262 -0.0103 -0.00245 -0.00690 -0.0349 0.0146 NA
Most variables are nearly uncorrelated with each other. The one
notable pair is Year and
Volume, which show a moderate positive
correlation.
For a more polished visualization, we can build a heatmap manually
using stretch() and ggplot2:
smarket_cor %>%
stretch() %>%
ggplot(aes(x = x, y = y, fill = r)) +
geom_tile() +
geom_text(aes(label = as.character(fashion(r)))) +
scale_fill_gradient2(
low = "indianred2",
mid = "white",
high = "skyblue1",
limits = c(-1, 1),
midpoint = 0
) +
labs(
title = "Correlation Heatmap — Smarket Data",
x = NULL,
y = NULL,
fill = "r"
) +
theme_minimal()If we plot Year against Volume, we can see
the upward trend in trading volume over time:
ggplot(Smarket, aes(x = Year, y = Volume)) +
geom_jitter(height = 0, alpha = 0.4, color = "steelblue") +
labs(
title = "Volume of Shares Traded Over Time",
x = "Year",
y = "Volume"
) +
theme_minimal()We use logistic_reg() from
parsnip to create a logistic regression
model specification. The engine "glm" and mode
"classification" are the defaults, but we set them
explicitly for clarity.
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
We model Direction as a function of the five lagged
percentage returns (Lag1–Lag5) and
Volume. The response variable must be a
factor — Smarket$Direction already
satisfies this requirement.
full_formula <- Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
log_reg_fit <- log_reg_spec %>%
fit(full_formula, data = Smarket)
log_reg_fit## parsnip model object
##
##
## Call: stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000 -0.073074 -0.042301 0.011085 0.009359 0.010313
## Volume
## 0.135441
##
## Degrees of Freedom: 1249 Total (i.e. Null); 1243 Residual
## Null Deviance: 1731
## Residual Deviance: 1728 AIC: 1742
The underlying glm fit provides detailed output via
summary():
##
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
We can also extract a tidy tibble of coefficients using
tidy():
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.126 0.241 -0.523 0.601
## 2 Lag1 -0.0731 0.0502 -1.46 0.145
## 3 Lag2 -0.0423 0.0501 -0.845 0.398
## 4 Lag3 0.0111 0.0499 0.222 0.824
## 5 Lag4 0.00936 0.0500 0.187 0.851
## 6 Lag5 0.0103 0.0495 0.208 0.835
## 7 Volume 0.135 0.158 0.855 0.392
None of the predictors have small p-values, suggesting limited predictive signal in this simple model.
## # A tibble: 1,250 × 1
## .pred_class
## <fct>
## 1 Up
## 2 Down
## 3 Down
## 4 Up
## 5 Up
## 6 Up
## 7 Down
## 8 Up
## 9 Up
## 10 Down
## # ℹ 1,240 more rows
The result is a tibble with a .pred_class column — a
factor with the same labels as the training data.
## # A tibble: 1,250 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.493 0.507
## 2 0.519 0.481
## 3 0.519 0.481
## 4 0.485 0.515
## 5 0.489 0.511
## 6 0.493 0.507
## 7 0.507 0.493
## 8 0.491 0.509
## 9 0.482 0.518
## 10 0.511 0.489
## # ℹ 1,240 more rows
We get one probability column per class. This becomes especially useful in multi-class settings.
Using augment() we attach predictions to the data and
then compute the confusion matrix:
## Truth
## Prediction Down Up
## Down 145 141
## Up 457 507
augment(log_reg_fit, new_data = Smarket) %>%
conf_mat(truth = Direction,
estimate = .pred_class) %>%
autoplot(type = "heatmap")A well-performing model would have large numbers along the main
diagonal and small numbers off-diagonal. Here, the model tends to
over-predict "Up".
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.522
An accuracy of ~52% is barely better than random chance, which is consistent with what the confusion matrix showed.
Evaluating on the same data used for training is optimistic. Since the data has a time component, we split it by year: train on all years except 2005 and test on 2005.
train_data <- Smarket %>% filter(Year != 2005)
test_data <- Smarket %>% filter(Year == 2005)
nrow(train_data)## [1] 998
## [1] 252
## Truth
## Prediction Down Up
## Down 77 97
## Up 34 44
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.480
Performance drops to ~48% on the held-out test set — slightly worse than random chance. This is expected when evaluating on genuinely new data.
Since most predictors had large p-values, we try a simpler model
using only Lag1 and Lag2. Removing
uninformative predictors can reduce variance without increasing
bias.
reduced_formula <- Direction ~ Lag1 + Lag2
log_reg_fit3 <- log_reg_spec %>%
fit(reduced_formula, data = train_data)
augment(log_reg_fit3, new_data = test_data) %>%
conf_mat(truth = Direction, estimate = .pred_class)## Truth
## Prediction Down Up
## Down 35 35
## Up 76 106
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.560
The reduced model achieves ~56% accuracy on the test set — a clear improvement over the full model.
Suppose we want to predict Direction for two specific
scenarios:
| Scenario | Lag1 | Lag2 |
|---|---|---|
| 1 | 1.2 | 1.1 |
| 2 | 1.5 | −0.8 |
new_obs <- tibble(
Lag1 = c(1.2, 1.5),
Lag2 = c(1.1, -0.8)
)
predict(log_reg_fit3, new_data = new_obs, type = "prob")## # A tibble: 2 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.521 0.479
## 2 0.504 0.496
Both scenarios show a slightly higher predicted probability for
"Down" than "Up", though the probabilities are
close to 50/50.
| Model | Data Split | Accuracy |
|---|---|---|
| Full model (all predictors) | Train = Test | 52.2% |
| Full model (all predictors) | Train ≠ Test | 48.0% |
| Reduced model (Lag1 + Lag2) | Train ≠ Test | 56.0% |
Key takeaways: