This report replicates Sections 4.1 (The Stock Market Data) and 4.2 (Logistic Regression) from the ISLR tidymodels labs.
We use the tidymodels ecosystem for model fitting and
the Smarket data set from the ISLR package,
which contains daily percentage returns for the S&P 500 from 2001 to
2005.
library(tidymodels)
library(ISLR) # Contains the Smarket data set
library(ISLR2) # Contains the Bikeshare data set (used in later sections)
library(discrim) # Discriminant analysis models
library(poissonreg) # Poisson regression
library(corrr) # Correlation analysis
library(paletteer) # Color palettes for ggplot2The Smarket data set contains 1,250 observations and the
following variables:
"Up" or "Down" on the current day (our
response variable)## Rows: 1,250
## Columns: 9
## $ Year <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1 <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2 <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3 <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4 <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5 <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …
We remove Direction (a factor, not numeric) and compute
the pairwise Pearson correlations among all numeric variables.
## # A tibble: 8 × 9
## term Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Year NA 0.0297 0.0306 0.0332 0.0357 0.0298 0.539 0.0301
## 2 Lag1 0.0297 NA -0.0263 -0.0108 -0.00299 -0.00567 0.0409 -0.0262
## 3 Lag2 0.0306 -0.0263 NA -0.0259 -0.0109 -0.00356 -0.0434 -0.0103
## 4 Lag3 0.0332 -0.0108 -0.0259 NA -0.0241 -0.0188 -0.0418 -0.00245
## 5 Lag4 0.0357 -0.00299 -0.0109 -0.0241 NA -0.0271 -0.0484 -0.00690
## 6 Lag5 0.0298 -0.00567 -0.00356 -0.0188 -0.0271 NA -0.0220 -0.0349
## 7 Volume 0.539 0.0409 -0.0434 -0.0418 -0.0484 -0.0220 NA 0.0146
## 8 Today 0.0301 -0.0262 -0.0103 -0.00245 -0.00690 -0.0349 0.0146 NA
Most variables are nearly uncorrelated with each other. The notable exception is Year and Volume, which show a moderate positive correlation.
For a more detailed view, we can create a heatmap manually using
stretch() to convert the correlation matrix to long
format.
cor_Smarket %>%
stretch() %>%
ggplot(aes(x, y, fill = r)) +
geom_tile() +
geom_text(aes(label = as.character(fashion(r)))) +
scale_fill_paletteer_c("scico::roma", limits = c(-1, 1), direction = -1) +
labs(
title = "Correlation Heatmap — Smarket Variables",
x = NULL,
y = NULL,
fill = "Correlation"
)We use logistic_reg() from parsnip with the
glm engine for binary classification. Setting the engine
and mode explicitly here is optional since they are the defaults, but it
is good practice.
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
We first fit the model using all six predictors (Lag1
through Lag5 and Volume) on the full
Smarket data set.
Note: The response variable
Directionmust be a factor, which it already is inSmarket.
lr_fit <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket
)
lr_fit## parsnip model object
##
##
## Call: stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000 -0.073074 -0.042301 0.011085 0.009359 0.010313
## Volume
## 0.135441
##
## Degrees of Freedom: 1249 Total (i.e. Null); 1243 Residual
## Null Deviance: 1731
## Residual Deviance: 1728 AIC: 1742
##
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
The tidy() function returns a clean tibble of estimates,
standard errors, test statistics, and p-values.
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.126 0.241 -0.523 0.601
## 2 Lag1 -0.0731 0.0502 -1.46 0.145
## 3 Lag2 -0.0423 0.0501 -0.845 0.398
## 4 Lag3 0.0111 0.0499 0.222 0.824
## 5 Lag4 0.00936 0.0500 0.187 0.851
## 6 Lag5 0.0103 0.0495 0.208 0.835
## 7 Volume 0.135 0.158 0.855 0.392
None of the predictors have a statistically significant p-value, suggesting the model has limited explanatory power.
## # A tibble: 1,250 × 1
## .pred_class
## <fct>
## 1 Up
## 2 Down
## 3 Down
## 4 Up
## 5 Up
## 6 Up
## 7 Down
## 8 Up
## 9 Up
## 10 Down
## # ℹ 1,240 more rows
Using type = "prob" returns the estimated probability
for each class.
## # A tibble: 1,250 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.493 0.507
## 2 0.519 0.481
## 3 0.519 0.481
## 4 0.485 0.515
## 5 0.489 0.511
## 6 0.493 0.507
## 7 0.507 0.493
## 8 0.491 0.509
## 9 0.482 0.518
## 10 0.511 0.489
## # ℹ 1,240 more rows
augment() appends predictions to the original data frame
for easy evaluation.
## Truth
## Prediction Down Up
## Down 145 141
## Up 457 507
Since the data has a time component, a more realistic evaluation splits by year: train on 2001–2004, test on 2005.
Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test <- Smarket %>% filter(Year == 2005)
nrow(Smarket_train)## [1] 998
## [1] 252
lr_fit2 <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket_train
)## Truth
## Prediction Down Up
## Down 77 97
## Up 34 44
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.480
Accuracy drops to ~48% on unseen data — worse than random chance, indicating the model does not generalize well.
Because most predictors had high p-values, we try a simpler model
using only Lag1 and Lag2. Removing noisy
predictors can reduce variance without increasing bias.
## Truth
## Prediction Down Up
## Down 35 35
## Up 76 106
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.560
Accuracy improves to ~56% — a noticeable gain from removing uninformative predictors.
We predict Direction for two hypothetical days:
| Scenario | Lag1 | Lag2 |
|---|---|---|
| Day A | 1.2 | 1.1 |
| Day B | 1.5 | −0.8 |
Smarket_new <- tibble(
Lag1 = c(1.2, 1.5),
Lag2 = c(1.1, -0.8)
)
predict(
lr_fit3,
new_data = Smarket_new,
type = "prob"
)## # A tibble: 2 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.521 0.479
## 2 0.504 0.496
Both days have a slightly higher probability of "Down",
which aligns with both Lag1 values being positive
(suggesting recent gains that may reverse).
| Model | Data Used for Evaluation | Accuracy |
|---|---|---|
| Full model (all 6 predictors) — trained & tested on full data | Full Smarket | 52.2% |
| Full model (all 6 predictors) — tested on 2005 | Smarket test (2005) | 48.0% |
| Reduced model (Lag1 + Lag2) — tested on 2005 | Smarket test (2005) | 56.0% |
The reduced logistic regression model using only Lag1
and Lag2 performs best on held-out data, demonstrating that
simpler models can generalize better when many predictors lack a true
relationship with the response.