This document replicates Sections 4.1 (The Stock Market Data) and 4.2 (Logistic Regression) from Emil Hvitfeldt’s ISLR tidymodels labs.
We will examine the Smarket data set, which contains a
number of numeric variables plus a variable called
Direction with labels "Up" and
"Down".
We use the corrr package to explore correlations between
variables. We remove Direction since it is not numeric.
## # A tibble: 8 × 9
## term Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Year NA 0.0297 0.0306 0.0332 0.0357 0.0298 0.539 0.0301
## 2 Lag1 0.0297 NA -0.0263 -0.0108 -0.00299 -0.00567 0.0409 -0.0262
## 3 Lag2 0.0306 -0.0263 NA -0.0259 -0.0109 -0.00356 -0.0434 -0.0103
## 4 Lag3 0.0332 -0.0108 -0.0259 NA -0.0241 -0.0188 -0.0418 -0.00245
## 5 Lag4 0.0357 -0.00299 -0.0109 -0.0241 NA -0.0271 -0.0484 -0.00690
## 6 Lag5 0.0298 -0.00567 -0.00356 -0.0188 -0.0271 NA -0.0220 -0.0349
## 7 Volume 0.539 0.0409 -0.0434 -0.0418 -0.0484 -0.0220 NA 0.0146
## 8 Today 0.0301 -0.0262 -0.0103 -0.00245 -0.00690 -0.0349 0.0146 NA
Correlation chart. Most values are very close to 0. Year and Volume appear quite correlated.
Most variables are nearly uncorrelated with each other. The notable
exception is Year and Volume, which show a
modest positive correlation.
We use logistic_reg() from the parsnip
package with the glm engine.
We model Direction based on the 5 previous days’
percentage returns plus volume of shares traded.
lr_fit <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket
)
lr_fit## parsnip model object
##
##
## Call: stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000 -0.073074 -0.042301 0.011085 0.009359 0.010313
## Volume
## 0.135441
##
## Degrees of Freedom: 1249 Total (i.e. Null); 1243 Residual
## Null Deviance: 1731
## Residual Deviance: 1728 AIC: 1742
##
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume, family = stats::binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.126 0.241 -0.523 0.601
## 2 Lag1 -0.0731 0.0502 -1.46 0.145
## 3 Lag2 -0.0423 0.0501 -0.845 0.398
## 4 Lag3 0.0111 0.0499 0.222 0.824
## 5 Lag4 0.00936 0.0500 0.187 0.851
## 6 Lag5 0.0103 0.0495 0.208 0.835
## 7 Volume 0.135 0.158 0.855 0.392
## # A tibble: 1,250 × 1
## .pred_class
## <fct>
## 1 Up
## 2 Down
## 3 Down
## 4 Up
## 5 Up
## 6 Up
## 7 Down
## 8 Up
## 9 Up
## 10 Down
## # ℹ 1,240 more rows
## # A tibble: 1,250 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.493 0.507
## 2 0.519 0.481
## 3 0.519 0.481
## 4 0.485 0.515
## 5 0.489 0.511
## 6 0.493 0.507
## 7 0.507 0.493
## 8 0.491 0.509
## 9 0.482 0.518
## 10 0.511 0.489
## # ℹ 1,240 more rows
## Truth
## Prediction Down Up
## Down 145 141
## Up 457 507
To get a more realistic assessment, we split by year: train on all years except 2005, test on 2005.
Smarket_train <- Smarket %>%
filter(Year != 2005)
Smarket_test <- Smarket %>%
filter(Year == 2005)
dim(Smarket_train)## [1] 998 9
## [1] 252 9
lr_fit2 <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket_train
)## Truth
## Prediction Down Up
## Down 77 97
## Up 34 44
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.480
The out-of-sample accuracy drops to ~48%, worse than random chance — this model is not generalizing well.
Since most predictors had large p-values, we drop them and keep only
Lag1 and Lag2.
lr_fit3 <- lr_spec %>%
fit(
Direction ~ Lag1 + Lag2,
data = Smarket_train
)
augment(lr_fit3, new_data = Smarket_test) %>%
conf_mat(truth = Direction, estimate = .pred_class)## Truth
## Prediction Down Up
## Down 35 35
## Up 76 106
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.560
Accuracy improves to ~56% with just two predictors — removing irrelevant variables reduces variance.
We predict Direction for two hypothetical scenarios:
| Scenario | Lag1 | Lag2 |
|---|---|---|
| 1 | 1.2 | 1.1 |
| 2 | 1.5 | -0.8 |
Smarket_new <- tibble(
Lag1 = c(1.2, 1.5),
Lag2 = c(1.1, -0.8)
)
predict(
lr_fit3,
new_data = Smarket_new,
type = "prob"
)## # A tibble: 2 × 2
## .pred_Down .pred_Up
## <dbl> <dbl>
## 1 0.521 0.479
## 2 0.504 0.496
Both scenarios predict a slightly higher probability of
"Down" — consistent with the pattern observed in the
data.
| Section | Key Finding |
|---|---|
| 4.1 | Year and Volume are the most correlated
pair; other variables are nearly uncorrelated |
| 4.2 (full model, in-sample) | Accuracy ~52%; model is over-predicting "Up" |
| 4.2 (full model, test) | Accuracy drops to ~48% out-of-sample |
| 4.2 (reduced model, test) | Accuracy improves to ~56% using only Lag1 +
Lag2 |
Conclusion: For the Smarket data, a simpler logistic regression model with fewer predictors generalizes better. However, predicting stock market direction remains inherently difficult.
Replicated from ISLR tidymodels labs by Emil Hvitfeldt