Section 4.1: The Stock Market Data

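We analyze the Smarket dataset, which contains 1,250 daily percentage returns of the S&P 500 index from 2001 to 2005. Lag1 through Lag5 hold the returns of the five preceding trading days, Today holds the current day's return, Volume records trading volume, and Direction is a binary factor (Up/Down) indicating whether the market rose or fell that day.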

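library(ISLR)        # provides Smarket; assumed, since the source does not show which package was loaded
library(tidymodels)  # attaches parsnip, yardstick, dplyr, and ggplot2
library(corrr)       # correlate() and rplot()

data(Smarket)
glimpse(Smarket)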
## Rows: 1,250
## Columns: 9
## $ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume    <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …
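
The preview hints at how the columns relate: each Lag column looks like Today shifted down by one more row, and Direction looks like the sign of Today. The sketch below checks both patterns using only the first eight values printed above (transcribed by hand, so it is a spot check rather than a proof for the full dataset). The variable names are illustrative.

```r
# First eight values of each column, copied from the glimpse() output above
today     <- c(0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027)
lag1      <- c(0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403)
lag2      <- c(-0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392)
direction <- c("Up", "Up", "Down", "Up", "Up", "Up", "Down", "Up")

n <- length(today)

# Lag1 in row t should equal Today in row t - 1
lag1_matches <- all(lag1[2:n] == today[1:(n - 1)])

# Lag2 in row t should equal Today in row t - 2
lag2_matches <- all(lag2[3:n] == today[1:(n - 2)])

# Direction should be "Up" exactly when Today is positive
direction_matches <- all(ifelse(today > 0, "Up", "Down") == direction)

c(lag1 = lag1_matches, lag2 = lag2_matches, direction = direction_matches)
```

If all three checks hold on the full data, Direction is simply a recoding of Today, which is why Today must not be used as a predictor of Direction.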

Correlation Analysis

We examine relationships between variables (excluding Direction since it’s categorical).

cor_Smarket <- Smarket %>%
  select(-Direction) %>%
  correlate()
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
cor_Smarket
## # A tibble: 8 × 9
##   term      Year     Lag1     Lag2     Lag3     Lag4     Lag5  Volume    Today
##   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1 Year   NA       0.0297   0.0306   0.0332   0.0357   0.0298   0.539   0.0301 
## 2 Lag1    0.0297 NA       -0.0263  -0.0108  -0.00299 -0.00567  0.0409 -0.0262 
## 3 Lag2    0.0306 -0.0263  NA       -0.0259  -0.0109  -0.00356 -0.0434 -0.0103 
## 4 Lag3    0.0332 -0.0108  -0.0259  NA       -0.0241  -0.0188  -0.0418 -0.00245
## 5 Lag4    0.0357 -0.00299 -0.0109  -0.0241  NA       -0.0271  -0.0484 -0.00690
## 6 Lag5    0.0298 -0.00567 -0.00356 -0.0188  -0.0271  NA       -0.0220 -0.0349 
## 7 Volume  0.539   0.0409  -0.0434  -0.0418  -0.0484  -0.0220  NA       0.0146 
## 8 Today   0.0301 -0.0262  -0.0103  -0.00245 -0.00690 -0.0349   0.0146 NA

Correlation Plot

rplot(cor_Smarket, colours = c("red", "black", "blue"))
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the corrr package.
##   Please report the issue at <https://github.com/tidymodels/corrr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

💡 Insight:
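Almost all pairwise correlations are close to zero (below 0.05 in absolute value), so the previous days' returns carry little linear information about today's return. The one exception is Year and Volume, with a correlation of 0.539.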
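
To put a number on how weak these relationships are, the sketch below takes the Today row of the correlation table (values transcribed from the output above) and ranks the other variables by absolute correlation. The vector name today_cor is illustrative.

```r
# Correlations with Today, copied from the last row of the printed table
today_cor <- c(Year = 0.0301, Lag1 = -0.0262, Lag2 = -0.0103, Lag3 = -0.00245,
               Lag4 = -0.00690, Lag5 = -0.0349, Volume = 0.0146)

# Rank variables by the strength (absolute value) of their correlation
ranked <- sort(abs(today_cor), decreasing = TRUE)
ranked

# Squared correlation: the share of variance in Today that the strongest
# single variable could explain in a simple linear fit
max(ranked)^2
```

Even the strongest variable in this row would account for well under 1% of the variance in Today.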


Volume vs Year

ggplot(Smarket, aes(x = Year, y = Volume)) +
  geom_point(alpha = 0.5) +
  # Year takes only five distinct values, too few for the default GAM smoother, so fit a straight line
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Volume vs Year")

💡 Insight:
Trading volume rises over the sample period, in line with the 0.539 correlation between Year and Volume.
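
A per-year summary is one way to check this trend numerically. The sketch below uses base R's aggregate(); the helper name mean_volume_by_year and the toy data frame are illustrative, and the toy values are made up purely to show the shape of the result. Calling the helper on Smarket would give the real yearly means.

```r
# Hypothetical helper: average Volume within each Year
mean_volume_by_year <- function(df) {
  aggregate(Volume ~ Year, data = df, FUN = mean)
}

# Toy data with made-up values, only to illustrate the call
toy <- data.frame(
  Year   = c(2001, 2001, 2002, 2002),
  Volume = c(1.0, 1.2, 1.4, 1.6)
)

yearly <- mean_volume_by_year(toy)
yearly
```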


Section 4.2: Logistic Regression

We now build a classification model to predict Direction.


Train-Test Split

We follow the book:
- Train = before 2005
- Test = 2005

Smarket_train <- Smarket %>% filter(Year < 2005)
Smarket_test  <- Smarket %>% filter(Year == 2005)

Logistic Regression Model

We use predictors:

  • Lag1
  • Lag2

log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

log_fit <- log_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train)

log_fit
## parsnip model object
## 
## 
## Call:  stats::glm(formula = Direction ~ Lag1 + Lag2, family = stats::binomial, 
##     data = data)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2  
##     0.03222     -0.05562     -0.04449  
## 
## Degrees of Freedom: 997 Total (i.e. Null);  995 Residual
## Null Deviance:       1383 
## Residual Deviance: 1381  AIC: 1387
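
The printed coefficients define the fitted probability of an Up day through the logistic function. The sketch below evaluates it by hand with base R's plogis(). It assumes the model describes the probability of the second factor level (Up), which is the usual glm convention for a two-level factor response. The helper name prob_up is illustrative.

```r
# Coefficients copied from the model printout above
b0 <- 0.03222
b1 <- -0.05562
b2 <- -0.04449

# Hypothetical helper: fitted P(Direction = Up) for given lagged returns
prob_up <- function(lag1, lag2) {
  plogis(b0 + b1 * lag1 + b2 * lag2)  # plogis(x) = 1 / (1 + exp(-x))
}

prob_up(lag1 = 0, lag2 = 0)    # both previous days flat
prob_up(lag1 = 1, lag2 = 1)    # both previous days returned +1
prob_up(lag1 = -1, lag2 = -1)  # both previous days returned -1
```

Because both slopes are negative, positive returns on the two previous days lower the fitted probability of an Up day, though only by a small amount.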

Model Predictions

pred_class <- predict(log_fit, new_data = Smarket_test)

pred_prob <- predict(log_fit, new_data = Smarket_test, type = "prob")

results <- Smarket_test %>%
  select(Direction) %>%
  bind_cols(pred_class, pred_prob)

head(results)
##   Direction .pred_class .pred_Down  .pred_Up
## 1      Down          Up  0.4901725 0.5098275
## 2      Down          Up  0.4791763 0.5208237
## 3      Down          Up  0.4667365 0.5332635
## 4        Up          Up  0.4739426 0.5260574
## 5      Down          Up  0.4927897 0.5072103
## 6        Up          Up  0.4938612 0.5061388
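
The class predictions shown here are consistent with a 0.5 cutoff on .pred_Up. The sketch below applies that cutoff to the six probabilities printed above. Treating 0.5 as the cutoff is an assumption about how the class labels were derived, not something the output states, and the name manual_class is illustrative.

```r
# .pred_Up for the first six test rows, copied from head(results)
pred_up <- c(0.5098275, 0.5208237, 0.5332635, 0.5260574, 0.5072103, 0.5061388)

# Assumed rule: predict "Up" when the fitted probability is at least 0.5
manual_class <- factor(ifelse(pred_up >= 0.5, "Up", "Down"),
                       levels = c("Down", "Up"))
manual_class

# How far each probability sits from the 0.5 cutoff
round(pred_up - 0.5, 3)
```

All six probabilities sit only slightly above the cutoff, so the predicted class would flip under a small change in the fitted coefficients.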

Confusion Matrix

conf_mat(results, truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down  Up
##       Down   35  35
##       Up     76 106
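
Overall accuracy hides how differently the model treats the two classes. The sketch below rebuilds the matrix from the counts printed above and computes, using base R only, the share of true Up days and true Down days that were classified correctly. The object names are illustrative.

```r
# Counts copied from the conf_mat() output: rows = prediction, columns = truth
cm <- matrix(c(35, 76, 35, 106), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Truth      = c("Down", "Up")))

# Share of each true class that was predicted correctly
recall_up   <- cm["Up", "Up"] / sum(cm[, "Up"])        # 106 / 141
recall_down <- cm["Down", "Down"] / sum(cm[, "Down"])  # 35 / 111

# Share of all test days on which the model predicted Up
share_pred_up <- sum(cm["Up", ]) / sum(cm)             # 182 / 252

round(c(recall_up = recall_up, recall_down = recall_down,
        share_pred_up = share_pred_up), 3)
```

The model predicts Up on most days, so it catches about three quarters of the Up days but fewer than a third of the Down days.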

Accuracy

accuracy(results, truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.560

💡 Insight:
Test accuracy is about 56% (141 of 252 days classified correctly). That beats a coin flip, but it is no better than always predicting Up, which is also correct on 141 of the 252 test days.
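
A useful reference point for this accuracy is a rule that always predicts the more common class. The sketch below compares the two using the confusion-matrix counts reported earlier in this section. The variable names are illustrative.

```r
# Test-set counts taken from the confusion matrix in this section
correct_down <- 35        # true Down, predicted Down
correct_up   <- 106       # true Up, predicted Up
true_down    <- 35 + 76   # all days that were actually Down
true_up      <- 35 + 106  # all days that were actually Up
n_test       <- true_down + true_up

# Accuracy of the fitted model
model_accuracy <- (correct_down + correct_up) / n_test

# Accuracy of always predicting the majority class
baseline_accuracy <- max(true_down, true_up) / n_test

round(c(model = model_accuracy, baseline = baseline_accuracy), 3)
```

On this split the two figures coincide, so the 56% should not be read as evidence of predictive skill.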


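Interpretation

Both fitted coefficients are small and negative (-0.056 for Lag1, -0.044 for Lag2), and the residual deviance (1381) is barely below the null deviance (1383), so the two lags explain very little of the variation in Direction. The predicted probabilities shown for the first six test days all lie within a few percentage points of 0.5, and the model predicts Up on 182 of the 252 test days.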

Conclusion

This analysis replicated the classification workflow using logistic regression on stock market data. The results show that past returns provide limited information for predicting market direction, highlighting the difficulty of forecasting financial markets.