We analyze the Smarket dataset, which contains stock
market returns and a binary variable Direction
(Up/Down).
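library(tidymodels)  # dplyr, ggplot2, parsnip, yardstick
library(corrr)       # correlate(), rplot()
data(Smarket, package = "ISLR")  # assumes the ISLR package; ISLR2 also ships Smarket
glimpse(Smarket)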
## Rows: 1,250
## Columns: 9
## $ Year <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1 <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2 <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3 <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4 <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5 <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …
We examine the pairwise correlations between the numeric variables (Direction is excluded because it is categorical).
cor_Smarket <- Smarket %>%
select(-Direction) %>%
correlate()
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
cor_Smarket
## # A tibble: 8 × 9
## term Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Year NA 0.0297 0.0306 0.0332 0.0357 0.0298 0.539 0.0301
## 2 Lag1 0.0297 NA -0.0263 -0.0108 -0.00299 -0.00567 0.0409 -0.0262
## 3 Lag2 0.0306 -0.0263 NA -0.0259 -0.0109 -0.00356 -0.0434 -0.0103
## 4 Lag3 0.0332 -0.0108 -0.0259 NA -0.0241 -0.0188 -0.0418 -0.00245
## 5 Lag4 0.0357 -0.00299 -0.0109 -0.0241 NA -0.0271 -0.0484 -0.00690
## 6 Lag5 0.0298 -0.00567 -0.00356 -0.0188 -0.0271 NA -0.0220 -0.0349
## 7 Volume 0.539 0.0409 -0.0434 -0.0418 -0.0484 -0.0220 NA 0.0146
## 8 Today 0.0301 -0.0262 -0.0103 -0.00245 -0.00690 -0.0349 0.0146 NA
rplot(cor_Smarket, colours = c("red", "black", "blue"))
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the corrr package.
## Please report the issue at <https://github.com/tidymodels/corrr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
💡 Insight:
Most pairwise correlations are close to zero, so the lagged returns
carry almost no linear information about each other or about Today.
The one notable exception is Year and Volume (r = 0.539).
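To pick out the strongest relationship programmatically rather than by eye, one option is to scan the off-diagonal entries of a correlation matrix. The sketch below uses base R only and retypes four of the variables from the printed table by hand. The names `cors` and `strongest_pair` are illustrative and not part of the original analysis.

```r
# Correlations copied from the printed table (subset: Year, Lag1, Volume, Today)
vars <- c("Year", "Lag1", "Volume", "Today")
cors <- matrix(1, nrow = 4, ncol = 4, dimnames = list(vars, vars))
cors["Year", "Lag1"]    <- 0.0297
cors["Year", "Volume"]  <- 0.539
cors["Year", "Today"]   <- 0.0301
cors["Lag1", "Volume"]  <- 0.0409
cors["Lag1", "Today"]   <- -0.0262
cors["Volume", "Today"] <- 0.0146
# Mirror the upper triangle into the lower triangle so the matrix is symmetric
cors[lower.tri(cors)] <- t(cors)[lower.tri(cors)]

# Hypothetical helper: return the pair of variables with the largest |r|
strongest_pair <- function(m) {
  m_abs <- abs(m)
  diag(m_abs) <- 0  # ignore each variable's correlation with itself
  idx <- which(m_abs == max(m_abs), arr.ind = TRUE)[1, ]
  list(vars = c(rownames(m)[idx[1]], colnames(m)[idx[2]]),
       r = m[idx[1], idx[2]])
}

strongest_pair(cors)
```

On the full 8 × 8 table the same scan would apply unchanged, since every other entry shown is below 0.05 in absolute value.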
# Year takes only five distinct values, so the default GAM smoother cannot be
# fit (it needs more unique x values); a linear trend is used instead.
ggplot(Smarket, aes(x = Year, y = Volume)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Volume vs Year")
💡 Insight:
Trading volume increases over time.
We now build a classification model to predict
Direction.
We follow the book:
- Train = before 2005
- Test = 2005
Smarket_train <- Smarket %>% filter(Year < 2005)
Smarket_test <- Smarket %>% filter(Year == 2005)
We use Lag1 and Lag2 as predictors:
log_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_fit <- log_spec %>%
fit(Direction ~ Lag1 + Lag2, data = Smarket_train)
log_fit
## parsnip model object
##
##
## Call: stats::glm(formula = Direction ~ Lag1 + Lag2, family = stats::binomial,
## data = data)
##
## Coefficients:
## (Intercept) Lag1 Lag2
## 0.03222 -0.05562 -0.04449
##
## Degrees of Freedom: 997 Total (i.e. Null); 995 Residual
## Null Deviance: 1383
## Residual Deviance: 1381 AIC: 1387
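The coefficients above are on the log-odds scale. For a binomial `glm` with a factor response, the first level (Down) is treated as failure, so the linear predictor describes the probability of Up. As a rough sketch, the fitted probability can be reproduced by hand with `plogis()`. The helper `prob_up` and the input values are illustrative assumptions, not part of the original analysis, and the coefficients are the rounded values printed above.

```r
# Coefficients copied (rounded) from the printed glm output
b0     <-  0.03222
b_lag1 <- -0.05562
b_lag2 <- -0.04449

# Hypothetical helper: P(Direction = "Up") for given lagged returns
prob_up <- function(lag1, lag2) {
  plogis(b0 + b_lag1 * lag1 + b_lag2 * lag2)  # inverse logit of the linear predictor
}

prob_up(0, 0)      # both lags zero: just above 0.5
prob_up(1.2, 1.1)  # two positive prior returns (made-up values): below 0.5
```

Because both slopes are small and negative, realistic lag values move the probability only a few percentage points away from 0.5.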
pred_class <- predict(log_fit, new_data = Smarket_test)
pred_prob <- predict(log_fit, new_data = Smarket_test, type = "prob")
results <- Smarket_test %>%
select(Direction) %>%
bind_cols(pred_class, pred_prob)
head(results)
## Direction .pred_class .pred_Down .pred_Up
## 1 Down Up 0.4901725 0.5098275
## 2 Down Up 0.4791763 0.5208237
## 3 Down Up 0.4667365 0.5332635
## 4 Up Up 0.4739426 0.5260574
## 5 Down Up 0.4927897 0.5072103
## 6 Up Up 0.4938612 0.5061388
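The `.pred_class` column follows from `.pred_Up` by a cutoff, assumed here to be 0.5 (the more probable class wins). The sketch below retypes the six probabilities printed above and applies that rule in base R. The helper `classify` and the alternative cutoff of 0.52 are illustrative assumptions.

```r
# .pred_Up values copied from head(results)
p_up <- c(0.5098275, 0.5208237, 0.5332635, 0.5260574, 0.5072103, 0.5061388)

# Hypothetical helper: turn probabilities into class labels at a given cutoff
classify <- function(p, threshold = 0.5) {
  factor(ifelse(p > threshold, "Up", "Down"), levels = c("Down", "Up"))
}

classify(p_up)                    # default cutoff: every row is labelled "Up"
classify(p_up, threshold = 0.52)  # stricter cutoff: only the more confident rows stay "Up"
```

Since all six probabilities sit within a few points of 0.5, small changes to the cutoff flip many labels, which is one sign of a weak classifier.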
conf_mat(results, truth = Direction, estimate = .pred_class)
## Truth
## Prediction Down Up
## Down 35 35
## Up 76 106
accuracy(results, truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.560
💡 Insight:
Accuracy is about 56% (141 of 252 test observations classified correctly),
only slightly better than random guessing.
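The accuracy figure can be checked directly from the confusion matrix counts, and it helps to compare it with a naive baseline. The following base R sketch retypes the four counts printed above. The variable names are illustrative, and the "always predict Up" baseline is an added comparison rather than part of the original analysis.

```r
# Counts copied from conf_mat(): rows are predictions, columns are the truth
cm <- matrix(c(35, 76, 35, 106), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Truth      = c("Down", "Up")))

accuracy    <- sum(diag(cm)) / sum(cm)             # correct predictions / all predictions
baseline_up <- sum(cm[, "Up"]) / sum(cm)           # accuracy of always predicting "Up"
recall_up   <- cm["Up", "Up"] / sum(cm[, "Up"])    # share of true Up days caught
recall_down <- cm["Down", "Down"] / sum(cm[, "Down"])  # share of true Down days caught

round(c(accuracy = accuracy, baseline_up = baseline_up,
        recall_up = recall_up, recall_down = recall_down), 3)
```

With these counts the model's accuracy equals the share of Up days in the test set, so it does no better than always predicting Up, and it identifies fewer than a third of the Down days.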
The predictors (Lag1, Lag2) have weak predictive power.
This analysis replicated the classification workflow using logistic regression on stock market data. The results show that past returns provide limited information for predicting market direction, highlighting the difficulty of forecasting financial markets.