4.1 Overview of Classification

Classification is a supervised learning method used to predict qualitative (categorical) outcomes. Unlike regression, which predicts continuous values, classification assigns observations to predefined categories.

In this analysis, we use the Smarket dataset from the ISLR package, which contains 1,250 daily observations of S&P 500 percentage returns from 2001 to 2005. The goal is to predict whether the market goes Up or Down on a given day (Direction) from the returns on the five previous days (Lag1 to Lag5) and the trading volume (Volume).

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.10     ✔ recipes      1.3.1 
## ✔ dials        1.4.2      ✔ rsample      1.3.1 
## ✔ dplyr        1.1.4      ✔ tailor       0.1.0 
## ✔ ggplot2      4.0.2      ✔ tidyr        1.3.1 
## ✔ infer        1.1.0      ✔ tune         2.0.1 
## ✔ modeldata    1.5.1      ✔ workflows    1.3.0 
## ✔ parsnip      1.4.1      ✔ workflowsets 1.1.1 
## ✔ purrr        1.1.0      ✔ yardstick    1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
library(ISLR) # For the Smarket data set
library(ISLR2) # For the Bikeshare data set
## 
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit
library(discrim)
## 
## Attaching package: 'discrim'
## The following object is masked from 'package:dials':
## 
##     smoothness
library(poissonreg)
library(corrr)
cor_Smarket <- Smarket %>%
  select(-Direction) %>%
  correlate()
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
rplot(cor_Smarket, colours = c("indianred2", "black", "skyblue1"))

library(paletteer)
cor_Smarket %>%
  stretch() %>%
  ggplot(aes(x, y, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(fashion(r)))) +
  scale_fill_paletteer_c("scico::roma", limits = c(-1, 1), direction = -1)

ggplot(Smarket, aes(Year, Volume)) +
  geom_jitter(height = 0)

# Load dataset 
data(Smarket)
# Convert response variable to factor 
Smarket$Direction <- as.factor(Smarket$Direction)
# View structure
glimpse(Smarket)
## Rows: 1,250
## Columns: 9
## $ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume    <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …

4.2 Logistic Regression

Logistic regression is used for classification problems where the response variable is binary. Instead of predicting a numeric value, it estimates the probability that an observation belongs to a particular class.

Why Not Linear Regression?

Linear regression is not suitable for classification because:

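- It can produce predictions outside the range [0, 1], which cannot be interpreted as probabilities.
- It assumes constant-variance, normally distributed errors, which a binary response violates.
- The relationship between the predictors and the probability is nonlinear (S-shaped), not a straight line.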
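The range problem can be illustrated with a minimal base R sketch. The values of `eta` below are hypothetical (they are not taken from the Smarket fit); the point is only to compare a raw linear predictor with the same values passed through the logistic function, which base R provides as `plogis()`.

```r
# Hypothetical values of a linear predictor b0 + b1 * x (illustrative only)
eta <- c(-5, -1, 0, 1, 5)

# Used directly as a "probability", several values fall below 0 or above 1
outside_unit_interval <- eta < 0 | eta > 1

# The logistic function 1 / (1 + exp(-eta)) maps any real number into (0, 1)
p <- plogis(eta)

data.frame(eta, outside_unit_interval, p = round(p, 3))
```

Logistic regression fits the linear predictor on the log-odds scale and applies this transformation, so its fitted probabilities always stay strictly between 0 and 1.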

Step 1: Define Logistic Regression Model

lr_spec <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

Step 2: Fit the Model

lr_fit <- lr_spec %>% 
  fit(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
      data = Smarket)

Step 3: Model Summary

lr_fit %>%
  pluck("fit") %>%
  summary()
## 
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

This output provides:

- Coefficient estimates
- Statistical significance (p-values)
- Model fit statistics

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Step 4: Tidy Model Output

tidy(lr_fit)
## # A tibble: 7 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.126      0.241     -0.523   0.601
## 2 Lag1        -0.0731     0.0502    -1.46    0.145
## 3 Lag2        -0.0423     0.0501    -0.845   0.398
## 4 Lag3         0.0111     0.0499     0.222   0.824
## 5 Lag4         0.00936    0.0500     0.187   0.851
## 6 Lag5         0.0103     0.0495     0.208   0.835
## 7 Volume       0.135      0.158      0.855   0.392

tidy() returns the coefficient table as a tibble, which is easier to filter, sort, and plot than the printed summary.
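The estimates are on the log-odds scale, which is hard to read directly. As a rough sketch, the coefficients can be exponentiated to obtain odds ratios. The values below are copied by hand from the summary output above rather than extracted from `lr_fit`, and the interpretation assumes that `glm` models the probability of the second factor level, here Up.

```r
# Coefficient estimates copied from the summary output above (log-odds scale)
estimates <- c(
  Lag1   = -0.073074,
  Lag2   = -0.042301,
  Lag3   =  0.011085,
  Lag4   =  0.009359,
  Lag5   =  0.010313,
  Volume =  0.135441
)

# Exponentiating a log-odds coefficient gives an odds ratio:
# the factor by which the odds of "Up" change per one-unit increase
odds_ratios <- exp(estimates)
round(odds_ratios, 3)
```

For example, the odds ratio for Lag1 is about 0.93: a one-unit increase in the previous day's return multiplies the estimated odds of Up by roughly 0.93, holding the other predictors fixed. Given the large p-values, none of these ratios can be distinguished from 1.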

Step 5: Predictions (Class)

pred_class <- predict(lr_fit, new_data = Smarket)
head(pred_class)
## # A tibble: 6 × 1
##   .pred_class
##   <fct>      
## 1 Up         
## 2 Down       
## 3 Down       
## 4 Up         
## 5 Up         
## 6 Up

Step 6: Predictions (Probabilities)

pred_prob <- predict(lr_fit, new_data = Smarket, type = "prob")
head(pred_prob)
## # A tibble: 6 × 2
##   .pred_Down .pred_Up
##        <dbl>    <dbl>
## 1      0.493    0.507
## 2      0.519    0.481
## 3      0.519    0.481
## 4      0.485    0.515
## 5      0.489    0.511
## 6      0.493    0.507

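This returns one column per class: .pred_Down, the estimated probability of “Down”, and .pred_Up, the estimated probability of “Up”. The two values sum to 1 in each row.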
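To show where these numbers come from, the sketch below recomputes the first row's probability by hand in base R. The coefficients and the predictor values are copied from the summary and glimpse() output above. It then converts the six printed probabilities to classes, assuming the usual 0.5 cutoff.

```r
# Coefficients copied from the summary output above
beta <- c(
  `(Intercept)` = -0.126000,
  Lag1   = -0.073074,
  Lag2   = -0.042301,
  Lag3   =  0.011085,
  Lag4   =  0.009359,
  Lag5   =  0.010313,
  Volume =  0.135441
)

# First observation, read off the glimpse() output
# (the leading 1 multiplies the intercept)
x1 <- c(1, 0.381, -0.192, -2.624, -1.055, 5.010, 1.1913)

eta1  <- sum(beta * x1)  # linear predictor: log-odds of "Up"
p_up1 <- plogis(eta1)    # logistic transform: probability of "Up"
round(p_up1, 3)          # 0.507, matching .pred_Up in row 1

# .pred_Up for the first six rows, copied from the output above
p_up <- c(0.507, 0.481, 0.481, 0.515, 0.511, 0.507)

# Assumed decision rule: predict "Up" when its probability exceeds 0.5
ifelse(p_up > 0.5, "Up", "Down")
```

The resulting classes (Up, Down, Down, Up, Up, Up) agree with the first six rows of `pred_class` in Step 5.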

Step 7: Confusion Matrix

Smarket_pred <- bind_cols(Smarket, pred_class)

conf_mat(Smarket_pred, truth = Direction, estimate = .pred_class)
##           Truth
## Prediction Down  Up
##       Down  145 141
##       Up    457 507
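The counts above can be turned into per-class rates. The following base R sketch re-enters the four counts by hand (it does not use the `conf_mat` object) and computes, for each true class, the share of days the model classified correctly.

```r
# Confusion matrix counts copied from the output above
# (rows = prediction, columns = truth; matrix() fills column by column)
cm <- matrix(
  c(145, 457, 141, 507),
  nrow = 2,
  dimnames = list(Prediction = c("Down", "Up"), Truth = c("Down", "Up"))
)

# Correct predictions sit on the diagonal; column sums are the true class totals
per_class_rate <- diag(cm) / colSums(cm)
round(per_class_rate, 3)

# Share of all days on which the model predicts "Up"
predicted_up_share <- sum(cm["Up", ]) / sum(cm)
round(predicted_up_share, 3)
```

The model identifies about 78% of the Up days but only about 24% of the Down days, because it predicts Up on roughly 77% of all days.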

Step 8: Model Accuracy

accuracy(Smarket_pred, truth = Direction, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.522
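Accuracy is the share of correct predictions, so it can be checked against the confusion matrix. The sketch below re-enters the counts by hand and also computes a naive baseline (always predicting the more frequent class), which is an addition for comparison and not part of the original analysis.

```r
# Confusion matrix counts copied from Step 7
# (rows = prediction, columns = truth)
cm <- matrix(
  c(145, 457, 141, 507),
  nrow = 2,
  dimnames = list(Prediction = c("Down", "Up"), Truth = c("Down", "Up"))
)

# Accuracy: correct predictions (diagonal) over all observations
accuracy_by_hand <- sum(diag(cm)) / sum(cm)
accuracy_by_hand  # 0.5216, which rounds to the 0.522 reported above

# Naive baseline: always predict the most frequent true class
baseline_accuracy <- max(colSums(cm)) / sum(cm)
baseline_accuracy # 0.5184
```

The model is correct on 652 of 1,250 days, only 4 more than the 648 obtained by always predicting Up.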

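Interpretation of Results:

1. The model estimates the probability of the market moving Up or Down. The first six fitted probabilities all lie between 0.48 and 0.52, so the model is rarely confident in either direction.
2. None of the predictors is statistically significant at the 5% level; the smallest p-value is 0.145 (Lag1).
3. Accuracy is a basic measure of performance. Here it is 52.2%, and because it is computed on the same data used to fit the model, it likely overstates out-of-sample performance.
4. The confusion matrix shows how well the model classifies each category. The model predicts Up on 964 of 1,250 days and correctly identifies only 145 of the 602 Down days.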

Conclusion

This analysis demonstrates how classification can be applied to financial data using logistic regression. By modeling the probability of market direction, we are able to transform a complex prediction problem into a binary classification task.

The results show that logistic regression is simple to fit and interpret, but its predictive performance here is weak. Training accuracy is 52.2%, barely above the 51.8% obtained by always predicting Up (648 of 1,250 days), and none of the predictors is statistically significant. This suggests that stock market movements are influenced by many factors beyond the included variables.

Overall, this exercise highlights three key insights. First, classification methods are essential when dealing with categorical outcomes. Second, logistic regression provides a probabilistic framework that improves upon linear regression for such tasks. Third, evaluating model performance using tools like confusion matrices and accuracy is crucial for understanding the effectiveness of predictions.

Future improvements could include using more advanced models, incorporating additional predictors, or applying train-test splits to better evaluate out-of-sample performance.
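As one possible way to implement the train-test suggestion, the sketch below holds out the final year of data. The split by Year (fit on 2001 to 2004, evaluate on 2005) and the names `Smarket_train`, `Smarket_test`, and `lr_fit_train` are assumptions made for illustration; a random split with rsample::initial_split() would be another option.

```r
library(tidymodels)
library(ISLR) # For the Smarket data set

# Assumed split: earlier years for training, 2005 held out for testing
Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test  <- Smarket %>% filter(Year == 2005)

# Same model specification as in Step 1
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Fit on the training years only
lr_fit_train <- lr_spec %>%
  fit(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
      data = Smarket_train)

# Evaluate on data the model has not seen
augment(lr_fit_train, new_data = Smarket_test) %>%
  accuracy(truth = Direction, estimate = .pred_class)
```

Because the test rows play no part in fitting, the resulting accuracy is a fairer estimate of how the model would perform on new data than the training accuracy reported in Step 8.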