HW5_Maulana Ahmad Fahrezi_114035115_Application of Financial Software

4.1 We load tidymodels for the modeling functions, ISLR and ISLR2 for the data sets, discrim for discriminant analysis models such as LDA and QDA as well as the naive Bayes model, and poissonreg for Poisson regression. We also load corrr and paletteer, which are used for the correlation plots below.

library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.5.3

## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──

## ✔ broom        1.0.10     ✔ recipes      1.3.1 
## ✔ dials        1.4.2      ✔ rsample      1.3.1 
## ✔ dplyr        1.1.4      ✔ tailor       0.1.0 
## ✔ ggplot2      4.0.2      ✔ tidyr        1.3.1 
## ✔ infer        1.1.0      ✔ tune         2.0.1 
## ✔ modeldata    1.5.1      ✔ workflows    1.3.0 
## ✔ parsnip      1.4.1      ✔ workflowsets 1.1.1 
## ✔ purrr        1.1.0      ✔ yardstick    1.3.2

## Warning: package 'dials' was built under R version 4.5.3

## Warning: package 'ggplot2' was built under R version 4.5.3

## Warning: package 'infer' was built under R version 4.5.3

## Warning: package 'modeldata' was built under R version 4.5.3

## Warning: package 'parsnip' was built under R version 4.5.3

## Warning: package 'tailor' was built under R version 4.5.3

## Warning: package 'tune' was built under R version 4.5.3

## Warning: package 'workflows' was built under R version 4.5.3

## Warning: package 'workflowsets' was built under R version 4.5.3

## Warning: package 'yardstick' was built under R version 4.5.3

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.5.3

library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.5.3

## 
## Attaching package: 'ISLR2'

## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit

library(discrim)

## Warning: package 'discrim' was built under R version 4.5.3

## 
## Attaching package: 'discrim'

## The following object is masked from 'package:dials':
## 
##     smoothness

library(poissonreg)

## Warning: package 'poissonreg' was built under R version 4.5.3

library(corrr)

## Warning: package 'corrr' was built under R version 4.5.3

library(paletteer)

## Warning: package 'paletteer' was built under R version 4.5.3

cor_Smarket <- Smarket %>%
  select(-Direction) %>%
  correlate()

## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'

rplot(cor_Smarket, colours = c("indianred2", "black", "skyblue1"))

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the corrr package.
##   Please report the issue at <https://github.com/tidymodels/corrr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

cor_Smarket %>%
  stretch() %>%
  ggplot(aes(x, y, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(fashion(r)))) +
  scale_fill_paletteer_c("scico::roma", limits = c(-1, 1), direction = -1)

ggplot(Smarket, aes(Year, Volume)) +
  geom_jitter(height = 0)

# Load dataset 
data(Smarket)

# Convert response variable to factor 
Smarket$Direction <- as.factor(Smarket$Direction)

# View structure
glimpse(Smarket)

## Rows: 1,250
## Columns: 9
## $ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume    <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …

4.2 Logistic Regression

Logistic regression is used for classification problems where the response variable is binary. Instead of predicting a numeric value, it estimates the probability that an observation belongs to a particular class.
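For reference, the standard form of the model can be written as follows. The notation (coefficients \beta_0, ..., \beta_p and predictors X_1, ..., X_p) is an assumption introduced here for illustration; in this document the predictors would be Lag1 through Lag5 and Volume, and the modeled class is Up.

```latex
p(X) = \Pr(\text{Direction} = \text{Up} \mid X)
     = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}
            {1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}},
\qquad
\log\frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p
```

The first expression always lies between 0 and 1, and the second shows that the log-odds, not the probability itself, is linear in the predictors.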

Why Not Linear Regression?

Linear regression is not suitable for classification because:

- It can produce predictions outside the range [0, 1]
- It does not model probabilities correctly
- The relationship between predictors and probability is nonlinear
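A minimal sketch of the first and third points, using simulated data rather than Smarket. The variable names and simulation settings are illustrative assumptions, not part of the original analysis. It fits a linear regression and a logistic regression to the same 0/1 response and compares the range of the fitted values.

```r
# Simulated binary response; all names and settings here are illustrative.
set.seed(1)
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(1.5 * x))

# Linear regression treats the 0/1 response as a numeric outcome.
lin_fit <- lm(y ~ x)
lin_pred <- fitted(lin_fit)

# Logistic regression models the log-odds as linear in x.
log_fit <- glm(y ~ x, family = binomial)
log_pred <- fitted(log_fit)

range(lin_pred)  # extends below 0 and above 1
range(log_pred)  # stays strictly between 0 and 1
```

The linear fit has to extrapolate a straight line through an S-shaped relationship, so its fitted values leave the unit interval at the extremes of x, while the logistic fit cannot.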

Step 1: Define Logistic Regression Model

lr_spec <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

Step 2: Fit the Model

lr_fit <- lr_spec %>% 
  fit(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
      data = Smarket)

Step 3: Model Summary

lr_fit %>%
  pluck("fit") %>%
  summary()

## 
## Call:
## stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume, family = stats::binomial, data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

This output provides:

- Coefficient estimates
- Statistical significance (p-values)
- Model fit statistics
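As a rough check on how the columns of the coefficient table relate to each other, the sketch below recomputes the z value and p-value for Lag1 from its estimate and standard error, and converts the estimate to an odds ratio. The two input numbers are copied from the summary above; the object names are illustrative.

```r
# Estimate and standard error for Lag1, copied from the summary above.
estimate  <- -0.073074
std_error <-  0.050167

# Wald z statistic and two-sided p-value under a standard normal reference.
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))

# Exponentiating a coefficient gives the multiplicative change in the odds
# of "Up" for a one-unit increase in Lag1, holding the other predictors fixed.
odds_ratio <- exp(estimate)

round(c(z = z_value, p = p_value, odds_ratio = odds_ratio), 3)
```

The recomputed z and p agree with the z value and Pr(>|z|) columns for Lag1. The odds ratio is just below 1, so a higher return on the previous day is associated with slightly lower odds of the market going Up, although the effect is not statistically significant.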

augment(lr_fit, new_data = Smarket) %>%
  conf_mat(truth = Direction, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Step 4: Tidy Model Output

tidy(lr_fit)

## # A tibble: 7 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.126      0.241     -0.523   0.601
## 2 Lag1        -0.0731     0.0502    -1.46    0.145
## 3 Lag2        -0.0423     0.0501    -0.845   0.398
## 4 Lag3         0.0111     0.0499     0.222   0.824
## 5 Lag4         0.00936    0.0500     0.187   0.851
## 6 Lag5         0.0103     0.0495     0.208   0.835
## 7 Volume       0.135      0.158      0.855   0.392

tidy() returns the same coefficient estimates as a tibble, which is easier to read and to reuse in further computations.

Step 5: Predictions (Class)

pred_class <- predict(lr_fit, new_data = Smarket)
head(pred_class)

## # A tibble: 6 × 1
##   .pred_class
##   <fct>      
## 1 Up         
## 2 Down       
## 3 Down       
## 4 Up         
## 5 Up         
## 6 Up

Step 6: Predictions (Probabilities)

pred_prob <- predict(lr_fit, new_data = Smarket, type = "prob")
head(pred_prob)

## # A tibble: 6 × 2
##   .pred_Down .pred_Up
##        <dbl>    <dbl>
## 1      0.493    0.507
## 2      0.519    0.481
## 3      0.519    0.481
## 4      0.485    0.515
## 5      0.489    0.511
## 6      0.493    0.507

This returns two columns: .pred_Down, the probability of “Down”, and .pred_Up, the probability of “Up”. The two probabilities in each row sum to 1.
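To show where these probabilities come from, the sketch below reproduces the first row by hand in base R. The coefficients are copied from the model summary and the predictor values from the first observation shown by glimpse(). The object names, and the 0.5 cutoff used for the class label, are assumptions made for illustration.

```r
# Coefficients copied from the model summary above.
beta <- c(`(Intercept)` = -0.126000, Lag1 = -0.073074, Lag2 = -0.042301,
          Lag3 = 0.011085, Lag4 = 0.009359, Lag5 = 0.010313,
          Volume = 0.135441)

# First row of Smarket as shown by glimpse(); the leading 1 is the intercept.
x1 <- c(1, 0.381, -0.192, -2.624, -1.055, 5.010, 1.1913)

# Linear predictor (log-odds of "Up"), then the inverse logit.
log_odds <- sum(beta * x1)
p_up <- plogis(log_odds)
p_down <- 1 - p_up

# Class label using a 0.5 cutoff on the probability of "Up" (assumed here).
pred <- if (p_up > 0.5) "Up" else "Down"

round(c(.pred_Down = p_down, .pred_Up = p_up), 3)
pred
```

The rounded values match the first row of the probability output (0.493 and 0.507), and the label matches the first class prediction in Step 5.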

Step 7: Confusion Matrix

Smarket_pred <- bind_cols(Smarket, pred_class)

conf_mat(Smarket_pred, truth = Direction, estimate = .pred_class)

##           Truth
## Prediction Down  Up
##       Down  145 141
##       Up    457 507

Step 8: Model Accuracy

accuracy(Smarket_pred, truth = Direction, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.522
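The accuracy can also be recovered directly from the confusion matrix counts in Step 7. The sketch below does this in base R and adds two quantities that are not in the original output, included only for comparison: the share of each true class that is classified correctly, and the accuracy of always predicting Up. Object names are illustrative.

```r
# Counts copied from the confusion matrix in Step 7
# (rows: prediction, columns: truth; matrix() fills column by column).
cm <- matrix(c(145, 457, 141, 507), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Truth = c("Down", "Up")))

# Accuracy: correctly classified days over all days.
accuracy_manual <- sum(diag(cm)) / sum(cm)

# Share of each true class that the model classifies correctly.
recall_by_class <- diag(cm) / colSums(cm)

# Accuracy of always predicting "Up", for comparison.
baseline_up <- sum(cm[, "Up"]) / sum(cm)

round(accuracy_manual, 3)
round(recall_by_class, 3)
round(baseline_up, 3)
```

The manual accuracy matches the 0.522 reported by accuracy(). Always predicting Up would already score about 0.518, so the fitted model adds very little over that naive rule.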

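Interpretation of Results:

1. The model estimates the probability that the market moves Up or Down on a given day.
2. None of the predictors is statistically significant at the 5% level; the smallest p-value is 0.145, for Lag1.
3. Accuracy is 0.522, only slightly better than chance, and it is computed on the same data used to fit the model, so it likely overstates performance on new data.
4. The confusion matrix shows how the model classifies each category: it predicts Up on 964 of the 1,250 days and correctly identifies only 145 of the 602 Down days.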

Conclusion

This analysis demonstrates how classification can be applied to financial data using logistic regression. By modeling the probability of market direction, we are able to transform a complex prediction problem into a binary classification task.

The results show that logistic regression is a simple and interpretable method for classification: it keeps the log-odds linear in the predictors while allowing the probability itself to vary nonlinearly. However, the model’s predictive performance is limited (accuracy of about 52% on the data used for fitting), suggesting that stock market movements are influenced by many factors beyond the included variables.

Overall, this exercise highlights three key insights. First, classification methods are essential when dealing with categorical outcomes. Second, logistic regression provides a probabilistic framework that improves upon linear regression for such tasks. Third, evaluating model performance using tools like confusion matrices and accuracy is crucial for understanding the effectiveness of predictions.

Maulana Ahmad Fahrezi

2026-03-28