BGEN516 - Classification Lab: Logistic Regression

Author
Affiliation

University of Montana

Published

December 10, 2025

0.1 Lab Overview

In this lab, I’ll conduct logistic regression with a dataset on weekly returns. Weekly returns of U.S. stocks, along with trading volume, provide information about market behavior and trends. Analyzing these variables allows us to explore how past weekly returns and trading activity may relate to the market’s direction in the following week. For this analysis, I’ll use lagged returns and volume as predictors. The Weekly dataset is loaded from the ISLR package.

To begin, I’ll examine the data through both numerical and visual summaries and look for patterns that suggest relationships between variables. I’ll then fit a logistic regression model on the full dataset, check its assumptions, and interpret the results. Finally, I’ll split the data into training and testing sets to evaluate a second logistic regression model on held-out data.

0.2 Data Summaries

Let’s start with numerical summaries of the predictor and outcome variables.

Predictor N = 1,089
Lag2
    Mean (SD) 0.15 (2.36)
    Min, Max -18.20, 12.03
Lag3
    Mean (SD) 0.15 (2.36)
    Min, Max -18.20, 12.03
Lag4
    Mean (SD) 0.15 (2.36)
    Min, Max -18.20, 12.03
Lag5
    Mean (SD) 0.14 (2.36)
    Min, Max -18.20, 12.03
Volume
    Mean (SD) 1.57 (1.69)
    Min, Max 0.09, 9.33
Outcome N = 1,089¹
Direction
    Down 484 (44%)
    Up 605 (56%)
¹ n (%)

Looking across the computed statistics for the lagged returns, we can see that they follow a consistent distribution with very similar values for range, average, and spread. The other predictor variable, Volume, has a smaller range (0.09 to 9.33) and a mean of 1.57. Because its standard deviation (1.69) is larger than its mean and its values cannot be negative, the distribution is right-skewed, with some extreme values reaching above 9. Lastly, the outcome variable is slightly unbalanced, with more observations of Up (56%) than Down (44%).

Next, some visualization. Here I’m demonstrating the use of the GGally package which provides functions for plotting multiple variables:

  • ggpairs() - “provides two different comparisons of each pair of columns and displays either the density or count of the respective variable along the diagonal.”
  • ggscatmat() - “creates a matrix with scatterplots in the lower diagonal, densities on the diagonal and correlations written in the upper diagonal.”

Figure 1 is produced with ggscatmat() to take a closer look at the variables that will be included as predictors in the first model: Lag1, Lag2, Lag3, Lag4, Lag5, and Volume.

Figure 1: Pairwise scatterplots and distributions of predictor variables show weak correlations and no clear patterns.

This plot matrix confirms that all lag variables follow a similar distribution (see the diagonal). The scatterplots show no clear relationship between predictor variables, and the weak correlations in the upper triangle point the same way. This is important because strong multicollinearity inflates the standard errors of logistic regression coefficients and makes individual predictors hard to interpret.

Figure 2 is produced with ggpairs() to take a closer look at the variables that will be included as predictors in addition to how the outcome Direction varies across these predictors.

Figure 2: Distributions and pairwise relationships of predictor variables show little difference between Down and Up weeks.

The plot matrix is a bit cluttered, but we can look for patterns at a glance. The predictor variables have very similar patterns and overlapping values for both Down and Up. This suggests that the selected predictors may not provide much insight into market movement.

0.3 Model 1: Logistic Regression with Full Dataset

Okay, I’m ready to perform a logistic regression with Direction as the response, and the five lag variables plus Volume as predictors.
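
Written out, the model takes the form below. This is a sketch in standard notation rather than output from the lab: \(p\) (the probability that Direction is Up) and the coefficients \(\beta_0, \dots, \beta_6\) are symbols introduced here for illustration.

\[
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\mathrm{Lag1} + \beta_2\,\mathrm{Lag2} + \beta_3\,\mathrm{Lag3} + \beta_4\,\mathrm{Lag4} + \beta_5\,\mathrm{Lag5} + \beta_6\,\mathrm{Volume}
\]

The odds ratio for a predictor in a model of this form is \(e^{\beta_j}\), so an odds ratio of 1 corresponds to a coefficient of 0, meaning no association with the odds of Up.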

Predictor OR 95% CI p-value
Lag1 0.96 0.91, 1.01 0.118
Lag2 1.06 1.01, 1.12 0.030
Lag3 0.98 0.93, 1.04 0.547
Lag4 0.97 0.92, 1.02 0.294
Lag5 0.99 0.94, 1.04 0.583
Volume 0.98 0.91, 1.05 0.538
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

From the model summary, we can see that a single predictor, Lag2, is significant at \(p =\) 0.03. Lag2 has an odds ratio of 1.06, which means that for every one-unit increase in Lag2 (one percentage point of return two weeks earlier), the odds of the market going Up increase by about 6%, holding the other predictors constant.
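
The arithmetic behind this interpretation can be sketched in a few lines. The lab itself is written in R; the sketch uses plain Python only because the calculation is language-neutral. It starts from the rounded odds ratio in the table, so the recovered coefficient is approximate, and the variable names are illustrative.

```python
import math

# Odds ratio for Lag2 as reported (rounded) in the model table.
odds_ratio = 1.06

# A logistic regression coefficient is the natural log of its odds ratio.
beta = math.log(odds_ratio)

# Percent change in the odds of Up for a one-unit increase in Lag2.
pct_change_one_unit = (odds_ratio - 1) * 100

# Odds ratios multiply, so a k-unit increase scales the odds by odds_ratio ** k.
k = 2
odds_ratio_k_units = odds_ratio ** k

print(f"approximate coefficient (log-odds): {beta:.4f}")
print(f"change in odds per unit: {pct_change_one_unit:.1f}%")
print(f"odds ratio for a {k}-unit increase: {odds_ratio_k_units:.4f}")
```

Note that the 6% applies to the odds, not to the probability of Up. The change in probability depends on where on the logistic curve an observation sits.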

All other predictors fail to meet the conventional significance threshold of \(p\) < .05. This result aligns with what we saw in the visual data summaries: the selected predictors do not appear to predict market movement.

0.3.1 Confusion Matrix

So how did the model actually perform? The confusion matrix provides information about correct and incorrect classifications:

Table 1: Confusion matrix showing the model’s predicted versus actual class labels.
                 Actual
Predicted    Down     Up
Down           54     48
Up            430    557

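Recall that the diagonal represents correct predictions. The model correctly predicted that the market would go up in 557 weeks and that it would go down in 54 weeks, for a total of 611 correct predictions out of 1,089. So the model correctly predicted the movement of the market 56% of the time. In other words, the model is only performing slightly better than random chance, and barely better than always predicting Up, which would be right in 605 of 1,089 weeks (also about 56%). This accuracy is also measured on the same data used to fit the model, so it is likely optimistic.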
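
As a quick check on the accuracy figure, here is a minimal sketch that recomputes it from the four cells of Table 1 and compares it with a majority-class baseline. It assumes that rows are predictions and columns are actual classes, which is consistent with the 484 Down and 605 Up counts in Section 0.2. The variable names are illustrative.

```python
# Cells of Table 1, keyed as (predicted, actual).
# Assumption: rows are predictions and columns are actual classes.
table1 = {
    ("Down", "Down"): 54,
    ("Down", "Up"): 48,
    ("Up", "Down"): 430,
    ("Up", "Up"): 557,
}

total = sum(table1.values())

# Correct predictions sit on the diagonal (predicted == actual).
correct = sum(n for (pred, actual), n in table1.items() if pred == actual)
accuracy = correct / total

# Baseline: always predict whichever actual class is more common.
actual_totals = {}
for (pred, actual), n in table1.items():
    actual_totals[actual] = actual_totals.get(actual, 0) + n
baseline = max(actual_totals.values()) / total

print(f"accuracy: {correct}/{total} = {accuracy:.3f}")
print(f"majority-class baseline: {baseline:.3f}")
```

Comparing against the majority-class baseline is a stricter yardstick than 50% when the classes are unbalanced, as they are here.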

So what types of mistakes, or errors, are being made by the model? There are two types of errors: false positives and false negatives. Treating Up as the positive class, a false positive is a case where the model predicts Up when the actual value is Down. In contrast, a false negative is a case where the model predicts Down when the actual value is Up. In Table 1, rows are predictions and columns are actual classes, so there are 430 false positives and 48 false negatives.

  • False positive error rate: \(430 \div 484 \approx 0.89\)

  • False negative error rate: \(48 \div 605 \approx 0.08\)

Note

Unsure about where 484 and 605 came from? We just summed the values in the columns of our confusion matrix to get the actual class counts: 54+430=484 and 48+557=605. These match the Down and Up counts in Section 0.2.

The two error rates are very different, at 89% and 8%. The model predicts Up for 987 of the 1,089 weeks, so it misclassifies most Down weeks while rarely missing an Up week. Reading across the rows instead, 48 of the 102 Down predictions (47%) and 430 of the 987 Up predictions (44%) were wrong, so neither kind of prediction is reliable.
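
Error rates can be taken relative to the actual class totals or relative to the prediction totals, and the two answer different questions. The sketch below computes both from Table 1. It assumes Up is the positive class and that rows are predictions and columns are actual classes (consistent with the 484 Down and 605 Up counts in Section 0.2). The variable names are illustrative.

```python
# Table 1 cells with Up as the positive class.
# Assumption: rows are predictions and columns are actual classes.
true_pos = 557   # predicted Up, actual Up
false_pos = 430  # predicted Up, actual Down
false_neg = 48   # predicted Down, actual Up
true_neg = 54    # predicted Down, actual Down

# Relative to actual classes (column totals): how often is each class missed?
false_positive_rate = false_pos / (false_pos + true_neg)  # share of Down weeks called Up
false_negative_rate = false_neg / (false_neg + true_pos)  # share of Up weeks called Down

# Relative to predictions (row totals): how often is each kind of call wrong?
wrong_up_calls = false_pos / (false_pos + true_pos)
wrong_down_calls = false_neg / (false_neg + true_neg)

print(f"false positive rate: {false_positive_rate:.2f}")
print(f"false negative rate: {false_negative_rate:.2f}")
print(f"share of Up predictions that were wrong: {wrong_up_calls:.2f}")
print(f"share of Down predictions that were wrong: {wrong_down_calls:.2f}")
```

Dividing by 102 and 987 (the row totals) gives the share of wrong predictions of each kind, while dividing by 484 and 605 (the column totals) gives the false positive and false negative rates.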

0.3.2 Model Assumptions

  1. Binary outcome: The response variable must be binary (0/1, Yes/No, Up/Down).

    This assumption is met as there are only two possible outcomes: Up or Down.

  2. Independence of observations: Each observation is assumed to be independent of the others.

    This assumption is not strictly met as we are working with time series data. Results should be interpreted with caution.

  3. Little or no multicollinearity: Predictors should not be highly correlated with each other.

    This assumption is met as predictors are not highly correlated with each other (see correlations computed in Section 0.2).

  4. Linearity of the logit: The log-odds of the outcome is assumed to be a linear combination of the predictors.

    This assumption is partially met (see Figure 3). Lag2 appears linear. Volume appears somewhat linear. The other predictors clearly do not have a linear relationship with the log odds of Direction.

Figure 3: Predicted logit versus predictors. Only Lag2 exhibits a clear linear relationship with the logit.

  5. Sufficient sample size: Logistic regression relies on maximum likelihood estimation, which works best with a reasonably large sample (rule of thumb: at least 10 cases of the less frequent outcome per predictor).

    This assumption is met. We have six predictors and thus need at least 60 cases. The data contains 1,089 observations, and even the less frequent outcome, Down, has 484, far exceeding our required minimum.
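
The sample-size check can be sketched as a short calculation. This version applies a stricter reading of the rule of thumb, counting only the less frequent outcome instead of all observations. The threshold of 10 is the rule of thumb quoted above, and the variable names are illustrative.

```python
# Class counts from the outcome summary in Section 0.2.
class_counts = {"Down": 484, "Up": 605}
n_predictors = 6              # Lag1 to Lag5 plus Volume
min_cases_per_predictor = 10  # rule-of-thumb threshold, not a hard requirement

required_cases = n_predictors * min_cases_per_predictor

# Stricter reading: only the less frequent outcome counts toward the rule.
smaller_class = min(class_counts.values())
cases_per_predictor = smaller_class / n_predictors
rule_met = smaller_class >= required_cases

print(f"required cases: {required_cases}")
print(f"cases per predictor (less frequent class): {cases_per_predictor:.1f}")
print(f"rule of thumb met: {rule_met}")
```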

0.4 Model 2: Training and Testing with Logistic Regression

Next, I’ll fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. I’ll then generate the confusion matrix and the overall fraction of correct predictions for the held-out data (that is, the data from 2009 and 2010).

Stated another way, I’ll train and test the model on two separate subsets of the data: training is performed using only observations from the dates before 2009, and testing is performed using only the dates in 2009 and 2010. I’ll compute the predictions for 2009-10 and compare them to the actual movements of the market over that time period.
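
The split itself is a filter on the calendar year. Here is a minimal sketch of the idea in plain Python. The lab is written in R, and the rows below are made-up values for illustration only, not observations from Weekly. The `Year` field and the helper `split_by_year` are assumptions introduced for this example.

```python
def split_by_year(rows, first_test_year):
    """Return (train, test): rows before the cutoff year, and rows from it onward."""
    train = [row for row in rows if row["Year"] < first_test_year]
    test = [row for row in rows if row["Year"] >= first_test_year]
    return train, test


# Made-up rows for illustration only; these are NOT values from the Weekly data.
toy_rows = [
    {"Year": 2007, "Lag2": 0.4, "Direction": "Up"},
    {"Year": 2008, "Lag2": -1.2, "Direction": "Down"},
    {"Year": 2009, "Lag2": 0.9, "Direction": "Up"},
    {"Year": 2010, "Lag2": -0.3, "Direction": "Down"},
]

train, test = split_by_year(toy_rows, first_test_year=2009)
print("training years:", [row["Year"] for row in train])
print("testing years:", [row["Year"] for row in test])
```

Splitting by date instead of at random keeps every test observation later than every training observation, which mirrors how the model would be used to forecast.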

Let’s check out the confusion matrix:

Table 2: Confusion matrix showing model predictions versus actual classes on the test dataset after training.
                 Actual
Predicted    Down     Up
Down            9      5
Up             34     56

The model correctly predicted that the market would go up in 56 weeks and that it would go down in 9 weeks, for a total of 65 correct predictions out of 104. The model correctly predicted the movement of the market 62.5% of the time. It was incorrect 37.5% of the time.
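
With only 104 test weeks, this accuracy estimate carries a fair amount of uncertainty. The sketch below gives a rough sense of it using a normal-approximation 95% interval, a simplification chosen here for illustration and not part of the lab. It also assumes that the columns of Table 2 hold the actual classes, as in Table 1.

```python
import math

# Counts from Table 2.
correct = 9 + 56       # diagonal cells
n_test = 9 + 5 + 34 + 56
actual_up = 5 + 56     # assumption: columns are actual classes, as in Table 1

accuracy = correct / n_test

# Normal-approximation standard error for a proportion.
standard_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
z = 1.96  # approximate 95% coverage
lower = accuracy - z * standard_error
upper = accuracy + z * standard_error

# Baseline: always predict Up on the test weeks.
baseline = actual_up / n_test

print(f"test accuracy: {accuracy:.3f}")
print(f"approximate 95% interval: ({lower:.3f}, {upper:.3f})")
print(f"always-Up baseline: {baseline:.3f}")
```

Under these assumptions the always-Up baseline falls inside the interval, so the test set is too small to show clearly that the model beats it.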

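This model used only Lag2 as a predictor and was evaluated on a training and testing split, unlike the previous model, which used all predictors on the full dataset. This simpler model is easier to understand than the previous model with multiple predictors and achieved higher accuracy (62.5% versus 56%). The two figures are not directly comparable, however, because the first model was scored on the same data used to fit it while this one was scored on held-out data.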