Multiple Logistic Regression: Insights and Applications
When I model binary outcomes using multiple predictors, I recognize that extending the logistic regression framework from a single predictor to multiple predictors is crucial for capturing complex relationships. The general model expresses the log-odds of the outcome as a linear combination of the predictors:

log(p(X) / (1 - p(X))) = β0 + β1X1 + β2X2 + ... + βpXp

where p(X) is the probability that the outcome equals 1 given the predictors X1, ..., Xp.
When I work with logistic regression, I always focus on predicting the probability of a binary outcome based on the predictors I include in the model. I like that logistic regression transforms the relationship between the predictors and the probability into a form that ensures probabilities remain valid, always between 0 and 1. This makes the model not only practical but also intuitive for me to interpret.
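To make that concrete, here is a tiny sketch, with made-up input values, showing that the logistic transform maps any linear-predictor value into the interval between 0 and 1; plogis() is base R's logistic function.

# Toy illustration: plogis() squashes any linear predictor into (0, 1).
linear_predictor <- c(-100, -5, 0, 5, 100) # deliberately extreme, made-up values
plogis(linear_predictor)                   # same as 1 / (1 + exp(-linear_predictor))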
I find it particularly helpful to think about probabilities in terms of odds. When I calculate odds, I’m comparing the likelihood of an event happening to the likelihood of it not happening. By taking the logarithm of those odds, I create a linear relationship between the predictors I use and the transformed outcome. For me, this is a neat and logical way to make sense of how each predictor influences the overall outcome.
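As a quick numeric sketch with a made-up probability, here is that probability-to-odds-to-log-odds chain in R:

# Probability -> odds -> log-odds, and back again.
p <- 0.75
odds <- p / (1 - p)   # 3: the event is three times as likely to occur as not
log_odds <- log(odds) # about 1.10: the scale on which the model is linear in the predictors
plogis(log_odds)      # back-transforming recovers the original probability, 0.75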
When I estimate the coefficients, I rely on maximum likelihood estimation because it chooses the coefficients that make the observed outcomes as probable as possible under the model. I like how this method gives me a way to quantify the impact of each predictor while keeping the model interpretable. Every time I analyze the results, I pay close attention to what the coefficients tell me about how each predictor relates to the outcome. It’s this balance of mathematical rigor and practical interpretability that makes logistic regression one of my go-to tools.
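For intuition, here is a minimal sketch of what maximum likelihood estimation does behind glm(): it searches for the coefficients that make the observed 0/1 outcomes most probable. The toy data and starting values below are my own and are separate from the analysis that follows.

# Toy single-predictor example: maximize the Bernoulli log-likelihood by hand.
set.seed(1)
x_toy <- rnorm(200)
y_toy <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x_toy))

neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x_toy
  -sum(y_toy * eta - log(1 + exp(eta))) # negative Bernoulli log-likelihood
}

optim(c(0, 0), neg_log_lik)$par             # rough MLE from a general-purpose optimizer
coef(glm(y_toy ~ x_toy, family = binomial)) # glm's maximum likelihood fit, for comparison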
Analyzing the Role of Predictors
Confounding and Correlations
When predictors are correlated, the coefficients can reveal different relationships depending on whether one or multiple predictors are included in the model. For example, consider a case where we predict the probability of credit default using balance, income, and student status (a categorical variable encoded as a dummy variable). The estimated coefficients might look like this (these are the same values I later use to generate the simulated data in the code below):

Intercept: -10.869
Balance: 0.0057
Income: 0.003
Student (Yes): -0.6468
The coefficients tell a story about the relationships:
Balance has a strong, positive association with default. A higher balance increases the log-odds of default, which aligns with financial intuition.
Student Status shows a negative coefficient, meaning that students are less likely to default than non-students, given the same balance and income (see the quick odds-ratio calculation after this list).
Income, interestingly, appears insignificant in predicting default after accounting for the other variables.
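One way I like to read that student coefficient is on the odds scale: exponentiating the illustrative value of -0.6468 listed above gives an odds ratio.

# Odds ratio implied by the illustrative student coefficient above.
exp(-0.6468) # roughly 0.52: at the same balance and income, being a student roughly halves the odds of default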
The Paradox of Single vs. Multiple Predictors
The paradox arises when we compare results from single-predictor and multiple-predictor models. In a single-variable logistic regression, student status might appear positively associated with default rates because students generally carry higher balances, which correlate with higher default rates. However, in the multiple logistic regression, the coefficient for student status turns negative when balance is accounted for. This is an example of confounding, where the relationship between one variable and the outcome is influenced by another variable.
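A short, self-contained simulation makes this paradox tangible. Everything here (sample size, coefficients, the link between student status and balance) is made up purely for illustration and is separate from the dataset I build below; the point is only that when students carry higher balances, the student-only model can show a positive coefficient while the adjusted coefficient is negative.

# Toy demonstration of confounding: student status raises balance, balance raises default.
set.seed(42)
student_toy <- rbinom(5000, size = 1, prob = 0.3)
balance_toy <- rnorm(5000, mean = 1200 + 600 * student_toy, sd = 400) # students hold larger balances
default_toy <- rbinom(5000, size = 1,
                      prob = plogis(-10.5 + 0.0057 * balance_toy - 0.8 * student_toy))

coef(glm(default_toy ~ student_toy, family = binomial))               # marginal: student looks riskier
coef(glm(default_toy ~ balance_toy + student_toy, family = binomial)) # adjusted: sign flips to negative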
Predictions from the Model
Using the estimated coefficients, I can calculate the predicted probability of default for specific individuals. For example:
Student with a Balance of $1,500 and an Income of $40,000: the fitted model below predicts a default probability of about 4%.
Non-Student with the Same Balance and Income: the predicted probability rises to about 10%.
These predictions show that, for the same balance and income, students are less likely to default than non-students.
# Load necessary libraries
# I need ggplot2 for visualization and dplyr for data manipulation. These are essential for clean, clear analysis.
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("MASS", quietly = TRUE)) install.packages("MASS")
library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# Step 1: Simulating the Data
# I’m creating a dataset that mimics real-world conditions where default depends on balance, income, and student status.
# This allows me to control the complexity of the data while ensuring it reflects logical relationships.
set.seed(123) # Setting a seed so I can reproduce these results every time I run the code.
n <- 1000 # I chose 1000 observations to ensure sufficient data for a robust logistic regression model.

# Simulating the predictors
balance <- rnorm(n, mean = 1500, sd = 500) # Balance values are normally distributed around $1500.
income <- rnorm(n, mean = 50, sd = 15) # Income is in thousands, making it realistic for this scenario.
student <- rbinom(n, size = 1, prob = 0.3) # About 30% of the data represents students, a reasonable proportion.

# Logistic function for default
# Using a logistic function ensures that probabilities remain between 0 and 1, which is critical for a binary outcome.
logit <- -10.869 + 0.0057 * balance + 0.003 * income - 0.6468 * student
default <- rbinom(n, size = 1, prob = 1 / (1 + exp(-logit))) # Generating the binary response variable.

# Combining the data into a single frame for easier manipulation and visualization.
data <- data.frame(
  balance,
  income,
  student = factor(student, levels = c(0, 1), labels = c("Non-Student", "Student")),
  default
)
head(data) # Checking the first few rows to ensure the data looks as expected.
# Step 2: Exploratory Data Analysis
# I want to visualize the relationship between balance and default to get a sense of the data distribution and potential trends.
ggplot(data, aes(x = balance, y = default)) +
  geom_jitter(height = 0.1, alpha = 0.5) + # Adding jitter to avoid overlapping points and make patterns more visible.
  labs(
    title = "Scatterplot of Balance vs Default",
    x = "Balance",
    y = "Default (0 = No, 1 = Yes)"
  ) +
  theme_minimal() # Using a minimal theme for clarity.
# Step 3: Fitting the Multiple Logistic Regression Model
# Here, I include balance, income, and student status as predictors to account for their combined effects on default.
logistic_model <- glm(default ~ balance + income + student, family = binomial, data = data)
summary(logistic_model) # Viewing the model summary to interpret coefficients and significance levels.
Call:
glm(formula = default ~ balance + income + student, family = binomial,
data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.140e+01 8.300e-01 -13.736 < 2e-16 ***
balance 5.903e-03 4.146e-04 14.236 < 2e-16 ***
income 9.725e-03 6.862e-03 1.417 0.156
studentStudent -9.908e-01 2.513e-01 -3.942 8.08e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1126.86 on 999 degrees of freedom
Residual deviance: 584.72 on 996 degrees of freedom
AIC: 592.72
Number of Fisher Scoring iterations: 6
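Since log-odds coefficients are hard to read directly, I sometimes exponentiate them to get odds ratios. This is a small optional step reusing logistic_model from the chunk above; confint.default() gives quick Wald-type intervals rather than profile-likelihood ones.

# Coefficients on the odds-ratio scale, with approximate 95% Wald confidence intervals.
exp(coef(logistic_model))
exp(confint.default(logistic_model))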
# Step 4: Predictions
# To understand the impact of predictors, I calculate predicted probabilities for specific cases.
student_pred <- data.frame(balance = 1500, income = 40, student = "Student") # Case: a student with $1500 balance and $40k income.
non_student_pred <- data.frame(balance = 1500, income = 40, student = "Non-Student") # Non-student counterpart.

# Using the model to predict probabilities for these cases.
student_prob <- predict(logistic_model, newdata = student_pred, type = "response")
non_student_prob <- predict(logistic_model, newdata = non_student_pred, type = "response")

# Comparing predictions to understand the effect of student status while holding balance and income constant.
cat("Predicted probability of default for a student with balance = $1500 and income = $40k: ", round(student_prob, 3), "\n")
Predicted probability of default for a student with balance = $1500 and income = $40k: 0.041
cat("Predicted probability of default for a non-student with balance = $1500 and income = $40k: ", round(non_student_prob, 3), "\n")
Predicted probability of default for a non-student with balance = $1500 and income = $40k: 0.104
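As a sanity check, the same two probabilities can be computed by hand from the fitted coefficients with plogis(), which should reproduce the predict() output above.

# Hand calculation of the two predicted probabilities from the fitted coefficients.
b <- coef(logistic_model) # order: (Intercept), balance, income, studentStudent
plogis(b[1] + b[2] * 1500 + b[3] * 40 + b[4]) # student: about 0.041
plogis(b[1] + b[2] * 1500 + b[3] * 40)        # non-student: about 0.104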
# Step 5: Visualization of Predicted Probabilities
# I want to visualize how the predicted probabilities change across a range of balance values.
balance_seq <- seq(min(data$balance), max(data$balance), length.out = 1000) # Generating a sequence of balance values.
predicted_probs <- predict(
  logistic_model,
  newdata = data.frame(balance = balance_seq, income = mean(data$income), student = "Student"),
  type = "response"
)

# Plotting the logistic regression curve to show the predicted probabilities.
ggplot(data, aes(x = balance, y = default)) +
  geom_jitter(height = 0.1, alpha = 0.5) +
  geom_line(
    data = data.frame(balance = balance_seq, predicted_probs),
    aes(x = balance, y = predicted_probs),
    color = "blue", linewidth = 1 # linewidth replaces the deprecated size aesthetic for lines (ggplot2 >= 3.4.0).
  ) +
  labs(
    title = "Logistic Regression: Predicted Probability of Default",
    x = "Balance",
    y = "Probability of Default"
  ) +
  theme_minimal()
# Step 6: Confounding Analysis
# I suspect that balance and student status are correlated, so I explore their relationship to uncover potential confounding effects.
# Visualizing default rates by student status over balance.
ggplot(data, aes(x = balance, y = default, color = student)) +
  geom_smooth(method = "loess", se = FALSE) + # Using a smooth line to show trends clearly.
  labs(
    title = "Default Rates by Student Status",
    x = "Balance",
    y = "Default Rate"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# Checking the distribution of balance across student and non-student groups.
# Boxplots make it easy to see if students generally have higher balances, which could explain their higher default rates.
ggplot(data, aes(x = student, y = balance, fill = student)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Balance by Student Status",
    x = "Student Status",
    y = "Balance"
  ) +
  theme_minimal()