Logistic Regression: Modeling the Probability of Default

Author

Avery Holloman


Abstract

Logistic regression is a powerful method for modeling binary outcomes. Unlike linear regression, logistic regression uses the logistic function to ensure predicted probabilities stay within the range [0, 1]. In this analysis, I apply logistic regression to predict the probability of credit default based on balance. I explain the logistic model’s formulation and fit it to a simulated dataset.
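
To make the bounded-probability point concrete, here is a minimal illustrative sketch (not part of the analysis that follows) showing that the logistic transformation maps any linear-predictor value into the open interval (0, 1):

# Illustrative only: the logistic function keeps outputs strictly between 0 and 1
x <- seq(-6, 6, by = 0.5)   # hypothetical linear-predictor values
p <- 1 / (1 + exp(-x))      # logistic transformation
range(p)                    # always strictly between 0 and 1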

# Load necessary libraries
#if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
#if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(ggplot2)
library(dplyr)

# Step 1: Simulating the Data
# I simulate a dataset where default depends on balance
set.seed(42)  # Ensuring reproducibility

n <- 500  # Number of observations
balance <- rnorm(n, mean = 1500, sd = 500)  # Simulating balance values
default <- rbinom(n, size = 1, prob = 1 / (1 + exp(-(-5 + 0.003 * balance))))  # Logistic model for default

data <- data.frame(balance = balance, default = default)
head(data)  # Display the first few rows
   balance default
1 2185.479       0
2 1217.651       0
3 1681.564       0
4 1816.431       1
5 1702.134       1
6 1446.938       0
# Step 2: Exploratory Data Analysis
# Visualizing the relationship between balance and default
ggplot(data, aes(x = balance, y = default)) +
  geom_jitter(height = 0.1, alpha = 0.5) +
  labs(
    title = "Scatterplot of Balance vs Default",
    x = "Balance",
    y = "Default (0 = No, 1 = Yes)"
  ) +
  theme_minimal()

# Step 3: Logistic Regression Model
# Fitting the logistic regression model
logistic_model <- glm(default ~ balance, family = binomial, data = data)
summary(logistic_model)  # Viewing the model summary

Call:
glm(formula = default ~ balance, family = binomial, data = data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.2040989  0.4981284  -10.45   <2e-16 ***
balance      0.0033042  0.0003188   10.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 687.73  on 499  degrees of freedom
Residual deviance: 509.84  on 498  degrees of freedom
AIC: 513.84

Number of Fisher Scoring iterations: 5
# Interpreting Coefficients
# Logit(p(X)) = β0 + β1 * X
# Odds = e^(β0 + β1 * X)
# For balance = 2000, I calculate the predicted probability
log_odds <- coef(logistic_model)[1] + coef(logistic_model)[2] * 2000
odds <- exp(log_odds)  # Calculating the odds
probability <- odds / (1 + odds)  # Converting odds to probability
probability
(Intercept) 
  0.8028732 
cat("Predicted probability of default for a balance of $2000: ", round(probability, 3), "\n")
Predicted probability of default for a balance of $2000:  0.803 
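As a quick cross-check (an extra step, not part of the original walkthrough), the same value can be obtained directly with predict(), which agrees with the manual calculation of about 0.803:

# Cross-check of the manual odds calculation using predict()
predict(logistic_model, newdata = data.frame(balance = 2000), type = "response")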
# Step 4: Visualization of Predicted Probabilities
# Creating a sequence of balance values for prediction
balance_seq <- seq(min(data$balance), max(data$balance), length.out = 1000)
predicted_prob <- predict(logistic_model, newdata = data.frame(balance = balance_seq), type = "response")

# Plotting the logistic curve
ggplot(data, aes(x = balance, y = default)) +
  geom_jitter(height = 0.1, alpha = 0.5) +
  geom_line(data = data.frame(balance = balance_seq, predicted_prob = predicted_prob),
            aes(x = balance, y = predicted_prob), color = "blue", linewidth = 1) +
  labs(
    title = "Logistic Regression Probability of Default",
    x = "Balance",
    y = "Probability of Default"
  ) +
  theme_minimal()

# Step 5: Model Validation
# Checking goodness-of-fit using McFadden's pseudo R-squared
null_model <- glm(default ~ 1, family = binomial, data = data)
pseudo_r2 <- 1 - (logLik(logistic_model) / logLik(null_model))
cat("McFadden's Pseudo R-squared: ", round(pseudo_r2, 3), "\n")
McFadden's Pseudo R-squared:  0.259 
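An equivalent calculation (a small cross-check I am adding here, not in the original steps) uses the deviances stored in the fitted glm object. For 0/1 Bernoulli responses the deviance is -2 times the log-likelihood, so this matches the log-likelihood version above (about 0.259):

# Deviance-based McFadden's pseudo R-squared
1 - logistic_model$deviance / logistic_model$null.deviance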
# Step 6: Predictions
# Classify based on a 0.5 probability threshold
data <- data %>%
  mutate(
    predicted_prob = predict(logistic_model, type = "response"),
    predicted_class = ifelse(predicted_prob > 0.5, 1, 0)
  )

# Confusion Matrix
confusion_matrix <- table(Actual = data$default, Predicted = data$predicted_class)
confusion_matrix
      Predicted
Actual   0   1
     0 226  50
     1  73 151
# Accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy of the Logistic Regression Model: ", round(accuracy, 3), "\n")
Accuracy of the Logistic Regression Model:  0.754 
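Accuracy alone can hide an imbalance between the two error types, so a short addition (not in the original steps) is to compute sensitivity and specificity from the confusion matrix above; with these counts they work out to roughly 0.67 and 0.82:

# Sensitivity (true positive rate) and specificity (true negative rate)
sensitivity <- confusion_matrix["1", "1"] / sum(confusion_matrix["1", ])
specificity <- confusion_matrix["0", "0"] / sum(confusion_matrix["0", ])
cat("Sensitivity:", round(sensitivity, 3), " Specificity:", round(specificity, 3), "\n")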

Data Simulation:

When I approached this analysis, I started by simulating a dataset that mirrors real-world scenarios. I used balance as a continuous predictor variable, representing credit card balances, and default as a binary outcome, indicating whether an individual defaults on their payment. This setup allowed me to control the relationship between these variables through the logistic function. I specifically chose the logistic function because it naturally captures probabilities within the range [0, 1], ensuring that my outcomes remained realistic and interpretable.
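One small sanity check I could have added at this stage (it is not shown in the code above) is to compare the empirical default rate against the average of the true probabilities used to generate the data:

# Assumed extra check: empirical default rate vs. the data-generating probabilities
true_prob <- 1 / (1 + exp(-(-5 + 0.003 * balance)))
mean(default)    # observed default rate in the simulated data
mean(true_prob)  # average probability implied by the simulation model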

Exploratory Data Analysis:

I believe it’s always essential to understand the data before diving into modeling. To achieve this, I visualized the raw data using a jitter plot, which helped me see how balance values distribute between those who defaulted and those who didn’t. The jitter ensured that overlapping points became distinguishable, making patterns clearer. This visualization gave me a solid intuition about the potential relationship between balance and the likelihood of default.
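A complementary view I sometimes add (my own extension, not in the code above) is a boxplot of balance by default status, which makes the separation between the two groups easier to judge:

# Supplementary EDA: balance distribution by default status
ggplot(data, aes(x = factor(default), y = balance)) +
  geom_boxplot() +
  labs(x = "Default (0 = No, 1 = Yes)", y = "Balance") +
  theme_minimal()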

Logistic Regression Model:

The logistic regression model I fit is the heart of this analysis. It predicts the probability of default using the logistic function:

p(X) = e^(β0 + β1 * X) / (1 + e^(β0 + β1 * X))

This formula transforms the linear relationship into an S-shaped curve, ensuring that probabilities stay between 0 and 1. I specifically chose this model because linear regression can predict probabilities outside this range, which isn't meaningful for binary outcomes. The coefficient β0 represents the baseline log-odds of default, and β1 represents the change in the log-odds for each one-unit increase in balance. I found the interpretability of these coefficients invaluable when discussing results.
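One way to make that interpretation concrete (a small extension of the code shown earlier) is to exponentiate the fitted coefficients, which converts log-odds into odds ratios; with the fitted slope of about 0.0033, a $100 increase in balance multiplies the odds of default by roughly exp(0.33) ≈ 1.39:

# Odds ratios: exponentiate the fitted coefficients
exp(coef(logistic_model))                    # per-dollar odds ratio for balance
exp(coef(logistic_model)["balance"] * 100)   # odds ratio for a $100 increase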

Visualization:

To make my model’s predictions more tangible, I overlaid the predicted probabilities onto the jitter plot. This additional layer brought the analysis to life, as the S-shaped logistic curve became visible against the data points. Seeing how the curve smoothly transitions between probabilities close to 0 and 1 for increasing balance values affirmed the appropriateness of the logistic model for this dataset.

Model Validation:

Validation is critical to ensure that my model is both meaningful and reliable. I calculated McFadden's pseudo R-squared as a measure of goodness-of-fit. Unlike the traditional R-squared in linear regression, McFadden's pseudo R-squared is tailored to likelihood-based models such as logistic regression: it compares the log-likelihood of the fitted model with that of an intercept-only model. This metric, combined with other diagnostics, helped me gauge the model's effectiveness.

Predictions:

To make practical use of the model, I generated predictions and classified individuals as either defaulters or non-defaulters using a probability threshold of 0.5. I then constructed a confusion matrix to evaluate how well the model performed in categorizing the data. This step was particularly rewarding, as it quantified the model's accuracy and highlighted areas for potential improvement.
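Because the 0.5 cutoff is a choice rather than a requirement, a quick sketch like the one below (my own extension, not part of the analysis above) shows how accuracy shifts as the threshold moves:

# Assumed extension: accuracy at several classification thresholds
sapply(c(0.3, 0.5, 0.7), function(thr) {
  pred <- ifelse(data$predicted_prob > thr, 1, 0)
  mean(pred == data$default)
})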

Results

Coefficients:

The coefficient for balance indicated how the log-odds of default changed with each one-dollar increase in balance. Specifically, every additional dollar in balance multiplied the odds of default by about e^0.0033 ≈ 1.0033, a small per-dollar effect that compounds noticeably over increases of several hundred dollars.

Predicted Probabilities:

For instance, when I input a balance of $2000 into the model, the predicted probability of default was approximately 0.80. This means that an individual with a $2000 balance has about an 80% chance of defaulting, based on the model.

Model Fit:

McFadden's pseudo R-squared of roughly 0.26 indicated that the model provided a reasonable fit for the data. This reassured me that the logistic regression was capturing meaningful patterns in the data without overfitting.

Conclusion:

This analysis confirmed the power of logistic regression for modeling binary outcomes like default. By carefully constructing and validating the model, I was able to interpret probabilities in a way that felt intuitive and actionable. Through this process, I gained deeper insights into the relationship between balance and default, illustrating the utility of statistical modeling in decision-making scenarios.