ECON 465 – Week 8 Lab: Classification for Economic Decision-Making

Author: Gül Ertan Özgüzer

Published: May 7, 2025

Lab Objectives

By the end of this lab, you will be able to:

  • Understand what classification means in economics
  • Implement logistic regression to predict loan default
  • Evaluate predictions using confusion matrices
  • Choose the right threshold for economic decisions
  • Interpret coefficient significance from model output

The Economic Question

Can we predict whether a borrower will default on their credit card payment?

Banks need to answer this question when deciding who gets a loan. In this lab, we use the Default dataset from the ISLR package. This dataset contains information on 10,000 credit card customers.


Dataset: Default

Variable | Description
default | Whether the customer defaulted: Yes or No
student | Whether the customer is a student: Yes or No
balance | Average credit card balance in USD
income | Annual income in USD

# Load packages
library(tidyverse)
library(ISLR)
library(tidymodels)

# Load data
data("Default")

# Look at the data
glimpse(Default)
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
# How many people defaulted?
table(Default$default)

  No  Yes 
9667  333 

Only 333 of the 10,000 customers defaulted, about 3.3%. A dataset like this is called imbalanced because one outcome, “No default”, is far more common than the other, “Default”.
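The share quoted above can be reproduced from the two counts printed by table(). The snippet below is a minimal sketch that re-enters those counts by hand so that it runs without the ISLR package; the object names class_counts, class_shares and imbalance_ratio are illustrative only.

```r
# Counts copied from the table(Default$default) output above
class_counts <- c(No = 9667, Yes = 333)

# Convert counts to shares of the whole sample
class_shares <- class_counts / sum(class_counts)
round(class_shares, 3)

# Number of non-defaulters per defaulter
imbalance_ratio <- class_counts[["No"]] / class_counts[["Yes"]]
round(imbalance_ratio, 1)
```

There are roughly 29 non-defaulters for every defaulter, which is why accuracy alone will turn out to be a weak yardstick later in the lab.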

# Summary statistics by default status
Default |>
  group_by(default) |>
  summarize(
    avg_balance = mean(balance),
    avg_income = mean(income),
    prop_student = mean(student == "Yes")
  )
# A tibble: 2 × 4
  default avg_balance avg_income prop_student
  <fct>         <dbl>      <dbl>        <dbl>
1 No             804.     33566.        0.291
2 Yes           1748.     32089.        0.381

Initial observation: On average, defaulters carry more than twice the balance of non-defaulters (about 1,748 vs. 804 USD), have slightly lower incomes, and are more likely to be students (38% vs. 29%).


Part 1: What Is Classification?

1.1 Classification vs. Regression

Feature | Regression | Classification
What we predict | A number, such as price or GDP | A category, such as default or not
Example question | What will the house price be? | Will this customer default?
Output | 150,000 TL | Yes or No

In regression, the outcome variable is numerical. In classification, the outcome variable is categorical.

In this lab, our outcome variable is default, which has two possible categories: Yes and No. Therefore, this is a binary classification problem.

1.2 Logistic Regression

Logistic regression is one of the most common methods for binary classification. It predicts the probability that an event will occur.

In this lab, logistic regression predicts the probability that a customer will default.

For example, the model may predict:

  • Customer A has a 2% probability of default.
  • Customer B has a 45% probability of default.
  • Customer C has a 78% probability of default.

Then we use a threshold to convert these probabilities into class predictions.
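As a small illustration of that last step, the sketch below applies a threshold to the three example customers. The helper name classify() is hypothetical and not part of any package; the probabilities are the ones listed above.

```r
# Predicted default probabilities for the three customers in the text
probs <- c(A = 0.02, B = 0.45, C = 0.78)

# Hypothetical helper: label a probability "Yes" when it exceeds the threshold
classify <- function(p, threshold = 0.5) {
  factor(ifelse(p > threshold, "Yes", "No"), levels = c("No", "Yes"))
}

classify(probs)                   # threshold 0.5
classify(probs, threshold = 0.4)  # a lower threshold flips Customer B to "Yes"
```

The probabilities themselves do not change between the two calls; only the rule that turns them into a decision does.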


Part 2: Prepare the Data

2.1 Convert Variables to Factors

Why do we convert variables to factors?

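Classification models in R expect categorical variables to be stored as factors. In the Default data both variables are already factors, as the <fct> tags in the glimpse() output show, so this step mainly makes the level order explicit: with “No” as the first level, glm() models the probability of “Yes”.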

# Convert Yes/No variables to factors
Default <- Default |>
  mutate(
    default = factor(default, levels = c("No", "Yes")),
    student = factor(student, levels = c("No", "Yes"))
  )
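To see what the levels argument does, here is a minimal sketch on a toy vector. The data are made up for illustration and are not taken from Default.

```r
# Toy vector standing in for a Yes/No column (illustrative data only)
answers <- c("Yes", "No", "No", "Yes", "No")

# Without `levels`, factor() sorts the levels alphabetically: "No" comes first
f_sorted <- factor(answers)
levels(f_sorted)

# Setting `levels` explicitly fixes the order regardless of the data
f_explicit <- factor(answers, levels = c("No", "Yes"))
levels(f_explicit)

# The integer codes show which level is first (1) and which is second (2)
as.integer(f_explicit)
```

For Yes/No data the alphabetical order happens to match the explicit one, so the explicit call is mainly a safeguard. It matters because a binomial glm() treats the first level as the "failure" and models the probability of the other level.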

2.2 Split the Data into Training and Test Sets

We need two datasets:

Dataset | Purpose
Training set | Used to build, or estimate, the model
Test set | Used to evaluate how well the model predicts new, unseen data

We will use 80% of the data for training and 20% for testing.

# Set seed for reproducibility
set.seed(465)

# Split the data: 80% training, 20% testing
default_split <- initial_split(Default, prop = 0.8)

default_train <- training(default_split)
default_test <- testing(default_split)

cat("Training set size:", nrow(default_train), "\n")
Training set size: 8000 
cat("Test set size:", nrow(default_test), "\n")
Test set size: 2000 
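For intuition about what a random 80/20 split does, the sketch below performs one by hand in base R on a toy data frame. It is not how initial_split() is implemented, and the toy data and object names are assumptions made for illustration.

```r
set.seed(465)

# Toy data frame standing in for Default (illustrative values only)
toy <- data.frame(id = 1:100, balance = runif(100, 0, 2500))

# Draw 80% of the row numbers at random for training
train_rows <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Every row lands in exactly one of the two sets
nrow(toy_train)
nrow(toy_test)
length(intersect(toy_train$id, toy_test$id))
```

With only about 3% defaulters, it is also worth knowing that initial_split() has a strata argument that keeps the share of an outcome similar in both sets; the split used in this lab does not use it.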

Part 3: Logistic Regression

3.1 How Logistic Regression Works

Logistic regression predicts the probability of default.

The output is always between 0 and 1, so it can be interpreted as a valid probability.

For example:

Predicted Probability | Interpretation
0.02 | Very low probability of default
0.35 | Moderate probability of default
0.80 | High probability of default

The conventional starting point is a threshold of 0.5:

  • If predicted probability is greater than 0.5, predict Yes.
  • If predicted probability is less than or equal to 0.5, predict No.

3.2 Why Not Use Ordinary Linear Regression?

We should not use ordinary linear regression for this problem because linear regression may produce predicted values below 0 or above 1.

For example, it could predict a default probability of -0.20 or 1.30. These values do not make sense as probabilities.

Logistic regression solves this problem by passing the linear combination of the predictors through the logistic (sigmoid) function, which maps any real number into the interval between 0 and 1.
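The mapping into the interval between 0 and 1 can be checked directly. The sketch below writes the logistic function out by hand; the name logistic() is chosen here for illustration, while plogis() is the equivalent function that ships with base R.

```r
# Logistic (sigmoid) function: maps any real number z to a value between 0 and 1
logistic <- function(z) 1 / (1 + exp(-z))

# z plays the role of the linear predictor (the log-odds)
# Large negative z gives values near 0, large positive z gives values near 1
z <- c(-5, -1, 0, 1, 5)
round(logistic(z), 3)

# Base R's plogis() computes the same function
all.equal(logistic(z), plogis(z))
```

A value of z = 0 corresponds to a probability of exactly 0.5, which is why the 0.5 threshold is the same as asking whether the log-odds are positive.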

3.3 Build the Logistic Regression Model

Now we estimate the logistic regression model using the glm() function with family = binomial.

What does glm stand for?

glm means Generalized Linear Model. It is a flexible family of regression models. When we use family = binomial, we are telling R to estimate a logistic regression model for a binary outcome.

# Logistic regression using glm()
logistic_model <- glm(
  default ~ balance + income + student,
  data = default_train,
  family = binomial
)

# View coefficients and significance
summary(logistic_model)

Call:
glm(formula = default ~ balance + income + student, family = binomial, 
    data = default_train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.078e+01  5.516e-01 -19.551  < 2e-16 ***
balance      5.743e-03  2.617e-04  21.944  < 2e-16 ***
income       2.327e-07  9.146e-06   0.025  0.97970    
studentYes  -6.882e-01  2.628e-01  -2.619  0.00881 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2286.5  on 7999  degrees of freedom
Residual deviance: 1229.5  on 7996  degrees of freedom
AIC: 1237.5

Number of Fisher Scoring iterations: 8

3.4 Interpreting Coefficients and Statistical Significance

The estimated coefficients tell us how each variable is associated with the probability of default.

Variable | Estimated Sign | Interpretation
balance | Positive | Higher balance is associated with higher default probability
income | Positive, close to zero | Income has almost no association with default once balance is included
studentYes | Negative | Students have lower default probability after controlling for balance and income

However, we should also check whether the coefficients are statistically significant.

In the model output, look at the column called Pr(>|z|). This column gives the p-value.

Significance Code | p-value Range | Meaning
*** | p < 0.001 | Very strong evidence
** | p < 0.01 | Strong evidence
* | p < 0.05 | Moderate evidence
. | p < 0.10 | Weak evidence
blank | p ≥ 0.10 | Not statistically significant

Combining the signs with the p-values from the model output gives the following interpretation:

Variable | Sign | p-value | Significant? | Interpretation
balance | Positive | < 2e-16 | Yes | Higher balance is associated with higher default risk
income | Positive | 0.980 | No | No evidence that income affects default after controlling for balance
studentYes | Negative | 0.009 | Yes | Students have lower default risk after accounting for balance

Why is income not significant?

In this dataset, once we know a customer’s credit card balance, income provides almost no additional information for predicting default.
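Because the coefficients are on the log-odds scale, one common way to make their size tangible is to exponentiate them into odds ratios. The sketch below does this with the estimates copied by hand from the summary(logistic_model) output above; the object names are illustrative, and the rounded inputs make the results approximate.

```r
# Coefficients copied from the summary(logistic_model) output above
coefs <- c(
  balance    = 5.743e-03,
  income     = 2.327e-07,
  studentYes = -6.882e-01
)

# exp(coefficient) = multiplicative change in the odds of default
# for a one-unit increase in the variable, holding the others fixed
odds_ratios <- exp(coefs)
round(odds_ratios, 4)

# A one-dollar change in balance is tiny, so rescale to a 100 USD increase
or_balance_100 <- exp(100 * coefs[["balance"]])
round(or_balance_100, 2)
```

Read this way, an extra 100 USD of balance multiplies the odds of default by roughly 1.78, a student has about half the odds of a non-student with the same balance and income, and the odds ratio for income is indistinguishable from 1.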

3.5 Make Predictions on the Test Data

Now that we have estimated the model using the training data, we use it to make predictions on the test data.

We need two types of predictions:

Prediction Type | Description
Predicted probabilities | Continuous values between 0 and 1
Predicted classes | Yes or No predictions based on a threshold

What does type = "response" mean?

It tells R to return predicted probabilities between 0 and 1 instead of raw log-odds.

# Predict probabilities on test data
logistic_probs <- predict(logistic_model, default_test, type = "response")

# Convert probabilities to Yes/No using threshold 0.5
logistic_pred <- ifelse(logistic_probs > 0.5, "Yes", "No")

# Convert predictions to factor
logistic_pred <- factor(logistic_pred, levels = c("No", "Yes"))

# View first few predictions
head(data.frame(
  Actual = default_test$default,
  Probability = round(logistic_probs, 3),
  Predicted = logistic_pred
))
  Actual Probability Predicted
1     No       0.001        No
2     No       0.002        No
3     No       0.002        No
4     No       0.012        No
5     No       0.000        No
6     No       0.000        No
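To make the difference between log-odds and probabilities concrete, the sketch below computes one prediction by hand. The coefficients are copied from the summary(logistic_model) output; the customer is hypothetical, with values made up for illustration, and because the coefficients are rounded the result is approximate.

```r
# Coefficients copied from the summary(logistic_model) output above
b0        <- -1.078e+01
b_balance <-  5.743e-03
b_income  <-  2.327e-07
b_student <- -6.882e-01

# Hypothetical customer (made-up values): non-student, 2,000 USD balance, 40,000 USD income
balance <- 2000
income  <- 40000
student <- 0

# Step 1: the linear predictor is the log-odds (what predict() returns by default, type = "link")
log_odds <- b0 + b_balance * balance + b_income * income + b_student * student

# Step 2: the logistic transform turns log-odds into a probability (type = "response")
prob <- plogis(log_odds)

round(c(log_odds = log_odds, probability = prob), 3)
```

For this customer the log-odds are about 0.72 and the probability about 0.67, so the 0.5 threshold would classify the customer as “Yes”.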

Part 4: How Good Are Our Predictions?

4.1 Confusion Matrix

A confusion matrix compares the model’s predictions to the actual outcomes.

It shows four numbers:

               | Actual: No      | Actual: Yes
Predicted: No  | True Negatives  | False Negatives
Predicted: Yes | False Positives | True Positives

In words:

Component | Meaning
True Negative | The model correctly predicts no default
False Positive | The model predicts default, but the customer does not default
False Negative | The model predicts no default, but the customer actually defaults
True Positive | The model correctly predicts default

# Create confusion matrix
confusion <- table(
  Predicted = logistic_pred,
  Actual = default_test$default
)

confusion
         Actual
Predicted   No  Yes
      No  1917   47
      Yes    9   27

4.2 Calculate Performance Metrics

Now we calculate three important metrics from the confusion matrix.

# Extract the four numbers from the confusion matrix
TN <- confusion["No", "No"]     # True Negatives
FP <- confusion["Yes", "No"]    # False Positives
FN <- confusion["No", "Yes"]    # False Negatives
TP <- confusion["Yes", "Yes"]   # True Positives

# Accuracy = correct predictions / total predictions
accuracy <- (TP + TN) / (TP + TN + FP + FN)

# Precision = TP / (TP + FP)
# When we predict "Default", how often are we right?
precision <- ifelse(TP + FP > 0, TP / (TP + FP), 0)

# Recall = TP / (TP + FN)
# How many actual defaulters did we catch?
recall <- TP / (TP + FN)

cat("Accuracy:", round(accuracy, 3), "\n")
Accuracy: 0.972 
cat("Precision:", round(precision, 3), "\n")
Precision: 0.75 
cat("Recall:", round(recall, 3), "\n")
Recall: 0.365 

4.3 Understanding the Metrics for a Bank

Metric | Question | What it means for the bank
Accuracy | What percentage of predictions were correct? | Overall performance
Precision | When we predict default, are we right? | Helps avoid rejecting good customers
Recall | What percentage of actual defaulters did we catch? | Helps avoid giving loans to bad customers

Accuracy looks very high in this dataset largely because about 97% of customers do not default, so even a model that never predicts default would score well.

However, a bank is especially interested in identifying customers who are likely to default. Therefore, recall and precision are very important.
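One way to put the accuracy figure in perspective is to compare it with a naive rule that predicts “No” for every customer. The sketch below rebuilds the threshold-0.5 confusion matrix from the counts printed in Section 4.1 so that it runs on its own; the object names are illustrative.

```r
# Confusion matrix counts copied from the threshold-0.5 output above
# (matrix() fills column by column: first the "Actual: No" column, then "Actual: Yes")
confusion_05 <- matrix(
  c(1917, 9, 47, 27),
  nrow = 2,
  dimnames = list(Predicted = c("No", "Yes"), Actual = c("No", "Yes"))
)

n_total <- sum(confusion_05)

# Accuracy of the fitted model: correct predictions sit on the diagonal
model_accuracy <- sum(diag(confusion_05)) / n_total

# Accuracy of a naive rule that predicts "No" for everyone:
# it is right exactly for the customers who did not default
baseline_accuracy <- sum(confusion_05[, "No"]) / n_total

round(c(model = model_accuracy, always_no = baseline_accuracy), 3)
```

The naive rule reaches 0.963 on this test set without catching a single defaulter, so the model's 0.972 is a much smaller improvement than it first appears.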


Part 5: Choosing a Threshold

5.1 What Is a Threshold?

We used 0.5 as the threshold.

This means:

  • If predicted probability is greater than 0.5, predict default.
  • If predicted probability is less than or equal to 0.5, predict no default.

However, we can choose a different threshold.

5.2 The Threshold Trade-off

Threshold Choice | Effect
Lower threshold, such as 0.2 | Predict default more often
Higher threshold, such as 0.7 | Predict default less often

A lower threshold usually increases recall because the model catches more actual defaulters. However, it also creates more false alarms, meaning that some good customers may be rejected.

A higher threshold usually increases precision because the model predicts default only when it is more certain. However, it may miss more actual defaulters.
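The trade-off can be explored by sweeping over several thresholds. The sketch below uses simulated scores and outcomes rather than the lab's data, so the exact numbers carry no economic meaning; the helper metrics_at() and all object names are hypothetical.

```r
set.seed(465)

# Simulated stand-ins (not the lab's data): 1,000 outcomes with roughly 5% positives,
# and scores that tend to be higher for the positives
actual <- rbinom(1000, size = 1, prob = 0.05)
scores <- plogis(rnorm(1000, mean = ifelse(actual == 1, 0, -3), sd = 1.5))

# Hypothetical helper: number flagged, precision and recall at one threshold
metrics_at <- function(threshold, probs, truth) {
  predicted <- probs > threshold
  tp <- sum(predicted & truth == 1)
  fp <- sum(predicted & truth == 0)
  fn <- sum(!predicted & truth == 1)
  c(
    threshold = threshold,
    flagged   = tp + fp,
    precision = if (tp + fp > 0) tp / (tp + fp) else NA_real_,
    recall    = tp / (tp + fn)
  )
}

# One row per threshold
threshold_table <- t(sapply(c(0.2, 0.5, 0.7), metrics_at, probs = scores, truth = actual))
round(threshold_table, 3)
```

Raising the threshold can only reduce the number of customers flagged and therefore can only lower recall. Precision usually rises at the same time, but that is a tendency rather than a guarantee.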

5.3 Try a Lower Threshold

# Use threshold 0.2 instead of 0.5
logistic_pred_low <- ifelse(logistic_probs > 0.2, "Yes", "No")

logistic_pred_low <- factor(logistic_pred_low, levels = c("No", "Yes"))

# New confusion matrix with lower threshold
confusion_low <- table(
  Predicted = logistic_pred_low,
  Actual = default_test$default
)

confusion_low
         Actual
Predicted   No  Yes
      No  1866   29
      Yes   60   45
# Calculate recall with the lower threshold
TP_low <- confusion_low["Yes", "Yes"]
FN_low <- confusion_low["No", "Yes"]

recall_low <- TP_low / (TP_low + FN_low)

cat("Recall with threshold 0.2:", round(recall_low, 3), "\n")
Recall with threshold 0.2: 0.608 
cat("Recall with threshold 0.5:", round(recall, 3), "\n")
Recall with threshold 0.5: 0.365 
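Recall is only one side of the comparison. The sketch below applies the formulas from Section 4.2 to both thresholds at once, using the counts copied from the two confusion matrices printed above; the data frame name counts is illustrative.

```r
# Counts copied from the two confusion matrices printed above
counts <- data.frame(
  threshold = c(0.5, 0.2),
  TN = c(1917, 1866),
  FP = c(9, 60),
  FN = c(47, 29),
  TP = c(27, 45)
)

# Same formulas as in Section 4.2, applied to both thresholds
counts$accuracy  <- with(counts, (TP + TN) / (TP + TN + FP + FN))
counts$precision <- with(counts, TP / (TP + FP))
counts$recall    <- with(counts, TP / (TP + FN))

round(counts[, c("threshold", "accuracy", "precision", "recall")], 3)
```

Lowering the threshold to 0.2 raises recall from 0.365 to 0.608, but precision falls from 0.75 to about 0.43 and accuracy slips from 0.972 to about 0.96.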

5.4 Which Threshold Should a Bank Choose?

Bank’s Priority | Better Threshold Choice | Why
Avoid giving loans to defaulters | Lower threshold, such as 0.2 | Catches more defaulters, even if some good customers are rejected
Avoid rejecting good customers | Higher threshold, such as 0.5 or 0.7 | Predicts default only when more confident, but misses more defaulters

Real banks use complex profit models to find the threshold that maximizes expected profit. They balance the cost of false positives, which means rejecting a good customer, against the cost of false negatives, which means giving a loan to a customer who defaults.
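A heavily simplified version of that cost comparison can be sketched with the two confusion matrices from this lab. The error counts are copied from the outputs above, but the two cost figures are purely hypothetical assumptions chosen for illustration, not estimates of any real bank's costs.

```r
# Error counts copied from the two confusion matrices in this lab
errors <- data.frame(
  threshold = c(0.5, 0.2),
  FP = c(9, 60),    # good customers rejected
  FN = c(47, 29)    # defaulters approved
)

# Hypothetical costs per error, in USD (assumptions for illustration only)
cost_fp <- 500    # profit lost by rejecting a good customer
cost_fn <- 5000   # loss from lending to a customer who defaults

# Total cost of the errors made at each threshold
errors$total_cost <- errors$FP * cost_fp + errors$FN * cost_fn
errors

# Ratio cost_fn / cost_fp at which the two thresholds break even:
# extra false positives * cost_fp = avoided false negatives * cost_fn
break_even_ratio <- (60 - 9) / (47 - 29)
round(break_even_ratio, 2)
```

Under these assumed costs the 0.2 threshold is cheaper (175,000 vs. 239,500 USD). More generally, between these two thresholds on this test set, the lower one wins whenever a missed default costs more than about 2.8 times as much as a rejected good customer.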


Part 6: Your Turn – Simple Practice

Task 1: Compare Model Complexity

Build a logistic regression model using only balance as a predictor.

Compare its accuracy with the full model, which uses balance, income, and student.

# Logistic regression with only balance
logistic_simple <- glm(
  default ~ balance,
  data = default_train,
  family = binomial
)

# Predict probabilities on test data
simple_probs <- predict(logistic_simple, default_test, type = "response")

# Convert probabilities to Yes/No predictions using threshold 0.5
simple_pred <- ifelse(simple_probs > 0.5, "Yes", "No")

simple_pred <- factor(simple_pred, levels = c("No", "Yes"))

# Calculate accuracy
simple_accuracy <- mean(simple_pred == default_test$default)

cat("Full model accuracy:", round(accuracy, 3), "\n")
Full model accuracy: 0.972 
cat("Balance-only accuracy:", round(simple_accuracy, 3), "\n")
Balance-only accuracy: 0.97 

Question: Does adding income and student improve accuracy? Given that income was not statistically significant, would you expect it to help?

Write your answer here:

Task 2: Interpret Your Results

Look at the coefficients from your simple model using only balance.

summary(logistic_simple)

Call:
glm(formula = default ~ balance, family = binomial, data = default_train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.070e+01  4.095e-01  -26.14   <2e-16 ***
balance      5.524e-03  2.498e-04   22.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2286.5  on 7999  degrees of freedom
Residual deviance: 1247.8  on 7998  degrees of freedom
AIC: 1251.8

Number of Fisher Scoring iterations: 8
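One way to read the balance-only coefficients in dollar terms is to ask at which balance the predicted probability crosses 0.5. The sketch below uses the two estimates copied from the summary(logistic_simple) output above; the object names are illustrative, and the rounded coefficients make the result approximate.

```r
# Coefficients copied from the summary(logistic_simple) output above
intercept <- -1.070e+01
slope     <-  5.524e-03

# Predicted default probability at a few balances (USD)
balances <- c(500, 1000, 1500, 2000, 2500)
round(plogis(intercept + slope * balances), 3)

# The probability equals 0.5 where the log-odds are zero: balance = -intercept / slope
balance_at_half <- -intercept / slope
round(balance_at_half)
```

With these estimates the crossing point is a balance of roughly 1,937 USD, so under the 0.5 threshold the balance-only model flags a customer only once the balance exceeds that level.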

Answer the following questions:

  1. Is the balance coefficient positive?
  2. Is the balance coefficient statistically significant?
  3. What does this mean in economic terms?

Write your observations here:


Summary: What We Learned Today

Logistic Regression Summary

Concept | Key Idea
GLM | Generalized Linear Model. glm() with family = binomial gives logistic regression
Classification | Predicting categories, such as Yes or No, rather than numbers
Training vs. test set | Train on 80%, evaluate on 20% to assess how well the model generalizes
type = "response" | Tells R to return probabilities from 0 to 1 instead of log-odds
Confusion matrix | Compares predictions to actual outcomes
Accuracy | Overall correct predictions divided by total predictions
Precision | When we say “Default”, how often are we right?
Recall | How many actual defaulters did we catch?
Threshold | Cutoff for predicting default
Statistical significance | p-values and stars in the model output show how strong the evidence is that a coefficient differs from zero

Key Takeaways

Balance is the most important predictor of credit default. It is highly statistically significant in both models (p < 2e-16).

Income is not statistically significant in the full model. Once we know balance, income adds little additional information.

Students have lower default risk after accounting for their higher balances.

Accuracy alone can be misleading, especially for imbalanced data.

For imbalanced data, recall and precision are often more informative than accuracy.

Threshold choice depends on bank priorities. There is no single correct threshold.

Take-Home Message

Banks must balance catching defaulters, which is measured by recall, against rejecting good customers, which is related to precision.

There is no perfect model or threshold. The right choice depends on the economic costs of errors.

Always check statistical significance before interpreting coefficients.


Glossary of Functions

Function | What it does
initial_split() | Splits data into training and test sets
training() | Extracts the training data
testing() | Extracts the test data
glm(..., family = binomial) | Builds a logistic regression model
predict(type = "response") | Gets predicted probabilities from a logistic regression model
ifelse() | Converts probabilities into Yes/No predictions based on a threshold
table() | Creates a confusion matrix
summary(model) | Shows coefficients, p-values, and significance stars
round() | Rounds numerical values
cat() | Prints text and results clearly