ECON 465 – Week 8 Lab: Classification for Economic Decision-Making
Author: Gül Ertan Özgüzer
Published: April 30, 2025
Lab Objectives
By the end of this lab, you will be able to:
Distinguish between classification and regression problems
Implement logistic regression for binary economic outcomes using the Default dataset
Evaluate classifiers using confusion matrices, accuracy, precision, recall, and ROC curves
Choose appropriate probability thresholds for economic decision-making
Understand the challenges of class imbalance in economic data
The Economic Question
Can we predict whether a borrower will default on their credit card payment based on their financial characteristics? How can banks use such predictions to make lending decisions that balance profit and risk?
In this lab, we use the Default dataset from the ISLR package, a simulated dataset on credit card default. This is a binary classification problem: the outcome is either “Yes” (default) or “No” (no default).
Dataset: Default (from ISLR)
The Default dataset contains information on 10,000 credit card customers. The variables are:
| Variable | Description |
|---|---|
| default | Whether the customer defaulted (Yes/No) |
| student | Whether the customer is a student (Yes/No) |
| balance | Average credit card balance (in USD) |
| income | Annual income (in USD) |
# Load required packages
library(tidyverse)
library(tidymodels)
library(ISLR)
library(ggplot2)
library(yardstick)

# Load the dataset
data("Default")

# Look at the data
glimpse(Default)
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
# Check the proportion of default
table(Default$default)
No Yes
9667 333
prop.table(table(Default$default))
No Yes
0.9667 0.0333
Note: Only about 3.3% of customers default – this is an imbalanced dataset, which we will discuss later.
Initial Observation: Defaulters have higher average balances, slightly lower incomes, and are more likely to be students.
Part 1: Binary Classification – What Is It?
1.1 Classification vs. Regression
| | Regression | Classification |
|---|---|---|
| Outcome | Continuous (e.g., price, GDP) | Categorical (e.g., default/no default) |
| Goal | Predict a number | Predict a class |
| Example | “What will the house price be?” | “Will this borrower default?” |
| Evaluation | RMSE, R² | Accuracy, Precision, Recall |
In this lab, we focus on binary classification – only two possible outcomes: “Yes” (default) or “No” (no default).
1.2 Why Not Linear Regression?
We could try to predict default using linear regression (coding Yes=1, No=0). But:
Predicted values might fall outside [0,1] (e.g., -0.2 or 1.5), which makes no sense for a probability.
The relationship between predictors and probability is rarely linear.
Logistic regression solves this by modeling the probability of default, not the class itself.
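The first point can be checked directly. The sketch below is a minimal illustration, not part of the lab's main pipeline: it uses base R `lm()` and `glm()` with `balance` as the only predictor on the full dataset, and the object names (`default01`, `lpm_fit`, `logit_fit`) are chosen here for illustration.

```r
library(ISLR)  # provides the Default data used in this lab

data("Default")

# Code the outcome as 0/1 so that lm() can treat it as a number
default01 <- as.numeric(Default$default == "Yes")

# Linear probability model vs. logistic regression (balance only, full data)
lpm_fit   <- lm(default01 ~ balance, data = Default)
logit_fit <- glm(default ~ balance, data = Default, family = binomial)

lpm_pred   <- fitted(lpm_fit)    # fitted values from the linear model
logit_pred <- fitted(logit_fit)  # fitted probabilities of default = "Yes"

# Compare the ranges of the fitted values
range(lpm_pred)
range(logit_pred)

# Share of customers whose fitted "probability" from the linear model is negative
mean(lpm_pred < 0)
```

The linear model assigns negative fitted values to customers with very low balances, while the logistic fitted values always stay strictly between 0 and 1.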
Part 2: Exploratory Data Analysis for Classification
Before building models, let’s visualize the relationships.
# Balance distribution by default status
ggplot(Default, aes(x = default, y = balance, fill = default)) +
  geom_boxplot() +
  labs(
    title = "Credit Card Balance by Default Status",
    x = "Default",
    y = "Balance (USD)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Income distribution by default status
ggplot(Default, aes(x = default, y = income, fill = default)) +
  geom_boxplot() +
  labs(
    title = "Income by Default Status",
    x = "Default",
    y = "Income (USD)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Scatterplot: Balance vs. Income, colored by default
ggplot(Default, aes(x = balance, y = income, color = default)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Default Status: Balance vs. Income",
    x = "Balance (USD)",
    y = "Income (USD)"
  ) +
  theme_minimal()
Observations: Defaulters tend to have higher balances and appear more concentrated in the low-income, high-balance region.
Part 3: Logistic Regression
3.1 How Logistic Regression Works
Instead of modeling default directly, logistic regression models the log-odds of default:
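Written out for the three predictors used in this lab, the standard form of the model is as follows. The symbols $p(X)$ (the probability that default = Yes given the predictors) and $\beta_0, \dots, \beta_3$ are notation assumed here.

```latex
\log\!\left(\frac{p(X)}{1 - p(X)}\right)
  = \beta_0 + \beta_1\,\text{balance} + \beta_2\,\text{income} + \beta_3\,\text{student}
\qquad\Longleftrightarrow\qquad
p(X) = \frac{e^{\beta_0 + \beta_1\,\text{balance} + \beta_2\,\text{income} + \beta_3\,\text{student}}}
            {1 + e^{\beta_0 + \beta_1\,\text{balance} + \beta_2\,\text{income} + \beta_3\,\text{student}}}
```

The left-hand form is linear in the predictors; the right-hand form shows that the implied probability always lies between 0 and 1.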
# Set seed for reproducibility
set.seed(465)

# Split the data (80% training, 20% testing)
default_split <- initial_split(Default, prop = 0.8)
default_train <- training(default_split)
default_test <- testing(default_split)

cat("Training set size:", nrow(default_train), "\n")
Training set size: 8000
cat("Test set size:", nrow(default_test), "\n")
Test set size: 2000
3.4 Train Logistic Regression Model
# Fit logistic regression
logistic_model <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification") |>
  fit(default ~ balance + income + student, data = default_train)

# View model coefficients
tidy(logistic_model)
Students have lower default probability (holding balance constant)
Wait: Earlier we saw students default more often? That’s because students have higher balances on average. After controlling for balance, being a student actually reduces default risk.
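The sign reversal described above can be reproduced with two small models. This is a sketch using base R `glm()` on the full dataset rather than the training split, so the coefficients will differ slightly from the table above; the names `fit_student_only` and `fit_full` are illustrative.

```r
library(ISLR)  # provides the Default data used in this lab

data("Default")

# Students carry higher balances on average
mean_balance <- tapply(Default$balance, Default$student, mean)
mean_balance

# Model 1: student status only (no control for balance)
fit_student_only <- glm(default ~ student, data = Default, family = binomial)

# Model 2: student status while holding balance and income constant
fit_full <- glm(default ~ balance + income + student,
                data = Default, family = binomial)

# Compare the coefficient on studentYes across the two models
coef_alone   <- coef(fit_student_only)["studentYes"]
coef_control <- coef(fit_full)["studentYes"]
c(student_only = unname(coef_alone), with_controls = unname(coef_control))
```

The coefficient is positive when student status enters alone and negative once balance is included, which is the pattern the paragraph above describes.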
3.5 Making Predictions
# Predict on test set
test_predictions <- default_test |>
  bind_cols(predict(logistic_model, default_test, type = "prob")) |>
  bind_cols(predict(logistic_model, default_test, type = "class"))

# View first few predictions
test_predictions |>
  select(default, .pred_No, .pred_Yes, .pred_class) |>
  head()
default .pred_No .pred_Yes .pred_class
1 No 0.9988616 1.138365e-03 No
2 No 0.9980995 1.900505e-03 No
3 No 0.9976194 2.380629e-03 No
4 No 0.9877215 1.227853e-02 No
5 No 0.9999790 2.096526e-05 No
6 No 0.9997838 2.162200e-04 No
.pred_No: predicted probability of no default
.pred_Yes: predicted probability of default
.pred_class: predicted class using default threshold of 0.5
The bank’s problem: The model misses 23 out of 27 actual defaulters (low recall). This could be costly for the bank.
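To make the link between a confusion matrix and recall concrete, here is a small sketch on made-up data (the ten rows in `toy` are hypothetical and are not the lab's test set). It uses yardstick with `event_level = "second"`, because "Yes" is the second factor level and yardstick treats the first level as the event by default.

```r
library(yardstick)
library(tibble)

# Hypothetical example: 4 actual defaulters, 6 non-defaulters
toy <- tibble(
  actual = factor(
    c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
    levels = c("No", "Yes")
  ),
  predicted = factor(
    c("Yes", "No", "No", "No", "Yes", "No", "No", "No", "No", "No"),
    levels = c("No", "Yes")
  )
)

# Confusion matrix: 1 true positive, 3 false negatives,
# 1 false positive, 5 true negatives
conf_mat(toy, truth = actual, estimate = predicted)

# Accuracy = (1 + 5) / 10; recall = 1 / (1 + 3); precision = 1 / (1 + 1)
acc  <- accuracy(toy, truth = actual, estimate = predicted)
rec  <- recall(toy, truth = actual, estimate = predicted, event_level = "second")
prec <- precision(toy, truth = actual, estimate = predicted, event_level = "second")

c(accuracy = acc$.estimate, recall = rec$.estimate, precision = prec$.estimate)
```

The same three calls can be applied to `test_predictions` with `truth = default` and `estimate = .pred_class`.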
3.8 Choosing a Threshold
The default threshold is 0.5, but we can adjust it to balance precision and recall.
# Histogram of predicted probabilities
ggplot(test_predictions, aes(x = .pred_Yes)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
  geom_vline(xintercept = 0.5, color = "red", linetype = "dashed") +
  labs(
    title = "Distribution of Predicted Default Probabilities",
    subtitle = "Red line: default threshold (0.5). Most probabilities are near 0.",
    x = "Predicted Probability of Default",
    y = "Number of Loans"
  ) +
  theme_minimal()
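One way to explore the effect of the threshold is to recompute recall and precision at several cutoffs. The sketch below is self-contained and uses base R rather than the tidymodels objects above, so its numbers will not match the lab's split exactly. The helper `threshold_metrics()` and the two cost figures (`cost_missed`, `cost_false_alarm`) are hypothetical and only illustrate how a bank might weigh the two kinds of error.

```r
library(ISLR)  # provides the Default data used in this lab

data("Default")

# Simple 80/20 split in base R
set.seed(465)
test_idx <- sample(nrow(Default), size = 2000)
train <- Default[-test_idx, ]
test  <- Default[test_idx, ]

fit   <- glm(default ~ balance + income + student, data = train, family = binomial)
p_yes <- predict(fit, newdata = test, type = "response")  # P(default = "Yes")

# Recall, precision, and a hypothetical cost for the "Yes" class at one cutoff.
# The cost figures are assumptions: a missed defaulter is taken to cost
# 10 times as much as wrongly flagging a good customer.
threshold_metrics <- function(p, truth, cutoff,
                              cost_missed = 10, cost_false_alarm = 1) {
  pred_yes <- p >= cutoff
  is_yes   <- truth == "Yes"
  tp <- sum(pred_yes & is_yes)
  fn <- sum(!pred_yes & is_yes)
  fp <- sum(pred_yes & !is_yes)
  c(
    cutoff    = cutoff,
    flagged   = sum(pred_yes),
    recall    = tp / (tp + fn),
    precision = if (tp + fp > 0) tp / (tp + fp) else NA_real_,
    cost      = fn * cost_missed + fp * cost_false_alarm
  )
}

cutoffs <- c(0.5, 0.3, 0.2, 0.1)
results <- t(sapply(cutoffs, threshold_metrics, p = p_yes, truth = test$default))
results
```

Lowering the cutoff flags more customers, so recall can only rise or stay the same, typically at the expense of precision. Which cutoff is best depends on the relative costs the bank assigns to each error.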
Discussion: Logistic regression often performs well on this dataset because the relationship between the predictors and the log-odds of default is roughly linear. k-NN can capture non-linear patterns but requires careful tuning and feature scaling.
Part 5: ROC Curves and AUC
5.1 What Is an ROC Curve?
The ROC (Receiver Operating Characteristic) curve plots:
True Positive Rate (Recall) on the y-axis
False Positive Rate (1 - Specificity) on the x-axis
It shows how the classifier performs across all possible thresholds.
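A minimal sketch of how the curve and its AUC can be computed with yardstick follows. It refits the lab's model so that it runs on its own; the object names (`roc_split`, `roc_model`, `roc_input`) are illustrative, and `event_level = "second"` is needed because "Yes" is the second level of `default`.

```r
library(tidymodels)
library(ISLR)  # provides the Default data used in this lab

data("Default")

# Refit the logistic model on an 80/20 split
set.seed(465)
roc_split <- initial_split(Default, prop = 0.8)

roc_model <- logistic_reg() |>
  set_engine("glm") |>
  fit(default ~ balance + income + student, data = training(roc_split))

# Attach predicted probabilities to the test set
roc_input <- testing(roc_split) |>
  bind_cols(predict(roc_model, testing(roc_split), type = "prob"))

# One row per threshold, with the sensitivity and specificity at that threshold
roc_points <- roc_curve(roc_input, truth = default, .pred_Yes,
                        event_level = "second")

# Area under the curve: 0.5 is chance level, 1 is perfect ranking
auc <- roc_auc(roc_input, truth = default, .pred_Yes, event_level = "second")
auc

# Plot the curve
autoplot(roc_points)
```

Each point on the plotted curve corresponds to one threshold, so moving along the curve is the same as sliding the cutoff applied to `.pred_Yes`.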
Only 3.3% of observations are defaults. A model that predicts “No” for everyone would have 96.7% accuracy but would be useless for identifying defaulters.
# What if we predicted "No" for everyone?
baseline_accuracy <- nrow(filter(Default, default == "No")) / nrow(Default)
cat("Baseline accuracy (predict 'No' for all):", round(baseline_accuracy, 3), "\n")
Baseline accuracy (predict 'No' for all): 0.967
Recall = 0 for the baseline model – it catches no defaulters.
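The zero-recall claim can be verified in a few lines of base R. In this sketch the all-"No" prediction vector `baseline_pred` is constructed here purely for illustration.

```r
library(ISLR)  # provides the Default data used in this lab

data("Default")

truth <- Default$default

# A "model" that predicts "No" for every customer
baseline_pred <- factor(rep("No", length(truth)), levels = levels(truth))

# Accuracy: share of customers classified correctly
baseline_acc <- mean(baseline_pred == truth)

# Recall: share of actual defaulters that the model flags
baseline_recall <- sum(baseline_pred == "Yes" & truth == "Yes") / sum(truth == "Yes")

c(accuracy = baseline_acc, recall = baseline_recall)
```

High accuracy and zero recall together show why accuracy alone is a poor guide when one class is rare.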
6.2 Addressing Imbalance
Options for handling class imbalance:
Use appropriate metrics (precision, recall, F1, AUC) rather than accuracy
Adjust the threshold (as we did earlier)
Collect more data (if possible)
Use sampling techniques (SMOTE, upsampling, downsampling) – beyond this lab
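# F1 score (harmonic mean of precision and recall)
# event_level = "second" makes "Yes" (the second factor level) the event of interest;
# by default yardstick would score the first level, "No"
test_predictions |>
  f_meas(truth = default, estimate = .pred_class, event_level = "second")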
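For reference, the downsampling option listed above can be sketched with dplyr alone. This is an illustration rather than a recommended workflow: it operates on the full dataset for brevity, whereas in practice resampling would be applied to the training set only, and the name `default_down` is illustrative.

```r
library(dplyr)
library(ISLR)  # provides the Default data used in this lab

data("Default")

set.seed(465)

# Size of the minority class ("Yes")
n_minority <- sum(Default$default == "Yes")

# Keep all defaulters and a random sample of non-defaulters of the same size
default_down <- Default |>
  group_by(default) |>
  slice_sample(n = n_minority) |>
  ungroup()

# Both classes now have the same number of rows
class_counts <- count(default_down, default)
class_counts
```

Downsampling discards most of the majority class, so the predicted probabilities from a model trained on `default_down` no longer reflect the true default rate and should be interpreted with care.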