ECON 465 – Week 8 Lab: Classification for Economic Decision-Making
Author: Gül Ertan Özgüzer
Published: May 7, 2025
Lab Objectives
By the end of this lab, you will be able to:
Understand what classification means in economics
Implement logistic regression to predict loan default
Evaluate predictions using confusion matrices
Choose the right threshold for economic decisions
Interpret coefficient significance from model output
The Economic Question
Can we predict whether a borrower will default on their credit card payment?
Banks need to answer this question when deciding who gets a loan. In this lab, we use the Default dataset from the ISLR package. This dataset contains information on 10,000 credit card customers.
Dataset: Default
| Variable | Description |
|---|---|
| default | Whether the customer defaulted: Yes or No |
| student | Whether the customer is a student: Yes or No |
| balance | Average credit card balance in USD |
| income | Annual income in USD |
# Load packages
library(tidyverse)
library(ISLR)
library(tidymodels)
# Load data
data("Default")
# Look at the data
glimpse(Default)
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
# How many people defaulted?
table(Default$default)
No Yes
9667 333
Only 333 out of 10,000 customers defaulted, which is about 3.3%. This is a small number. We call this an imbalanced dataset because one outcome, “No default”, is much more common than the other outcome, “Default”.
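Because the classes are so unbalanced, a useful reference point is a naive rule that predicts "No" for every customer. The sketch below uses only the two counts reported above; the names `counts`, `default_rate`, and `baseline_accuracy` are illustrative.

```r
# Class counts reported by table(Default$default)
counts <- c(No = 9667, Yes = 333)

# Share of customers who defaulted
default_rate <- counts[["Yes"]] / sum(counts)

# Accuracy of a naive rule that always predicts "No"
baseline_accuracy <- counts[["No"]] / sum(counts)

cat("Default rate:", round(default_rate, 3), "\n")
cat("Accuracy of always predicting 'No':", round(baseline_accuracy, 3), "\n")
```

The naive rule is right about 96.7% of the time while catching no defaulters at all, so any model should be judged against that baseline rather than against 50%.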
We split the data into a training set, used to estimate the model, and a test set, used to evaluate how well the model predicts new, unseen data.
We will use 80% of the data for training and 20% for testing.
# Set seed for reproducibility
set.seed(465)
# Split the data: 80% training, 20% testing
default_split <- initial_split(Default, prop = 0.8)
default_train <- training(default_split)
default_test <- testing(default_split)
cat("Training set size:", nrow(default_train), "\n")
Training set size: 8000
cat("Test set size:", nrow(default_test), "\n")
Test set size: 2000
Part 3: Logistic Regression
3.1 How Logistic Regression Works
Logistic regression predicts the probability of default.
The output is always between 0 and 1, so it can be interpreted as a valid probability.
For example:
| Predicted Probability | Interpretation |
|---|---|
| 0.02 | Very low probability of default |
| 0.35 | Moderate probability of default |
| 0.80 | High probability of default |
By default, we usually use a threshold of 0.5:
If predicted probability is greater than 0.5, predict Yes.
If predicted probability is less than or equal to 0.5, predict No.
3.2 Why Not Use Ordinary Linear Regression?
We should not use ordinary linear regression for this problem because linear regression may produce predicted values below 0 or above 1.
For example, it could predict a default probability of -0.20 or 1.30. These values do not make sense as probabilities.
Logistic regression solves this problem by passing the linear prediction through the logistic (sigmoid) function, which maps any real number into the interval between 0 and 1.
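The contrast between the two approaches can be sketched on simulated data. Everything below is an illustrative assumption (the simulated `balance` values, the coefficients used to generate outcomes, and the variable names); it is not the Default dataset. `plogis()` is base R's logistic function, 1 / (1 + exp(-z)).

```r
set.seed(1)

# Simulated data (hypothetical, not the Default dataset)
n <- 500
balance <- runif(n, 0, 2500)
true_prob <- plogis(-8 + 0.005 * balance)   # logistic function: 1 / (1 + exp(-z))
default <- rbinom(n, size = 1, prob = true_prob)

# Ordinary linear regression vs. logistic regression on the same data
linear_fit <- lm(default ~ balance)
logit_fit  <- glm(default ~ balance, family = binomial)

new_customers <- data.frame(balance = c(0, 1250, 2500))
linear_pred <- predict(linear_fit, new_customers)
logit_pred  <- predict(logit_fit, new_customers, type = "response")

# Linear predictions are not restricted to [0, 1]; logistic ones are
print(round(cbind(new_customers, linear_pred, logit_pred), 3))
```

In a setup like this the straight line typically dips below zero for low balances, while the logistic predictions stay strictly between 0 and 1 by construction.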
3.3 Build the Logistic Regression Model
Now we estimate the logistic regression model using the glm() function with family = binomial.
What does glm stand for?
glm means Generalized Linear Model. It is a flexible family of regression models. When we use family = binomial, we are telling R to estimate a logistic regression model for a binary outcome.
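Written out for the model estimated below, with p denoting the probability of default, a binomial GLM with its default logit link takes the following form. The symbols β₀ to β₃ are notation introduced here for illustration, and studentYes equals 1 for students and 0 otherwise.

```latex
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1\,\text{balance} + \beta_2\,\text{income} + \beta_3\,\text{studentYes}
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{balance} + \beta_2\,\text{income} + \beta_3\,\text{studentYes})}}
```

The left-hand form is linear in the predictors, which is why the coefficients are reported on the log-odds scale; the right-hand form is what guarantees a probability between 0 and 1.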
# Logistic regression using glm()
logistic_model <- glm(
  default ~ balance + income + student,
  data = default_train,
  family = binomial
)
# View coefficients and significance
summary(logistic_model)
Call:
glm(formula = default ~ balance + income + student, family = binomial,
data = default_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.078e+01 5.516e-01 -19.551 < 2e-16 ***
balance 5.743e-03 2.617e-04 21.944 < 2e-16 ***
income 2.327e-07 9.146e-06 0.025 0.97970
studentYes -6.882e-01 2.628e-01 -2.619 0.00881 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2286.5 on 7999 degrees of freedom
Residual deviance: 1229.5 on 7996 degrees of freedom
AIC: 1237.5
Number of Fisher Scoring iterations: 8
3.4 Interpreting Coefficients and Statistical Significance
The estimated coefficients tell us how each variable is associated with the probability of default.
| Variable | Expected Sign | Interpretation |
|---|---|---|
| balance | Positive | Higher balance is associated with higher default probability |
| income | Small positive | Income has a weak effect |
| studentYes | Negative | Students have lower default probability after controlling for balance |
However, we should also check whether the coefficients are statistically significant.
In the model output, look at the column called Pr(>|z|). This column gives the p-value.
| Significance Code | p-value Range | Meaning |
|---|---|---|
| *** | p < 0.001 | Very strong evidence |
| ** | p < 0.01 | Strong evidence |
| * | p < 0.05 | Moderate evidence |
| . | p < 0.10 | Weak evidence |
| blank | p > 0.10 | Not statistically significant |
Combining each coefficient's sign with its p-value from the output above gives the following interpretation:
| Variable | Sign | p-value | Significant? | Interpretation |
|---|---|---|---|---|
| balance | Positive | < 0.001 | Yes | Higher balance is associated with higher default risk |
| income | Positive | 0.98 | No | No evidence that income affects default after controlling for balance and student status |
| studentYes | Negative | 0.009 | Yes | Students have lower default risk than non-students with the same balance and income |
Why is income not significant?
In this dataset, once we know a customer’s credit card balance, income provides almost no additional information for predicting default.
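The estimates in the summary output are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below copies the rounded estimates from that output into a vector and works through one prediction by hand. The vector `coefs` and the example customer (a non-student with a $2,000 balance and $40,000 income) are illustrative assumptions.

```r
# Rounded estimates copied from the summary() output in Section 3.3
coefs <- c(intercept = -10.78, balance = 5.743e-03,
           income = 2.327e-07, studentYes = -0.6882)

# Odds ratio for a $100 increase in balance, holding the other variables fixed
or_balance_100 <- exp(100 * coefs[["balance"]])

# Odds ratio for students relative to non-students
or_student <- exp(coefs[["studentYes"]])

# Predicted probability for a hypothetical non-student
# with a $2,000 balance and $40,000 income
log_odds <- coefs[["intercept"]] + coefs[["balance"]] * 2000 +
  coefs[["income"]] * 40000
prob <- plogis(log_odds)   # logistic function: 1 / (1 + exp(-log_odds))

cat("Odds ratio, +$100 balance:", round(or_balance_100, 2), "\n")
cat("Odds ratio, student:", round(or_student, 2), "\n")
cat("Predicted probability:", round(prob, 2), "\n")
```

With these rounded estimates, each additional $100 of balance multiplies the odds of default by roughly 1.78, and being a student roughly halves the odds at a given balance and income.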
3.5 Make Predictions on the Test Data
Now that we have estimated the model using the training data, we use it to make predictions on the test data.
We need two types of predictions:
| Prediction Type | Description |
|---|---|
| Predicted probabilities | Continuous values between 0 and 1 |
| Predicted classes | Yes or No predictions based on a threshold |
What does type = "response" mean?
It tells R to return predicted probabilities between 0 and 1 instead of raw log-odds.
# Predict probabilities on test data
logistic_probs <- predict(logistic_model, default_test, type = "response")
# Convert probabilities to Yes/No using threshold 0.5
logistic_pred <- ifelse(logistic_probs > 0.5, "Yes", "No")
# Convert predictions to factor
logistic_pred <- factor(logistic_pred, levels = c("No", "Yes"))
# View first few predictions
head(data.frame(
  Actual = default_test$default,
  Probability = round(logistic_probs, 3),
  Predicted = logistic_pred
))
Actual Probability Predicted
1 No 0.001 No
2 No 0.002 No
3 No 0.002 No
4 No 0.012 No
5 No 0.000 No
6 No 0.000 No
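The relationship between the two prediction scales can be checked directly. The sketch below uses a small simulated dataset (an assumption for illustration, not the Default data) and the `type = "link"` option of `predict()` for glm objects, which returns log-odds.

```r
set.seed(2)

# Simulated data (hypothetical)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 1.5 * x))
fit <- glm(y ~ x, family = binomial)

new_data <- data.frame(x = c(-2, 0, 2))

# Log-odds (the linear predictor) and probabilities for the same observations
log_odds <- predict(fit, new_data, type = "link")
probs    <- predict(fit, new_data, type = "response")

# Applying the logistic function to the log-odds reproduces the probabilities
print(all.equal(plogis(log_odds), probs))
```

Leaving out `type = "response"` returns the log-odds, which can be negative or greater than 1, so comparing them with a 0.5 threshold would give wrong class predictions.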
Part 4: How Good Are Our Predictions?
4.1 Confusion Matrix
A confusion matrix compares the model’s predictions to the actual outcomes.
It shows four numbers:
| | Actual: No | Actual: Yes |
|---|---|---|
| Predicted: No | True Negatives | False Negatives |
| Predicted: Yes | False Positives | True Positives |
In words:
| Component | Meaning |
|---|---|
| True Negative | The model correctly predicts no default |
| False Positive | The model predicts default, but the customer does not default |
| False Negative | The model predicts no default, but the customer actually defaults |
| True Positive | The model correctly predicts default |
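This layout is what `table()` produces when predictions are placed in the rows and actual outcomes in the columns. A minimal sketch on eight made-up customers (the vectors `actual` and `predicted` are invented for illustration):

```r
# Made-up outcomes and predictions for eight customers
actual    <- factor(c("No", "No", "No", "No", "No", "Yes", "Yes", "Yes"),
                    levels = c("No", "Yes"))
predicted <- factor(c("No", "No", "No", "No", "Yes", "No", "Yes", "Yes"),
                    levels = c("No", "Yes"))

# Rows are predictions, columns are actual outcomes
toy_confusion <- table(Predicted = predicted, Actual = actual)
print(toy_confusion)

# Index as [predicted, actual]
toy_confusion["Yes", "No"]    # false positives
toy_confusion["No", "Yes"]    # false negatives
```

Fixing the factor levels to `c("No", "Yes")` keeps both rows and both columns in the table even if the model never predicts one of the classes.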
Now we calculate three important metrics from the confusion matrix.
# Build the confusion matrix: predictions in rows, actual outcomes in columns
confusion <- table(Predicted = logistic_pred, Actual = default_test$default)
# Extract the four numbers from the confusion matrix
TN <- confusion["No", "No"]    # True Negatives
FP <- confusion["Yes", "No"]   # False Positives
FN <- confusion["No", "Yes"]   # False Negatives
TP <- confusion["Yes", "Yes"]  # True Positives
# Accuracy = correct predictions / total predictions
accuracy <- (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)
# When we predict "Default", how often are we right?
precision <- ifelse(TP + FP > 0, TP / (TP + FP), 0)
# Recall = TP / (TP + FN)
# How many actual defaulters did we catch?
recall <- TP / (TP + FN)
cat("Accuracy:", round(accuracy, 3), "\n")
Accuracy: 0.972
cat("Precision:", round(precision, 3), "\n")
Precision: 0.75
cat("Recall:", round(recall, 3), "\n")
Recall: 0.365
4.3 Understanding the Metrics for a Bank
| Metric | Question | What it means for the bank |
|---|---|---|
| Accuracy | What percentage of predictions were correct? | Overall performance |
| Precision | When we predict default, are we right? | Helps avoid rejecting good customers |
| Recall | What percentage of actual defaulters did we catch? | Helps avoid giving loans to bad customers |
Accuracy may look very high in this dataset because most customers do not default.
However, a bank is especially interested in identifying customers who are likely to default. Therefore, recall and precision are very important.
Part 5: Choosing a Threshold
5.1 What Is a Threshold?
We used 0.5 as the threshold.
This means:
If predicted probability is greater than 0.5, predict default.
If predicted probability is less than or equal to 0.5, predict no default.
However, we can choose a different threshold.
5.2 The Threshold Trade-off
| Threshold Choice | Effect |
|---|---|
| Lower threshold, such as 0.2 | Predict default more often |
| Higher threshold, such as 0.7 | Predict default less often |
A lower threshold usually increases recall because the model catches more actual defaulters. However, it also creates more false alarms, meaning that some good customers may be rejected.
A higher threshold usually increases precision because the model predicts default only when it is more certain. However, it may miss more actual defaulters.
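The trade-off can be seen by sweeping the threshold over a set of predicted probabilities. The sketch below uses simulated probabilities and outcomes (assumptions for illustration, not the lab's model), and the helper `metrics_at()` is a hypothetical name.

```r
set.seed(3)

# Simulated predicted probabilities and outcomes (hypothetical)
n <- 2000
probs  <- plogis(rnorm(n, mean = -3, sd = 2))
actual <- rbinom(n, size = 1, prob = probs)   # 1 = default, 0 = no default

# Precision and recall at a given threshold
metrics_at <- function(threshold) {
  predicted <- as.integer(probs > threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  c(threshold = threshold,
    precision = if (tp + fp > 0) tp / (tp + fp) else NA,
    recall    = tp / (tp + fn))
}

# One row per threshold
results <- t(sapply(c(0.2, 0.5, 0.7), metrics_at))
print(round(results, 3))
```

Recall can only rise as the threshold falls, because every customer flagged at 0.5 is also flagged at 0.2; precision usually, though not always, moves in the opposite direction.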
5.3 Try a Lower Threshold
# Use threshold 0.2 instead of 0.5
logistic_pred_low <- ifelse(logistic_probs > 0.2, "Yes", "No")
logistic_pred_low <- factor(logistic_pred_low, levels = c("No", "Yes"))
# New confusion matrix with lower threshold
confusion_low <- table(
  Predicted = logistic_pred_low,
  Actual = default_test$default
)
confusion_low
Actual
Predicted No Yes
No 1866 29
Yes 60 45
# Calculate recall with the lower threshold
TP_low <- confusion_low["Yes", "Yes"]
FN_low <- confusion_low["No", "Yes"]
recall_low <- TP_low / (TP_low + FN_low)
cat("Recall with threshold 0.2:", round(recall_low, 3), "\n")
Recall with threshold 0.2: 0.608
cat("Recall with threshold 0.5:", round(recall, 3), "\n")
Recall with threshold 0.5: 0.365
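Recall is only half of the picture. Precision and accuracy at the 0.2 threshold follow from the four counts in the confusion matrix printed above; the sketch below re-enters those counts by hand so that it runs on its own (the `_low` variable names follow the lab's naming but are otherwise illustrative).

```r
# Counts copied from the threshold-0.2 confusion matrix above
TN_low <- 1866   # predicted No,  actual No
FN_low <- 29     # predicted No,  actual Yes
FP_low <- 60     # predicted Yes, actual No
TP_low <- 45     # predicted Yes, actual Yes

accuracy_low  <- (TP_low + TN_low) / (TP_low + TN_low + FP_low + FN_low)
precision_low <- TP_low / (TP_low + FP_low)
recall_low    <- TP_low / (TP_low + FN_low)

cat("Accuracy:",  round(accuracy_low, 3),  "\n")
cat("Precision:", round(precision_low, 3), "\n")
cat("Recall:",    round(recall_low, 3),    "\n")
```

Compared with the 0.5 threshold, recall rises from 0.365 to 0.608, while precision falls from 0.75 to about 0.43 and accuracy slips from 0.972 to just under 0.96. The bank catches more defaulters at the price of more false alarms.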
5.4 Which Threshold Should a Bank Choose?
| Bank’s Priority | Better Threshold Choice | Why |
|---|---|---|
| Avoid giving loans to defaulters | Lower threshold, such as 0.2 | Catches more defaulters, even if some good customers are rejected |
| Avoid rejecting good customers | Higher threshold, such as 0.5 or 0.7 | Predicts default only when more confident, but misses more defaulters |
Real banks use complex profit models to find the threshold that maximizes expected profit. They balance the cost of false positives, which means rejecting a good customer, against the cost of false negatives, which means giving a loan to a customer who defaults.
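A heavily simplified version of that calculation is sketched below. The two cost figures are invented for illustration and are not estimates from any bank. The error counts at threshold 0.2 come from the confusion matrix in Section 5.3; the counts at threshold 0.5 are back-calculated from the reported precision (0.75) and recall (0.365) with 74 actual defaulters in the test set, so treat them as an assumption.

```r
# Hypothetical costs per error (invented for illustration)
cost_fp <- 1   # rejecting a customer who would have repaid
cost_fn <- 5   # lending to a customer who defaults

# False positives and false negatives at each threshold
errors <- data.frame(
  threshold = c(0.5, 0.2),
  FP = c(9, 60),    # 0.5: back-calculated (assumption); 0.2: from Section 5.3
  FN = c(47, 29)
)

# Total cost of errors at each threshold
errors$total_cost <- errors$FP * cost_fp + errors$FN * cost_fn
print(errors)

# Threshold with the lowest total cost under these assumed costs
best <- errors$threshold[which.min(errors$total_cost)]
cat("Lower-cost threshold:", best, "\n")
```

With a false negative assumed to be five times as costly as a false positive, the 0.2 threshold comes out cheaper (205 versus 244). If the two errors cost the same, 0.5 would win, since it makes 56 errors against 89.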
Part 6: Your Turn – Simple Practice
Task 1: Compare Model Complexity
Build a logistic regression model using only balance as a predictor.
Compare its accuracy with the full model, which uses balance, income, and student.
# Logistic regression with only balance
logistic_simple <- glm(
  default ~ balance,
  data = default_train,
  family = binomial
)
# Predict probabilities on test data
simple_probs <- predict(logistic_simple, default_test, type = "response")
# Convert probabilities to Yes/No predictions using threshold 0.5
simple_pred <- ifelse(simple_probs > 0.5, "Yes", "No")
simple_pred <- factor(simple_pred, levels = c("No", "Yes"))
# Calculate accuracy
simple_accuracy <- mean(simple_pred == default_test$default)
cat("Full model accuracy:", round(accuracy, 3), "\n")