ECON 465 – Week 8 Lab: Classification for Economic Decision-Making

Author

Gül Ertan Özgüzer

Published

May 7, 2025

Lab Objectives

By the end of this lab, you will be able to:

Understand what classification means in economics
Implement logistic regression to predict loan default
Evaluate predictions using confusion matrices
Choose the right threshold for economic decisions
Interpret coefficient significance from model output

The Economic Question

Can we predict whether a borrower will default on their credit card payment?

Banks need to answer this question when deciding who gets a loan. In this lab, we use the Default dataset from the ISLR package. This dataset contains information on 10,000 credit card customers.

Dataset: Default

Variable	Description
`default`	Whether the customer defaulted: Yes or No
`student`	Whether the customer is a student: Yes or No
`balance`	Average credit card balance in USD
`income`	Annual income in USD

# Load packages
library(tidyverse)
library(ISLR)
library(tidymodels)

# Load data
data("Default")

# Look at the data
glimpse(Default)

Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…

# How many people defaulted?
table(Default$default)


  No  Yes 
9667  333

Only 333 out of 10,000 customers defaulted, which is about 3.3%. This is a small number. We call this an imbalanced dataset because one outcome, “No default”, is much more common than the other outcome, “Default”.
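The class shares can be checked directly from the counts printed above. This is a minimal sketch in base R; the name default_counts is introduced here for illustration. With the data loaded, prop.table(table(Default$default)) should give the same shares.

```r
# Counts copied from the table() output above
default_counts <- c(No = 9667, Yes = 333)

# Share of each class in the full sample
default_share <- default_counts / sum(default_counts)
round(default_share, 3)
```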

# Summary statistics by default status
Default |>
  group_by(default) |>
  summarize(
    avg_balance = mean(balance),
    avg_income = mean(income),
    prop_student = mean(student == "Yes")
  )

# A tibble: 2 × 4
  default avg_balance avg_income prop_student
  <fct>         <dbl>      <dbl>        <dbl>
1 No             804.     33566.        0.291
2 Yes           1748.     32089.        0.381

Initial observation: Defaulters have higher average balances, slightly lower incomes, and are more likely to be students.

Part 1: What Is Classification?

1.1 Classification vs. Regression

Feature	Regression	Classification
What we predict	A number, such as price or GDP	A category, such as default or not
Example question	What will the house price be?	Will this customer default?
Output	150,000 TL	Yes or No

In regression, the outcome variable is numerical. In classification, the outcome variable is categorical.

In this lab, our outcome variable is default, which has two possible categories: Yes and No. Therefore, this is a binary classification problem.

1.2 Logistic Regression

Logistic regression is one of the most common methods for binary classification. It predicts the probability that an event will occur.

In this lab, logistic regression predicts the probability that a customer will default.

For example, the model may predict:

Customer A has a 2% probability of default.
Customer B has a 45% probability of default.
Customer C has a 78% probability of default.

Then we use a threshold to convert these probabilities into class predictions.
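As a small sketch of this step, the three example probabilities can be turned into class predictions with a 0.5 threshold. The vector example_probs simply holds the illustrative values for Customers A, B, and C; it is not model output.

```r
# Predicted probabilities from the example above (illustrative values)
example_probs <- c(A = 0.02, B = 0.45, C = 0.78)

# Apply a 0.5 threshold: above it we predict default ("Yes")
example_class <- ifelse(example_probs > 0.5, "Yes", "No")
example_class
```

Only Customer C is classified as Yes. Customer B, at 45%, falls just below the cutoff, so the prediction for B depends heavily on which threshold we choose.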

Part 2: Prepare the Data

2.1 Convert Variables to Factors

Why do we convert variables to factors?

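Classification functions in R expect categorical variables to be stored as factors. In Default, both columns are already factors (note the <fct> labels in the glimpse() output), so this step mainly fixes the order of the levels, with No first and Yes second.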

# Convert Yes/No variables to factors
Default <- Default |>
  mutate(
    default = factor(default, levels = c("No", "Yes")),
    student = factor(student, levels = c("No", "Yes"))
  )
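The level order matters because glm() with family = binomial treats the first factor level as the reference outcome and models the probability of the other level. A quick check on a small made-up vector (toy_default is a hypothetical name, not part of the lab data):

```r
# A small illustrative vector, not taken from the Default data
toy_default <- factor(c("No", "Yes", "No", "No"), levels = c("No", "Yes"))

# The first level is the reference category
levels(toy_default)

# Equivalent 0/1 coding: No = 0, Yes = 1
as.integer(toy_default) - 1
```

With "No" listed first, the fitted probabilities later in the lab are probabilities of default = Yes.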

2.2 Split the Data into Training and Test Sets

We need two datasets:

Dataset	Purpose
Training set	Used to build, or estimate, the model
Test set	Used to evaluate how well the model predicts new, unseen data

We will use 80% of the data for training and 20% for testing.

# Set seed for reproducibility
set.seed(465)

# Split the data: 80% training, 20% testing
default_split <- initial_split(Default, prop = 0.8)

default_train <- training(default_split)
default_test <- testing(default_split)

cat("Training set size:", nrow(default_train), "\n")

Training set size: 8000

cat("Test set size:", nrow(default_test), "\n")

Test set size: 2000
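Because only about 3% of customers default, a purely random split can leave the training and test sets with slightly different default rates. One common remedy is a stratified split, which samples 80% within each class. The sketch below does this by hand in base R on a synthetic outcome vector rebuilt from the class counts shown earlier; all object names are hypothetical. In rsample, initial_split() also accepts a strata argument for the same purpose.

```r
set.seed(465)

# Synthetic outcome with the same class counts as Default (9667 No, 333 Yes)
outcome <- factor(rep(c("No", "Yes"), times = c(9667, 333)),
                  levels = c("No", "Yes"))

# Sample 80% of the row indices within each class
idx_by_class <- split(seq_along(outcome), outcome)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = floor(0.8 * length(idx)))
}))

# Default rate in each part of the split
train_rate <- mean(outcome[train_idx] == "Yes")
test_rate  <- mean(outcome[-train_idx] == "Yes")
round(c(train = train_rate, test = test_rate), 4)
```

The two rates should be nearly identical, which is the point of stratifying.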

Part 3: Logistic Regression

3.1 How Logistic Regression Works

Logistic regression predicts the probability of default.

The output is always between 0 and 1, so it can be interpreted as a valid probability.

For example:

Predicted Probability	Interpretation
0.02	Very low probability of default
0.35	Moderate probability of default
0.80	High probability of default

The conventional starting threshold is 0.5:

If predicted probability is greater than 0.5, predict Yes.
If predicted probability is less than or equal to 0.5, predict No.

3.2 Why Not Use Ordinary Linear Regression?

We should not use ordinary linear regression for this problem because linear regression may produce predicted values below 0 or above 1.

For example, it could predict a default probability of -0.20 or 1.30. These values do not make sense as probabilities.
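This problem is easy to reproduce. The sketch below fits an ordinary linear regression to a 0/1 outcome on a small synthetic dataset (the toy data frame and its cutoff at 1800 are made up for illustration) and inspects the range of fitted values.

```r
# Synthetic data: balances from 0 to 2600, default only above 1800
toy <- data.frame(balance = seq(0, 2600, by = 100))
toy$default01 <- as.integer(toy$balance > 1800)

# Linear probability model: regress the 0/1 outcome on balance
lpm <- lm(default01 ~ balance, data = toy)

# Range of fitted "probabilities"
range(fitted(lpm))
```

The smallest fitted value is negative, which cannot be a probability.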

Logistic regression solves this problem by using a special function that maps predictions into the interval between 0 and 1.
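The special function is the logistic function, p = 1 / (1 + exp(-z)), where z is the linear combination of the predictors. A short sketch in base R follows; logistic() is a helper defined here for illustration, and plogis() is the built-in equivalent from the stats package.

```r
# Logistic function written out by hand
logistic <- function(z) 1 / (1 + exp(-z))

# Example values of the linear predictor, from very negative to very positive
z <- c(-10, -2, 0, 2, 10)
p <- logistic(z)
round(p, 4)

# The built-in plogis() gives the same values
all.equal(p, plogis(z))
```

However large or small z becomes, the result stays strictly between 0 and 1, and z = 0 corresponds to a probability of 0.5.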

3.3 Build the Logistic Regression Model

Now we estimate the logistic regression model using the glm() function with family = binomial.

What does glm stand for?

glm means Generalized Linear Model. It is a flexible family of regression models. When we use family = binomial, we are telling R to estimate a logistic regression model for a binary outcome.

# Logistic regression using glm()
logistic_model <- glm(
  default ~ balance + income + student,
  data = default_train,
  family = binomial
)

# View coefficients and significance
summary(logistic_model)


Call:
glm(formula = default ~ balance + income + student, family = binomial, 
    data = default_train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.078e+01  5.516e-01 -19.551  < 2e-16 ***
balance      5.743e-03  2.617e-04  21.944  < 2e-16 ***
income       2.327e-07  9.146e-06   0.025  0.97970    
studentYes  -6.882e-01  2.628e-01  -2.619  0.00881 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2286.5  on 7999  degrees of freedom
Residual deviance: 1229.5  on 7996  degrees of freedom
AIC: 1237.5

Number of Fisher Scoring iterations: 8

3.4 Interpreting Coefficients and Statistical Significance

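Each estimated coefficient is the change in the log-odds of default for a one-unit increase in that variable, holding the other variables fixed. A positive coefficient means a higher predicted probability of default, and a negative coefficient means a lower one.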

Variable	Estimated Sign	Interpretation
`balance`	Positive	Higher balance is associated with higher default probability
`income`	Positive, near zero	Income has almost no association with default once balance is in the model
`studentYes`	Negative	Students have lower default probability after controlling for balance

However, we should also check whether the coefficients are statistically significant.

In the model output, look at the column called Pr(>|z|). This column gives the p-value.

Significance Code	p-value Range	Meaning
`***`	p < 0.001	Very strong evidence
`**`	0.001 ≤ p < 0.01	Strong evidence
`*`	0.01 ≤ p < 0.05	Moderate evidence
`.`	0.05 ≤ p < 0.10	Weak evidence
blank	p ≥ 0.10	Not statistically significant

Combining the signs with the p-values from the output above gives the following interpretation:

Variable	Sign	p-value	Significant?	Interpretation
`balance`	Positive	< 2e-16	Yes	Higher balance is associated with higher default risk
`income`	Positive	0.980	No	No evidence that income affects default after controlling for balance
`studentYes`	Negative	0.009	Yes	Students have lower default risk after accounting for balance

Why is income not significant?

In this dataset, once we know a customer’s credit card balance, income provides almost no additional information for predicting default.
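Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to explain. The sketch below copies the estimates from the summary() output above; the 100 USD step for balance is an arbitrary choice for readability. With the fitted object available, exp(coef(logistic_model)) would give the per-unit odds ratios directly.

```r
# Estimates copied from the full-model output above
coefs <- c(balance = 5.743e-03, income = 2.327e-07, studentYes = -6.882e-01)

# Odds ratio for a 100 USD increase in balance
or_balance_100 <- exp(100 * coefs[["balance"]])

# Odds ratio for students relative to non-students
or_student <- exp(coefs[["studentYes"]])

round(c(balance_per_100 = or_balance_100, student = or_student), 2)
```

These come out to roughly 1.78 and 0.50. A balance that is 100 USD higher multiplies the odds of default by about 1.8, and a student has about half the odds of a non-student with the same balance and income.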

3.5 Make Predictions on the Test Data

Now that we have estimated the model using the training data, we use it to make predictions on the test data.

We need two types of predictions:

Prediction Type	Description
Predicted probabilities	Continuous values between 0 and 1
Predicted classes	Yes or No predictions based on a threshold

What does type = "response" mean?

It tells R to return predicted probabilities between 0 and 1 instead of raw log-odds.

# Predict probabilities on test data
logistic_probs <- predict(logistic_model, default_test, type = "response")

# Convert probabilities to Yes/No using threshold 0.5
logistic_pred <- ifelse(logistic_probs > 0.5, "Yes", "No")

# Convert predictions to factor
logistic_pred <- factor(logistic_pred, levels = c("No", "Yes"))

# View first few predictions
head(data.frame(
  Actual = default_test$default,
  Probability = round(logistic_probs, 3),
  Predicted = logistic_pred
))

  Actual Probability Predicted
1     No       0.001        No
2     No       0.002        No
3     No       0.002        No
4     No       0.012        No
5     No       0.000        No
6     No       0.000        No
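To see what type = "response" does, the calculation can be reproduced by hand for a single customer: first compute the log-odds from the coefficients, then pass them through the logistic function. The customer below (balance of 1,500 USD, income of 40,000 USD, not a student) is hypothetical, and the coefficients are copied from the summary() output in Section 3.3.

```r
# Coefficients copied from the full-model output
b <- c(intercept = -1.078e+01, balance = 5.743e-03,
       income = 2.327e-07, studentYes = -6.882e-01)

# Hypothetical customer: balance 1500, income 40000, not a student
log_odds <- b[["intercept"]] + b[["balance"]] * 1500 +
  b[["income"]] * 40000 + b[["studentYes"]] * 0

# type = "link" would return log_odds; type = "response" returns this:
prob <- plogis(log_odds)
round(c(log_odds = log_odds, probability = prob), 3)
```

The predicted probability is about 0.10, so this customer would be classified as No at a 0.5 threshold.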

Part 4: How Good Are Our Predictions?

4.1 Confusion Matrix

A confusion matrix compares the model’s predictions to the actual outcomes.

It shows four numbers:

	Actual: No	Actual: Yes
Predicted: No	True Negatives	False Negatives
Predicted: Yes	False Positives	True Positives

In words:

Component	Meaning
True Negative	The model correctly predicts no default
False Positive	The model predicts default, but the customer does not default
False Negative	The model predicts no default, but the customer actually defaults
True Positive	The model correctly predicts default

# Create confusion matrix
confusion <- table(
  Predicted = logistic_pred,
  Actual = default_test$default
)

confusion

         Actual
Predicted   No  Yes
      No  1917   47
      Yes    9   27

4.2 Calculate Performance Metrics

Now we calculate three important metrics from the confusion matrix.

# Extract the four numbers from the confusion matrix
TN <- confusion["No", "No"]     # True Negatives
FP <- confusion["Yes", "No"]    # False Positives
FN <- confusion["No", "Yes"]    # False Negatives
TP <- confusion["Yes", "Yes"]   # True Positives

# Accuracy = correct predictions / total predictions
accuracy <- (TP + TN) / (TP + TN + FP + FN)

# Precision = TP / (TP + FP)
# When we predict "Default", how often are we right?
precision <- ifelse(TP + FP > 0, TP / (TP + FP), 0)

# Recall = TP / (TP + FN)
# How many actual defaulters did we catch?
recall <- TP / (TP + FN)

cat("Accuracy:", round(accuracy, 3), "\n")

Accuracy: 0.972

cat("Precision:", round(precision, 3), "\n")

Precision: 0.75

cat("Recall:", round(recall, 3), "\n")

Recall: 0.365

4.3 Understanding the Metrics for a Bank

Metric	Question	What it means for the bank
Accuracy	What percentage of predictions were correct?	Overall performance
Precision	When we predict default, are we right?	Helps avoid rejecting good customers
Recall	What percentage of actual defaulters did we catch?	Helps avoid giving loans to bad customers

Accuracy may look very high in this dataset because most customers do not default.

However, a bank is especially interested in identifying customers who are likely to default. Therefore, recall and precision are very important.
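A useful benchmark is the accuracy of a rule that predicts No for every customer. The sketch below computes it from the test-set counts in the confusion matrix above; the variable names are introduced here for illustration.

```r
# Test-set counts from the confusion matrix above
actual_no  <- 1917 + 9    # customers who did not default
actual_yes <- 47 + 27     # customers who defaulted

# Accuracy of a rule that predicts "No" for every customer
baseline_accuracy <- actual_no / (actual_no + actual_yes)
round(baseline_accuracy, 3)
```

This naive rule reaches 0.963 without catching a single defaulter, so the model's accuracy of 0.972 is less than one percentage point better than doing nothing.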

Part 5: Choosing a Threshold

5.1 What Is a Threshold?

We used 0.5 as the threshold.

This means:

If predicted probability is greater than 0.5, predict default.
If predicted probability is less than or equal to 0.5, predict no default.

However, we can choose a different threshold.

5.2 The Threshold Trade-off

Threshold Choice	Effect
Lower threshold, such as 0.2	Predict default more often
Higher threshold, such as 0.7	Predict default less often

A lower threshold usually increases recall because the model catches more actual defaulters. However, it also creates more false alarms, meaning that some good customers may be rejected.

A higher threshold usually increases precision because the model predicts default only when it is more certain. However, it may miss more actual defaulters.

5.3 Try a Lower Threshold

# Use threshold 0.2 instead of 0.5
logistic_pred_low <- ifelse(logistic_probs > 0.2, "Yes", "No")

logistic_pred_low <- factor(logistic_pred_low, levels = c("No", "Yes"))

# New confusion matrix with lower threshold
confusion_low <- table(
  Predicted = logistic_pred_low,
  Actual = default_test$default
)

confusion_low

         Actual
Predicted   No  Yes
      No  1866   29
      Yes   60   45

# Calculate recall with the lower threshold
TP_low <- confusion_low["Yes", "Yes"]
FN_low <- confusion_low["No", "Yes"]

recall_low <- TP_low / (TP_low + FN_low)

cat("Recall with threshold 0.2:", round(recall_low, 3), "\n")

Recall with threshold 0.2: 0.608

cat("Recall with threshold 0.5:", round(recall, 3), "\n")

Recall with threshold 0.5: 0.365
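The gain in recall has a price in precision, which can be checked from the same two confusion matrices. A small sketch using the printed counts (the metrics data frame is introduced here for illustration):

```r
# Counts copied from the two confusion matrices above
metrics <- data.frame(
  threshold = c(0.5, 0.2),
  TP = c(27, 45),
  FP = c(9, 60),
  FN = c(47, 29)
)

# Precision and recall at each threshold
metrics$precision <- metrics$TP / (metrics$TP + metrics$FP)
metrics$recall    <- metrics$TP / (metrics$TP + metrics$FN)

round(metrics[, c("threshold", "precision", "recall")], 3)
```

Lowering the threshold from 0.5 to 0.2 raises recall from 0.365 to 0.608 but cuts precision from 0.75 to about 0.43, because 60 good customers are now flagged instead of 9.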

5.4 Which Threshold Should a Bank Choose?

Bank’s Priority	Better Threshold Choice	Why
Avoid giving loans to defaulters	Lower threshold, such as 0.2	Catches more defaulters, even if some good customers are rejected
Avoid rejecting good customers	Higher threshold, such as 0.5 or 0.7	Predicts default only when more confident, but misses more defaulters

Real banks use complex profit models to find the threshold that maximizes expected profit. They balance the cost of false positives, which means rejecting a good customer, against the cost of false negatives, which means giving a loan to a customer who defaults.
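A very simplified version of this idea is to attach a cost to each type of error and compare thresholds by total cost. In the sketch below, the costs (1 unit per false positive, 5 units per false negative) are purely hypothetical and chosen only for illustration; the error counts are copied from the two confusion matrices in this lab.

```r
# Hypothetical costs, chosen only for illustration (not from the lab)
cost_fp <- 1   # cost of rejecting a good customer
cost_fn <- 5   # cost of lending to a customer who defaults

# Error counts copied from the two confusion matrices above
errors <- data.frame(
  threshold = c(0.5, 0.2),
  FP = c(9, 60),
  FN = c(47, 29)
)

# Total cost of the errors made at each threshold
errors$total_cost <- errors$FP * cost_fp + errors$FN * cost_fn
errors

# Threshold with the lowest total cost under these assumptions
errors$threshold[which.min(errors$total_cost)]
```

Under these assumed costs the 0.2 threshold is cheaper (205 versus 244). If both errors cost the same, the ranking reverses (89 versus 56), which is why the choice depends on the bank's actual costs.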

Part 6: Your Turn – Simple Practice

Task 1: Compare Model Complexity

Build a logistic regression model using only balance as a predictor.

Compare its accuracy with the full model, which uses balance, income, and student.

# Logistic regression with only balance
logistic_simple <- glm(
  default ~ balance,
  data = default_train,
  family = binomial
)

# Predict probabilities on test data
simple_probs <- predict(logistic_simple, default_test, type = "response")

# Convert probabilities to Yes/No predictions using threshold 0.5
simple_pred <- ifelse(simple_probs > 0.5, "Yes", "No")

simple_pred <- factor(simple_pred, levels = c("No", "Yes"))

# Calculate accuracy
simple_accuracy <- mean(simple_pred == default_test$default)

cat("Full model accuracy:", round(accuracy, 3), "\n")

Full model accuracy: 0.972

cat("Balance-only accuracy:", round(simple_accuracy, 3), "\n")

Balance-only accuracy: 0.97

Question: Does adding income and student improve accuracy? Given that income was not statistically significant, would you expect it to help?

Write your answer here:

Task 2: Interpret Your Results

Look at the coefficients from your simple model using only balance.

summary(logistic_simple)


Call:
glm(formula = default ~ balance, family = binomial, data = default_train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.070e+01  4.095e-01  -26.14   <2e-16 ***
balance      5.524e-03  2.498e-04   22.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2286.5  on 7999  degrees of freedom
Residual deviance: 1247.8  on 7998  degrees of freedom
AIC: 1251.8

Number of Fisher Scoring iterations: 8

Answer the following questions:

Is the balance coefficient positive?
Is the balance coefficient statistically significant?
What does this mean in economic terms?

Write your observations here:

Summary: What We Learned Today

Logistic Regression Summary

Concept	Key Idea
GLM	Generalized Linear Model. `glm()` with `family = binomial` gives logistic regression
Classification	Predicting categories, such as Yes or No, rather than numbers
Training vs. test set	Train on 80%, evaluate on 20% to assess how well the model generalizes
`type = "response"`	Tells R to return probabilities from 0 to 1 instead of log-odds
Confusion matrix	Compares predictions to actual outcomes
Accuracy	Overall correct predictions divided by total predictions
Precision	When we say “Default”, how often are we right?
Recall	How many actual defaulters did we catch?
Threshold	Cutoff for predicting default
Statistical significance	p-values and stars in the model output show how strong the evidence is that a coefficient differs from zero

Key Takeaways

Balance is the most important predictor of credit default. It is highly statistically significant in both models (p < 2e-16).

Income is not statistically significant in the full model. Once we know balance, income adds little additional information.

Students have lower default risk after accounting for their higher balances.

Accuracy alone can be misleading, especially for imbalanced data.

For imbalanced data, recall and precision are often more informative than accuracy.

Threshold choice depends on bank priorities. There is no single correct threshold.

Take-Home Message

Banks must balance catching defaulters, which is measured by recall, against rejecting good customers, which is related to precision.

There is no perfect model or threshold. The right choice depends on the economic costs of errors.

Always check statistical significance before interpreting coefficients.

Glossary of Functions

Function	What it does
`initial_split()`	Splits data into training and test sets
`training()`	Extracts the training data
`testing()`	Extracts the test data
`glm(..., family = binomial)`	Builds a logistic regression model
`predict(type = "response")`	Gets predicted probabilities from a logistic regression model
`ifelse()`	Converts probabilities into Yes/No predictions based on a threshold
`table()`	Creates a confusion matrix
`summary(model)`	Shows coefficients, p-values, and significance stars
`round()`	Rounds numerical values
`cat()`	Prints text and results clearly