ECON 465 – Week 8 Lab: Classification for Economic Decision-Making

Author

Gül Ertan Özgüzer

Published

April 30, 2025

Lab Objectives

By the end of this lab, you will be able to:

  • Distinguish between classification and regression problems
  • Implement logistic regression for binary economic outcomes using the Default dataset
  • Evaluate classifiers using confusion matrices, accuracy, precision, recall, and ROC curves
  • Choose appropriate probability thresholds for economic decision-making
  • Understand the challenges of class imbalance in economic data

The Economic Question

Can we predict whether a borrower will default on their credit card payment based on their financial characteristics? How can banks use such predictions to make lending decisions that balance profit and risk?

In this lab, we use the Default dataset from the ISLR package – a real dataset on credit card default. This is a binary classification problem: the outcome is either “Yes” (default) or “No” (no default).


Dataset: Default (from ISLR)

The Default dataset contains information on 10,000 credit card customers. The variables are:

Variable Description
default Whether the customer defaulted (Yes/No)
student Whether the customer is a student (Yes/No)
balance Average credit card balance (in USD)
income Annual income (in thousand USD)
# Load required packages
library(tidyverse)
library(tidymodels)
library(ISLR)
library(ggplot2)
library(yardstick)

# Load the dataset
data("Default")

# Look at the data
glimpse(Default)
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
# Check the proportion of default
table(Default$default)

  No  Yes 
9667  333 
prop.table(table(Default$default))

    No    Yes 
0.9667 0.0333 

Note: Only about 3.3% of customers default – this is an imbalanced dataset, which we will discuss later.

# Summary statistics by default status
Default |>
  group_by(default) |>
  summarize(
    avg_balance = mean(balance),
    avg_income = mean(income),
    prop_student = mean(student == "Yes")
  )
# A tibble: 2 × 4
  default avg_balance avg_income prop_student
  <fct>         <dbl>      <dbl>        <dbl>
1 No             804.     33566.        0.291
2 Yes           1748.     32089.        0.381

Initial Observation: Defaulters have higher average balances, slightly lower incomes, and are more likely to be students.


Part 1: Binary Classification – What Is It?

1.1 Classification vs. Regression

Regression Classification
Outcome Continuous (e.g., price, GDP) Categorical (e.g., default/no default)
Goal Predict a number Predict a class
Example “What will the house price be?” “Will this borrower default?”
Evaluation RMSE, R² Accuracy, Precision, Recall

In this lab, we focus on binary classification – only two possible outcomes: “Yes” (default) or “No” (no default).

1.2 Why Not Linear Regression?

We could try to predict default using linear regression (coding Yes=1, No=0). But:

  • Predicted values might fall outside [0,1] (e.g., -0.2 or 1.5), which makes no sense for a probability.
  • The relationship between predictors and probability is rarely linear.

Logistic regression solves this by modeling the probability of default, not the class itself.


Part 2: Exploratory Data Analysis for Classification

Before building models, let’s visualize the relationships.

# Balance distribution by default status
ggplot(Default, aes(x = default, y = balance, fill = default)) +
  geom_boxplot() +
  labs(
    title = "Credit Card Balance by Default Status",
    x = "Default",
    y = "Balance (USD)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Income distribution by default status
ggplot(Default, aes(x = default, y = income, fill = default)) +
  geom_boxplot() +
  labs(
    title = "Income by Default Status",
    x = "Default",
    y = "Income (USD)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Scatterplot: Balance vs. Income, colored by default
ggplot(Default, aes(x = balance, y = income, color = default)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Default Status: Balance vs. Income",
    x = "Balance (USD)",
    y = "Income (USD)"
  ) +
  theme_minimal()

Observations: Defaulters tend to have higher balances and appear more concentrated in the low-income, high-balance region.


Part 3: Logistic Regression

3.1 How Logistic Regression Works

Instead of modeling default directly, logistic regression models the log-odds of default:

\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots\]

where \(p\) is the probability of default. Then:

\[p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots)}}\]

This function (called the sigmoid or logistic function) always outputs a value between 0 and 1 – a valid probability.

3.2 Data Preparation

We need to convert categorical variables (default and student) into factor form for modeling.

# Convert default and student to factors (for classification)
Default <- Default |>
  mutate(
    default = factor(default, levels = c("No", "Yes")),
    student = factor(student, levels = c("No", "Yes"))
  )

3.3 Train/Test Split

# Set seed for reproducibility
set.seed(465)

# Split the data (80% training, 20% testing)
default_split <- initial_split(Default, prop = 0.8)
default_train <- training(default_split)
default_test <- testing(default_split)

cat("Training set size:", nrow(default_train), "\n")
Training set size: 8000 
cat("Test set size:", nrow(default_test), "\n")
Test set size: 2000 

3.4 Train Logistic Regression Model

# Fit logistic regression
logistic_model <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification") |>
  fit(default ~ balance + income + student, data = default_train)

# View model coefficients
tidy(logistic_model)
# A tibble: 4 × 5
  term             estimate  std.error statistic   p.value
  <chr>               <dbl>      <dbl>     <dbl>     <dbl>
1 (Intercept) -10.8         0.552       -19.6    4.03e- 85
2 balance       0.00574     0.000262     21.9    9.85e-107
3 income        0.000000233 0.00000915    0.0254 9.80e-  1
4 studentYes   -0.688       0.263        -2.62   8.81e-  3

Interpreting coefficients:

Variable Coefficient Interpretation
balance Positive Higher balance → higher default probability
income Small positive Weak positive effect (may not be significant)
studentYes Negative Students have lower default probability (holding balance constant)

Wait: Earlier we saw students default more often? That’s because students have higher balances on average. After controlling for balance, being a student actually reduces default risk.

3.5 Making Predictions

# Predict on test set
test_predictions <- default_test |>
  bind_cols(predict(logistic_model, default_test, type = "prob")) |>
  bind_cols(predict(logistic_model, default_test, type = "class"))

# View first few predictions
test_predictions |> select(default, .pred_No, .pred_Yes, .pred_class) |> head()
  default  .pred_No    .pred_Yes .pred_class
1      No 0.9988616 1.138365e-03          No
2      No 0.9980995 1.900505e-03          No
3      No 0.9976194 2.380629e-03          No
4      No 0.9877215 1.227853e-02          No
5      No 0.9999790 2.096526e-05          No
6      No 0.9997838 2.162200e-04          No
  • .pred_No: predicted probability of no default
  • .pred_Yes: predicted probability of default
  • .pred_class: predicted class using default threshold of 0.5

3.6 Confusion Matrix and Performance Metrics

# Confusion matrix
conf_matrix <- test_predictions |>
  conf_mat(truth = default, estimate = .pred_class)
conf_matrix
          Truth
Prediction   No  Yes
       No  1917   47
       Yes    9   27
# Calculate metrics
test_predictions |>
  metrics(truth = default, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.972
2 kap      binary         0.478
# Detailed metrics
test_predictions |>
  accuracy(truth = default, estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.972
test_predictions |>
  precision(truth = default, estimate = .pred_class)
# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.976
test_predictions |>
  recall(truth = default, estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary         0.995

Interpretation:

Metric Value Meaning
Accuracy ~97% Overall, 97% of predictions are correct
Precision ~75% When predicting default, 75% actually default
Recall ~25% We catch only 25% of actual defaults

Why is recall so low? The dataset is imbalanced (only 3.3% default). The model rarely predicts default. This is a problem for real-world applications!

3.7 The Confusion Matrix Components

# Extract confusion matrix numbers
conf_matrix_values <- conf_matrix$table
TN <- conf_matrix_values[1,1]  # True Negatives
FP <- conf_matrix_values[1,2]  # False Positives
FN <- conf_matrix_values[2,1]  # False Negatives
TP <- conf_matrix_values[2,2]  # True Positives

cat("True Negatives (correctly predicted No):", TN, "\n")
True Negatives (correctly predicted No): 1917 
cat("False Positives (predicted Yes, actually No):", FP, "\n")
False Positives (predicted Yes, actually No): 47 
cat("False Negatives (predicted No, actually Yes):", FN, "\n")
False Negatives (predicted No, actually Yes): 9 
cat("True Positives (correctly predicted Yes):", TP, "\n")
True Positives (correctly predicted Yes): 27 
Predicted: No Predicted: Yes
Actual: No TN = 965 FP = 8
Actual: Yes FN = 23 TP = 4

The bank’s problem: The model misses 23 out of 27 actual defaulters (low recall). This could be costly for the bank.

3.8 Choosing a Threshold

The default threshold is 0.5, but we can adjust it to balance precision and recall.

# Histogram of predicted probabilities
ggplot(test_predictions, aes(x = .pred_Yes)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
  geom_vline(xintercept = 0.5, color = "red", linetype = "dashed") +
  labs(
    title = "Distribution of Predicted Default Probabilities",
    subtitle = "Red line: default threshold (0.5). Most probabilities are near 0.",
    x = "Predicted Probability of Default",
    y = "Number of Loans"
  ) +
  theme_minimal()

# Try a lower threshold (e.g., 0.2)
test_predictions_low <- test_predictions |>
  mutate(.pred_class_low = factor(ifelse(.pred_Yes > 0.2, "Yes", "No"), 
                                  levels = c("No", "Yes")))

# New confusion matrix
conf_mat_low <- test_predictions_low |>
  conf_mat(truth = default, estimate = .pred_class_low)
conf_mat_low
          Truth
Prediction   No  Yes
       No  1866   29
       Yes   60   45
# New recall
recall_low <- test_predictions_low |>
  recall(truth = default, estimate = .pred_class_low)
recall_low
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 recall  binary         0.969

Trade-off: Lowering the threshold from 0.5 to 0.2:

  • Increases recall (catch more defaulters)
  • Decreases precision (more false alarms)

The optimal threshold depends on the cost of a false positive vs. cost of a false negative for the bank.


Part 4: k-Nearest Neighbors (k-NN) for Classification

4.1 How k-NN Works

k-NN is a non-parametric classifier:

  1. For a new observation, find the k closest training examples (neighbors)
  2. Predict the majority class among those neighbors

Important: k-NN requires scaled/normalized features because it uses distance.

4.2 Prepare Data for k-NN (Scale Features)

# Create scaled version of predictors
default_train_scaled <- default_train |>
  mutate(
    balance_scaled = scale(balance),
    income_scaled = scale(income)
  )

default_test_scaled <- default_test |>
  mutate(
    balance_scaled = scale(balance),
    income_scaled = scale(income)
  )

4.3 Train k-NN Model

# Train k-NN model (k = 10) using scaled predictors
knn_model <- nearest_neighbor(neighbors = 10) |>
  set_engine("kknn") |>
  set_mode("classification") |>
  fit(default ~ balance_scaled + income_scaled + student, 
      data = default_train_scaled)

# Predict on test set
knn_predictions <- default_test_scaled |>
  bind_cols(predict(knn_model, default_test_scaled, type = "class"))

# Evaluate
knn_predictions |>
  metrics(truth = default, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.968
2 kap      binary         0.389

4.4 Try Different k Values

# Function to evaluate k-NN with different k values
evaluate_knn <- function(k_value) {
  model <- nearest_neighbor(neighbors = k_value) |>
    set_engine("kknn") |>
    set_mode("classification") |>
    fit(default ~ balance_scaled + income_scaled + student, 
        data = default_train_scaled)
  
  preds <- default_test_scaled |>
    bind_cols(predict(model, default_test_scaled, type = "class"))
  
  acc <- preds |> 
    accuracy(truth = default, estimate = .pred_class) |> 
    pull(.estimate)
  return(acc)
}

# Test different k values
k_values <- c(1, 5, 10, 15, 20, 50, 100)
k_accuracies <- sapply(k_values, evaluate_knn)

results <- data.frame(k = k_values, accuracy = k_accuracies)
results
    k accuracy
1   1   0.9570
2   5   0.9670
3  10   0.9675
4  15   0.9720
5  20   0.9720
6  50   0.9715
7 100   0.9695
# Plot accuracy vs. k
ggplot(results, aes(x = k, y = accuracy)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(size = 2) +
  labs(
    title = "k-NN Accuracy vs. Number of Neighbors (k)",
    x = "Number of Neighbors (k)",
    y = "Test Accuracy"
  ) +
  theme_minimal()

Finding: Small k (1, 5) may overfit; large k (50, 100) may underfit. The optimal k is around 10-15.

4.5 Comparing Logistic Regression vs. k-NN

# Compare best k-NN accuracy with logistic regression
logistic_accuracy <- test_predictions |>
  accuracy(truth = default, estimate = .pred_class) |>
  pull(.estimate)

best_knn_accuracy <- max(k_accuracies)

cat("Logistic Regression Accuracy:", round(logistic_accuracy, 3), "\n")
Logistic Regression Accuracy: 0.972 
cat("Best k-NN Accuracy (k =", k_values[which.max(k_accuracies)], "):", 
    round(best_knn_accuracy, 3), "\n")
Best k-NN Accuracy (k = 15 ): 0.972 

Discussion: Logistic regression often performs well on this dataset because the relationship is roughly linear. k-NN can capture non-linear patterns but requires careful tuning and feature scaling.


Part 5: ROC Curves and AUC

5.1 What Is an ROC Curve?

The ROC (Receiver Operating Characteristic) curve plots:

  • True Positive Rate (Recall) on the y-axis
  • False Positive Rate (1 - Specificity) on the x-axis

It shows how the classifier performs across all possible thresholds.

# ROC curve for logistic regression
test_predictions |>
  roc_curve(truth = default, .pred_Yes) |>
  autoplot()

5.2 AUC (Area Under the Curve)

AUC measures the classifier’s ability to distinguish between classes:

  • AUC = 0.5: Random guessing
  • AUC = 1.0: Perfect classifier
# Calculate AUC for logistic regression
test_predictions |>
  roc_auc(truth = default, .pred_Yes)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary        0.0539

Interpretation: AUC > 0.9 indicates excellent discrimination – the model can effectively separate defaulters from non-defaulters.

# Compare with k-NN (need probability predictions)
knn_model_prob <- nearest_neighbor(neighbors = 10) |>
  set_engine("kknn") |>
  set_mode("classification") |>
  fit(default ~ balance_scaled + income_scaled + student, 
      data = default_train_scaled)

knn_probs <- default_test_scaled |>
  bind_cols(predict(knn_model_prob, default_test_scaled, type = "prob"))

knn_probs |>
  roc_auc(truth = default, .pred_Yes)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.130

Part 6: Class Imbalance – A Real Challenge

6.1 The Problem

Only 3.3% of observations are defaults. A model that predicts “No” for everyone would have 96.7% accuracy but would be useless for identifying defaulters.

# What if we predicted "No" for everyone?
baseline_accuracy <- nrow(filter(Default, default == "No")) / nrow(Default)
cat("Baseline accuracy (predict 'No' for all):", round(baseline_accuracy, 3), "\n")
Baseline accuracy (predict 'No' for all): 0.967 

Recall = 0 for the baseline model – it catches no defaulters.

6.2 Addressing Imbalance

Options for handling class imbalance:

  • Use appropriate metrics (precision, recall, F1, AUC) rather than accuracy
  • Adjust the threshold (as we did earlier)
  • Collect more data (if possible)
  • Use sampling techniques (SMOTE, upsampling, downsampling) – beyond this lab
# F1 score (harmonic mean of precision and recall)
test_predictions |>
  f_meas(truth = default, estimate = .pred_class)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 f_meas  binary         0.986

The F1 score balances precision and recall – useful for imbalanced data.


Part 7: Your Turn – Practice

Task: Build and Compare Models

  1. Train a logistic regression model using only balance as a predictor. How does accuracy compare to the full model?

  2. Train a k-NN model with k = 5 using balance and income only (no student). What accuracy do you get?

  3. Calculate the AUC for both models. Which model discriminates better?

  4. Based on your results, which model would you recommend to a bank? Why?

# Your code here - 1. Logistic regression with balance only
# Your code here - 2. k-NN (k=5) with balance and income only
# Your code here - 3. AUC for both models

Write your answers here:


Glossary of Functions Used

Function What it does
logistic_reg() Specifies logistic regression model
nearest_neighbor() Specifies k-NN model
set_engine() Chooses the underlying package (glm, kknn)
set_mode("classification") Sets the problem type
fit() Trains the model
predict(..., type = "prob") Returns predicted probabilities
predict(..., type = "class") Returns predicted classes
conf_mat() Creates confusion matrix
accuracy(), precision(), recall(), f_meas() Calculates metrics
roc_curve() Creates ROC curve
roc_auc() Calculates AUC

Summary: What We Learned Today

  • Classification predicts a category, not a number.
  • Logistic regression models the probability of an event (default) and outputs values between 0 and 1.
  • Threshold selection matters for economic decisions: lowering threshold increases recall but decreases precision.
  • Confusion matrices and metrics (accuracy, precision, recall, F1) help evaluate classifiers.
  • ROC curves and AUC measure discrimination ability across all thresholds.
  • k-NN is a non‑parametric alternative but requires feature scaling and tuning of k.
  • Class imbalance is common in economic data (few defaults, few frauds). Accuracy can be misleading; use precision, recall, and AUC instead.
  • The choice of model and threshold should reflect the economic costs of errors.

Banks use these techniques daily to decide who gets a loan. Understanding the trade‑offs helps make better policy and business decisions.