ECON 465 – Week 9 Lab: Cross-Validation & Model Selection

Author

Gül Ertan Özgüzer

Published

May 8, 2025

Lab Objectives

By the end of this lab, you will be able to:

  • Understand why cross-validation is better than a single train/test split
  • Implement k-fold cross-validation for classification and regression
  • Use cross-validation to compare different models
  • Interpret cross-validation results using mean performance and standard error
  • Choose the best model based on more reliable evidence

The Economic Questions

We will use two different datasets in this lab.

  1. Classification: We use the Default dataset to predict whether a borrower defaults on a credit card payment.

  2. Regression: We use the mtcars dataset to predict a car’s fuel efficiency, measured in miles per gallon, based on its characteristics.

Cross-validation works in a similar way for both types of problems. The main difference is that we use different evaluation metrics.

For classification, we use metrics such as accuracy, precision, and recall.

For regression, we use metrics such as RMSE and R-squared.


Part 1: Why Cross-Validation?

1.1 The Problem with a Single Train/Test Split

Last week, we split the data once into two parts:

  • 80% training data
  • 20% test data

This is useful, but it has an important limitation.

The test-set performance depends on which 20% of the observations we happened to leave out.

A different random split could give a different accuracy, recall, or RMSE.

Therefore, a single train/test split may give us a performance estimate that is too optimistic or too pessimistic.

1.2 The Solution: Cross-Validation

Cross-validation repeats the train/test idea many times.

Instead of using only one test set, it creates several validation sets and averages the results.

This gives a more reliable estimate of how well the model will perform on new data.

1.3 How k-Fold Cross-Validation Works

In k-fold cross-validation:

  1. We divide the data into k equal-sized folds.
  2. We train the model on k - 1 folds.
  3. We evaluate the model on the remaining fold.
  4. We repeat this process k times.
  5. We average the k validation scores.

For example, in 5-fold cross-validation, the data is divided into 5 parts.

Each part is used once as a validation set, and the other 4 parts are used as the training set.

The final result is the average performance across the 5 validation folds.
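Before we turn to tidymodels, it helps to see these five steps written out by hand. The sketch below is purely illustrative: it runs 5-fold cross-validation manually in base R, using the built-in mtcars data and a deliberately simple model (mpg ~ wt). In the rest of the lab we let vfold_cv() and fit_resamples() do this work for us.

# Illustrative only: manual 5-fold cross-validation in base R
set.seed(465)

k <- 5
n <- nrow(mtcars)

# Step 1: randomly assign every row to one of k folds
fold_id <- sample(rep(1:k, length.out = n))

fold_rmse <- numeric(k)

for (i in 1:k) {
  train_data <- mtcars[fold_id != i, ]   # Step 2: train on k - 1 folds
  valid_data <- mtcars[fold_id == i, ]   # Step 3: hold out the remaining fold

  fit  <- lm(mpg ~ wt, data = train_data)
  pred <- predict(fit, newdata = valid_data)

  # Validation error (RMSE) for this fold
  fold_rmse[i] <- sqrt(mean((valid_data$mpg - pred)^2))
}                                        # Step 4: repeat k times

fold_rmse        # one validation score per fold
mean(fold_rmse)  # Step 5: average across the k folds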

1.4 Why Is Cross-Validation Useful?

Cross-validation is useful because:

  • Every observation is used for validation exactly once.
  • Every observation is used for training k - 1 times.
  • The result is less dependent on one random split.
  • It gives both an average performance measure and a measure of variability.

Part 2: Prepare the Data for Classification

We first use the Default dataset from the ISLR package.

The economic question is:

Can we predict whether a borrower will default on a credit card payment?

Dataset: Default

Variable Description
default Whether the customer defaulted: Yes or No
student Whether the customer is a student: Yes or No
balance Average credit card balance in USD
income Annual income in USD
# Load packages
library(tidyverse)
library(ISLR)
library(tidymodels)

# Load the Default dataset
data("Default")

# Prepare the data
Default <- Default |>
  mutate(
    default = factor(default, levels = c("No", "Yes")),
    student = factor(student, levels = c("No", "Yes"))
  )

# Look at the data
glimpse(Default)
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
# How many customers defaulted?
table(Default$default)

  No  Yes 
9667  333 

Only about 3.3% of customers defaulted. This means the dataset is imbalanced.

Most customers did not default.

This is important because accuracy can be misleading in imbalanced datasets.

For example, a model that predicts “No default” for everyone would already have very high accuracy, but it would be useless for identifying actual defaulters.
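We can check this baseline directly. The line below is a quick sanity check using the Default data prepared above: the accuracy of a model that always predicts “No” is simply the share of non-defaulters, 9667 / 10,000.

# Accuracy of always predicting "No" = share of non-defaulters
mean(Default$default == "No")   # about 0.967 for this dataset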


Part 3: Cross-Validation for Classification

3.1 Define the Logistic Regression Model

We begin by defining the model specification.

logistic_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")
Command What it does
logistic_reg() Creates a model specification for logistic regression
set_engine("glm") Uses the standard R glm() function as the computational engine
set_mode("classification") Declares that we are predicting a category

3.2 Create Cross-Validation Folds

Now we create 5 folds.

set.seed(465)

folds <- vfold_cv(Default, v = 5)
folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits              id   
  <list>              <chr>
1 <split [8000/2000]> Fold1
2 <split [8000/2000]> Fold2
3 <split [8000/2000]> Fold3
4 <split [8000/2000]> Fold4
5 <split [8000/2000]> Fold5
Command What it does
set.seed(465) Makes the random splits reproducible
vfold_cv(Default, v = 5) Creates a 5-fold cross-validation object

3.3 Evaluate the Full Logistic Regression Model

The full model uses three predictors:

  • balance
  • income
  • student
cv_results_logistic <- fit_resamples(   # Workhorse function of CV
  logistic_spec,                        # Model definition (logistic regression)
  default ~ balance + income + student, # Formula (outcome ~ predictors)
  resamples = folds,                    # The 5‑fold CV object created earlier
  metrics = metric_set(accuracy, precision, recall)  # What to measure
)

collect_metrics(cv_results_logistic)
# A tibble: 3 × 6
  .metric   .estimator  mean     n  std_err .config        
  <chr>     <chr>      <dbl> <int>    <dbl> <chr>          
1 accuracy  binary     0.973     5 0.00122  pre0_mod0_post0
2 precision binary     0.977     5 0.00101  pre0_mod0_post0
3 recall    binary     0.996     5 0.000501 pre0_mod0_post0
Argument What it does
default ~ balance + income + student Predicts default using all three predictors
resamples = folds Uses the 5 folds created above
metrics = metric_set(...) Requests accuracy, precision, and recall

The output shows the cross-validated performance of the full logistic regression model.


Part 4: Compare the Full Model and the Simple Model

Now we ask:

Does adding income and student improve prediction beyond balance alone?

To answer this, we compare two models using the same cross-validation folds.

Model Formula
Full model default ~ balance + income + student
Simple model default ~ balance
# Simple model: only balance
cv_results_balance <- fit_resamples(
  logistic_spec,
  default ~ balance,
  resamples = folds,
  metrics = metric_set(accuracy, precision, recall)
)

# Combine results for comparison
full_metrics <- collect_metrics(cv_results_logistic) |>
  mutate(model = "Full: balance + income + student")

simple_metrics <- collect_metrics(cv_results_balance) |>
  mutate(model = "Simple: balance only")

comparison <- bind_rows(full_metrics, simple_metrics) |>
  filter(.metric %in% c("accuracy", "precision", "recall")) |>
  select(model, .metric, mean, std_err)

comparison
# A tibble: 6 × 4
  model                            .metric    mean  std_err
  <chr>                            <chr>     <dbl>    <dbl>
1 Full: balance + income + student accuracy  0.973 0.00122 
2 Full: balance + income + student precision 0.977 0.00101 
3 Full: balance + income + student recall    0.996 0.000501
4 Simple: balance only             accuracy  0.972 0.00144 
5 Simple: balance only             precision 0.976 0.000979
6 Simple: balance only             recall    0.995 0.00106 

Part 5: Interpreting Cross-Validation Results

5.1 The Meaning of mean and std_err

When you run collect_metrics(), the output includes two important columns:

Column Meaning
mean The average value of the metric across the k folds
std_err The standard error of the metric across the folds

The mean is our best estimate of the model’s true performance.

The std_err tells us how much performance varies across different validation folds.

A smaller standard error means performance is more stable across different data splits.
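If you want to see exactly where these two columns come from, collect_metrics() can return the fold-level values instead of the summary. The sketch below, using cv_results_logistic from Part 3, should reproduce both columns by hand.

# Per-fold metric values: one row per fold and metric
fold_metrics <- collect_metrics(cv_results_logistic, summarize = FALSE)
fold_metrics

# Rebuild the summary by hand:
# mean    = average of the fold values
# std_err = standard deviation of the fold values / sqrt(number of folds)
fold_metrics |>
  group_by(.metric) |>
  summarise(
    mean    = mean(.estimate),
    std_err = sd(.estimate) / sqrt(n())
  )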

5.2 Example Interpretation

Suppose we obtain the following output:

Metric Mean Standard Error
Accuracy 0.972 0.0023
Precision 0.751 0.0450
Recall 0.263 0.0312

This would mean:

  • Accuracy is very stable because the standard error is small.
  • Recall has more variability because the standard error is larger.
  • The model’s ability to catch defaulters changes more across different data splits.

5.3 How to Decide Which Model Is Better

Do not rely only on the mean.

You should also consider the standard error.

A useful rule of thumb is:

If the difference between two model means is large relative to the standard error, the difference is more likely to be meaningful.

We can compare the recall of the full model and the simple model.

# Extract recall means and standard errors
recall_full <- comparison |>
  filter(model == "Full: balance + income + student", .metric == "recall") |>
  pull(mean)

recall_simple <- comparison |>
  filter(model == "Simple: balance only", .metric == "recall") |>
  pull(mean)

se_full <- comparison |>
  filter(model == "Full: balance + income + student", .metric == "recall") |>
  pull(std_err)

se_simple <- comparison |>
  filter(model == "Simple: balance only", .metric == "recall") |>
  pull(std_err)

# Approximate standard error of the difference
se_diff <- sqrt(se_full^2 + se_simple^2)

diff <- recall_full - recall_simple

cat("Difference in recall:", round(diff, 4), "\n")
Difference in recall: 6e-04 
cat("Standard error of difference:", round(se_diff, 4), "\n")
Standard error of difference: 0.0012 
cat("Difference / SE:", round(diff / se_diff, 2), "\n")
Difference / SE: 0.53 

5.4 Interpretation

If the ratio Difference / SE is greater than 2 in absolute value, we have stronger evidence that the models truly differ.

If the ratio is small, the apparent difference may simply be due to random variation across folds.

5.5 Practical Takeaway for Economic Decision-Making

When choosing a model for a bank, we want reliable performance.

Cross-validation gives us two pieces of information:

  1. The average performance of the model.
  2. How much that performance varies across different samples.

A model with slightly lower mean recall but much smaller standard error may be preferable because its performance is more predictable.


Part 6: Cross-Validation for Regression

Now we apply the same idea to a regression problem.

We use the mtcars dataset.

The economic question is:

Can we predict a car’s fuel efficiency based on its characteristics?

The outcome variable is mpg, or miles per gallon.

Because mpg is numerical, this is a regression problem.

6.1 Define the Linear Regression Model

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

6.2 Create Folds for the mtcars Dataset

set.seed(465)

folds_mpg <- vfold_cv(mtcars, v = 5)
folds_mpg
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits         id   
  <list>         <chr>
1 <split [25/7]> Fold1
2 <split [25/7]> Fold2
3 <split [26/6]> Fold3
4 <split [26/6]> Fold4
5 <split [26/6]> Fold5

6.3 Evaluate a Regression Model with Cross-Validation

We estimate a regression model using:

  • wt: weight of the car
  • hp: horsepower
  • cyl: number of cylinders
cv_results_lm <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp + cyl,
  resamples = folds_mpg,
  metrics = metric_set(rmse, rsq)
)

collect_metrics(cv_results_lm)
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   2.66      5  0.318  pre0_mod0_post0
2 rsq     standard   0.840     5  0.0218 pre0_mod0_post0

6.4 Regression Metrics

Metric Meaning Better Value
RMSE Root Mean Squared Error; average prediction error in mpg Lower is better
R-squared Proportion of variation in mpg explained by the model Higher is better

If RMSE is lower, the model makes smaller prediction errors.

If R-squared is higher, the model explains more of the variation in mpg.
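To make these definitions concrete, the sketch below computes both metrics by hand for a single linear model fit on the full mtcars data. This is an in-sample calculation, so the numbers will look somewhat better than the cross-validated ones above; it is only meant to show what RMSE and R-squared measure.

# Illustrative only: RMSE and R-squared computed by hand (in-sample)
fit  <- lm(mpg ~ wt + hp + cyl, data = mtcars)
pred <- predict(fit)

# RMSE: square root of the average squared prediction error, in mpg units
sqrt(mean((mtcars$mpg - pred)^2))

# R-squared: squared correlation between actual and predicted mpg
cor(mtcars$mpg, pred)^2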


Part 7: Compare Two Regression Models Using Cross-Validation

Now we compare two regression models.

Model Formula
Model A mpg ~ wt + hp + cyl
Model B mpg ~ wt + hp + cyl + am

The variable am indicates transmission type (0 = automatic, 1 = manual).

# Model A: wt + hp + cyl
cv_modelA <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp + cyl,
  resamples = folds_mpg,
  metrics = metric_set(rmse, rsq)
)

# Model B: wt + hp + cyl + am
cv_modelB <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp + cyl + am,
  resamples = folds_mpg,
  metrics = metric_set(rmse, rsq)
)

# Compare the two models
regression_comparison <- bind_rows(
  collect_metrics(cv_modelA) |> mutate(model = "Model A: wt + hp + cyl"),
  collect_metrics(cv_modelB) |> mutate(model = "Model B: wt + hp + cyl + am")
) |>
  select(model, .metric, mean, std_err)

regression_comparison
# A tibble: 4 × 4
  model                       .metric  mean std_err
  <chr>                       <chr>   <dbl>   <dbl>
1 Model A: wt + hp + cyl      rmse    2.66   0.318 
2 Model A: wt + hp + cyl      rsq     0.840  0.0218
3 Model B: wt + hp + cyl + am rmse    2.58   0.299 
4 Model B: wt + hp + cyl + am rsq     0.832  0.0297

7.1 Which Model Is Better?

For RMSE, lower is better.

For R-squared, higher is better.

If Model B has only a slightly lower RMSE but a larger standard error, the improvement may not be meaningful.

If the two models perform similarly, we may prefer the simpler model, Model A, to avoid unnecessary complexity.
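We can apply the same rough check as in Section 5.3 to the RMSE difference between the two models. The sketch below reuses the regression_comparison table created above; as before, the standard error of the difference is only an approximation.

# Extract cross-validated RMSE for each model
rmse_A <- regression_comparison |>
  filter(model == "Model A: wt + hp + cyl", .metric == "rmse")

rmse_B <- regression_comparison |>
  filter(model == "Model B: wt + hp + cyl + am", .metric == "rmse")

diff_rmse <- rmse_A$mean - rmse_B$mean

# Approximate standard error of the difference
se_diff <- sqrt(rmse_A$std_err^2 + rmse_B$std_err^2)

cat("Difference in RMSE:", round(diff_rmse, 3), "\n")
cat("Difference / SE:", round(diff_rmse / se_diff, 2), "\n")

With the values reported above (RMSE of 2.66 versus 2.58, with standard errors around 0.3), this ratio is far below 2, so the improvement from adding am cannot be distinguished from fold-to-fold noise.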


Part 8: Your Turn – Simple Practice

Task 1: Compare Classification Models

Use 5-fold cross-validation to evaluate a logistic regression model with the following formula:

default ~ balance + student

This model excludes income.

Compare its cross-validated recall to:

  • the full model: default ~ balance + income + student
  • the balance-only model: default ~ balance
# Your code here

Write your observations here:

Which model catches the most defaulters?

Is the difference meaningful when you consider the standard errors?

Task 2: Interpret Regression Results

Using the mtcars dataset, cross-validate the following two models:

mpg ~ wt

and

mpg ~ wt + hp + cyl + disp + am

Compare their RMSE and R-squared.

# Your code here

Write your observations here:

Does the complex model appear to improve prediction?

Does it seem to overfit?

Which model would you recommend and why?


Summary: What We Learned Today

Concept Key Idea
Cross-validation Repeated train/validate splits; more reliable than a single split
k-fold CV Divides the data into k folds and validates each fold once
Comparing models Use CV means and standard errors to see if differences are meaningful
Standard error Measures how much performance varies across folds
Regression metrics RMSE and R-squared
Classification metrics Accuracy, precision, and recall
fit_resamples() Main tidymodels function for cross-validation

Take-Home Message

Cross-validation works for many types of predictive models, including classification and regression.

It gives a more honest estimate of performance than a single train/test split.

Use cross-validation to compare different models and choose the one that generalizes best to new data.

Always look at the standard error. A large standard error tells us that model performance is unstable.


Glossary of Functions Used

Function What it does
vfold_cv(data, v = 5) Creates k-fold cross-validation splits
logistic_reg() Defines a logistic regression model specification
linear_reg() Defines a linear regression model specification
set_engine("glm") Chooses glm() as the computational engine
set_engine("lm") Chooses lm() as the computational engine
set_mode("classification") Sets the problem type as classification
set_mode("regression") Sets the problem type as regression
fit_resamples() Performs cross-validation
metric_set() Selects evaluation metrics
collect_metrics() Extracts cross-validation results into a tidy table