ECON 465 – Week 9 Lab: Cross-Validation & Model Selection

Author

Gül Ertan Özgüzer

Published

May 8, 2025

Lab Objectives

By the end of this lab, you will be able to:

  • Understand why cross-validation is better than a single train/test split
  • Implement k-fold cross-validation for classification and regression
  • Use cross-validation to compare different models
  • Interpret cross-validation results using mean performance and standard error
  • Choose the best model based on more reliable evidence

The Economic Questions

We will use two different datasets in this lab.

  1. Classification: We use the Default dataset to predict whether a borrower defaults on a credit card payment.

  2. Regression: We use the mtcars dataset to predict a car’s fuel efficiency, measured in miles per gallon, based on its characteristics.

Cross-validation works in a similar way for both types of problems. The main difference is that we use different evaluation metrics.

For classification, we use metrics such as accuracy, precision, and recall.

For regression, we use metrics such as RMSE and R-squared.


Part 1: Why Cross-Validation?

1.1 The Problem with a Single Train/Test Split

Last week, we split the data once into two parts:

  • 80% training data
  • 20% test data

This is useful, but it has an important limitation.

The test-set performance depends on which 20% of the observations we happened to leave out.

A different random split could give a different accuracy, recall, or RMSE.

Therefore, a single train/test split may give us a performance estimate that is too optimistic or too pessimistic.

1.2 The Solution: Cross-Validation

Cross-validation repeats the train/test idea many times.

Instead of using only one test set, it creates several validation sets and averages the results.

This gives a more reliable estimate of how well the model will perform on new data.

1.3 How k-Fold Cross-Validation Works

In k-fold cross-validation:

  1. We divide the data into k equal-sized folds.
  2. We train the model on k - 1 folds.
  3. We evaluate the model on the remaining fold.
  4. We repeat this process k times.
  5. We average the k validation scores.

For example, in 5-fold cross-validation, the data is divided into 5 parts.

Each part is used once as a validation set, and the other 4 parts are used as the training set.

The final result is the average performance across the 5 validation folds.
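Before we turn to tidymodels, it helps to see these five steps written out by hand. The sketch below is purely illustrative: it runs 5-fold cross-validation manually in base R, using the built-in mtcars data and a deliberately simple model (mpg ~ wt). In the rest of the lab we let vfold_cv() and fit_resamples() do this work for us.

# Illustrative only: manual 5-fold cross-validation in base R
set.seed(465)

k <- 5
n <- nrow(mtcars)

# Step 1: randomly assign every row to one of k folds
fold_id <- sample(rep(1:k, length.out = n))

fold_rmse <- numeric(k)

for (i in 1:k) {
  train_data <- mtcars[fold_id != i, ]   # Step 2: train on k - 1 folds
  valid_data <- mtcars[fold_id == i, ]   # Step 3: hold out the remaining fold

  fit  <- lm(mpg ~ wt, data = train_data)
  pred <- predict(fit, newdata = valid_data)

  # Validation error (RMSE) for this fold
  fold_rmse[i] <- sqrt(mean((valid_data$mpg - pred)^2))
}                                        # Step 4: repeat k times

fold_rmse        # one validation score per fold
mean(fold_rmse)  # Step 5: average across the k folds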

1.4 Why Is Cross-Validation Useful?

Cross-validation is useful because:

  • Every observation is used for validation exactly once.
  • Every observation is used for training k - 1 times.
  • The result is less dependent on one random split.
  • It gives both an average performance measure and a measure of variability.

Part 2: Prepare the Data for Classification

We first use the Default dataset from the ISLR package.

The economic question is:

Can we predict whether a borrower will default on a credit card payment?

Dataset: Default

Variable Description
default Whether the customer defaulted: Yes or No
student Whether the customer is a student: Yes or No
balance Average credit card balance in USD
income Annual income in USD
# Load packages
library(tidyverse)
library(ISLR)
library(tidymodels)

# Load the Default dataset
data("Default")

# Prepare the data
Default <- Default |>
  mutate(
    default = factor(default, levels = c("No", "Yes")),
    student = factor(student, levels = c("No", "Yes"))
  )

# Look at the data
glimpse(Default)
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
# How many customers defaulted?
table(Default$default)

  No  Yes 
9667  333 

Only about 3.3% of customers defaulted. This means the dataset is imbalanced.

Most customers did not default.

This is important because accuracy can be misleading in imbalanced datasets.

For example, a model that predicts “No default” for everyone would already have very high accuracy, but it would be useless for identifying actual defaulters.
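We can check this baseline directly. The line below is a quick sanity check using the Default data prepared above: the accuracy of a model that always predicts “No” is simply the share of non-defaulters, 9667 / 10,000.

# Accuracy of always predicting "No" = share of non-defaulters
mean(Default$default == "No")   # about 0.967 for this dataset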


Part 3: Cross-Validation for Classification

3.1 Define the Logistic Regression Model

We begin by defining the model specification.

logistic_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")
Command What it does
logistic_reg() Creates a model specification for logistic regression
set_engine("glm") Uses the standard R glm() function as the computational engine
set_mode("classification") Declares that we are predicting a category

3.2 Create Cross-Validation Folds

Now we create 5 folds.

set.seed(465)

folds <- vfold_cv(Default, v = 5)
folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits              id   
  <list>              <chr>
1 <split [8000/2000]> Fold1
2 <split [8000/2000]> Fold2
3 <split [8000/2000]> Fold3
4 <split [8000/2000]> Fold4
5 <split [8000/2000]> Fold5
Command What it does
set.seed(465) Makes the random splits reproducible
vfold_cv(Default, v = 5) Creates a 5-fold cross-validation object

3.3 Evaluate the Full Logistic Regression Model

The full model uses three predictors:

  • balance
  • income
  • student
cv_results_logistic <- fit_resamples(   # Workhorse function of CV
  logistic_spec,                        # Model definition (logistic regression)
  default ~ balance + income + student, # Formula (outcome ~ predictors)
  resamples = folds,                    # The 5‑fold CV object created earlier
  metrics = metric_set(accuracy, precision, recall)  # What to measure
)

collect_metrics(cv_results_logistic)
# A tibble: 3 × 6
  .metric   .estimator  mean     n  std_err .config        
  <chr>     <chr>      <dbl> <int>    <dbl> <chr>          
1 accuracy  binary     0.973     5 0.00122  pre0_mod0_post0
2 precision binary     0.977     5 0.00101  pre0_mod0_post0
3 recall    binary     0.996     5 0.000501 pre0_mod0_post0
Argument What it does
default ~ balance + income + student Predicts default using all three predictors
resamples = folds Uses the 5 folds created above
metrics = metric_set(...) Requests accuracy, precision, and recall

The output shows the cross-validated performance of the full logistic regression model.


Part 4: Compare the Full Model and the Simple Model

Now we ask:

Does adding income and student improve prediction beyond balance alone?

To answer this, we compare two models using the same cross-validation folds.

Model Formula
Full model default ~ balance + income + student
Simple model default ~ balance
# Simple model: only balance
cv_results_balance <- fit_resamples(
  logistic_spec,
  default ~ balance,
  resamples = folds,
  metrics = metric_set(accuracy, precision, recall)
)

# Combine results for comparison
full_metrics <- collect_metrics(cv_results_logistic) |>
  mutate(model = "Full: balance + income + student")

simple_metrics <- collect_metrics(cv_results_balance) |>
  mutate(model = "Simple: balance only")

comparison <- bind_rows(full_metrics, simple_metrics) |>
  filter(.metric %in% c("accuracy", "precision", "recall")) |>
  select(model, .metric, mean, std_err)

comparison
# A tibble: 6 × 4
  model                            .metric    mean  std_err
  <chr>                            <chr>     <dbl>    <dbl>
1 Full: balance + income + student accuracy  0.973 0.00122 
2 Full: balance + income + student precision 0.977 0.00101 
3 Full: balance + income + student recall    0.996 0.000501
4 Simple: balance only             accuracy  0.972 0.00144 
5 Simple: balance only             precision 0.976 0.000979
6 Simple: balance only             recall    0.995 0.00106 

Part 5: Interpreting Cross-Validation Results

5.1 The Meaning of mean and std_err

When you run collect_metrics(), the output includes two important columns:

Column Meaning
mean The average value of the metric across the k folds
std_err The standard error of the metric across the folds

The mean is our best estimate of the model’s true performance.

The std_err tells us how much performance varies across different validation folds.

A smaller standard error means performance is more stable across different data splits.
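If you want to see exactly where these two columns come from, collect_metrics() can return the fold-level values instead of the summary. The sketch below, using cv_results_logistic from Part 3, should reproduce both columns by hand.

# Per-fold metric values: one row per fold and metric
fold_metrics <- collect_metrics(cv_results_logistic, summarize = FALSE)
fold_metrics

# Rebuild the summary by hand:
# mean    = average of the fold values
# std_err = standard deviation of the fold values / sqrt(number of folds)
fold_metrics |>
  group_by(.metric) |>
  summarise(
    mean    = mean(.estimate),
    std_err = sd(.estimate) / sqrt(n())
  )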

5.2 Example Interpretation

Suppose we obtain the following output:

Metric Mean Standard Error
Accuracy 0.972 0.0023
Precision 0.751 0.0450
Recall 0.263 0.0312

This would mean:

  • Accuracy is very stable because the standard error is small.
  • Recall has more variability because the standard error is larger.
  • The model’s ability to catch defaulters changes more across different data splits.

5.3 How to Decide Which Model Is Better

Do not rely only on the mean.

You should also consider the standard error.

A useful rule of thumb is:

If the difference between two model means is large relative to the standard error, the difference is more likely to be meaningful.

We can compare the recall of the full model and the simple model.

# Extract recall means and standard errors
recall_full <- comparison |>
  filter(model == "Full: balance + income + student", .metric == "recall") |>
  pull(mean)

recall_simple <- comparison |>
  filter(model == "Simple: balance only", .metric == "recall") |>
  pull(mean)

se_full <- comparison |>
  filter(model == "Full: balance + income + student", .metric == "recall") |>
  pull(std_err)

se_simple <- comparison |>
  filter(model == "Simple: balance only", .metric == "recall") |>
  pull(std_err)

# Approximate standard error of the difference
se_diff <- sqrt(se_full^2 + se_simple^2)

diff <- recall_full - recall_simple

cat("Difference in recall:", round(diff, 4), "\n")
Difference in recall: 6e-04 
cat("Standard error of difference:", round(se_diff, 4), "\n")
Standard error of difference: 0.0012 
cat("Difference / SE:", round(diff / se_diff, 2), "\n")
Difference / SE: 0.53 

5.4 Interpretation

If the ratio Difference / SE is greater than 2 in absolute value, we have stronger evidence that the models truly differ.

If the ratio is small, the apparent difference may simply be due to random variation across folds.

5.5 Practical Takeaway for Economic Decision-Making

When choosing a model for a bank, we want reliable performance.

Cross-validation gives us two pieces of information:

  1. The average performance of the model.
  2. How much that performance varies across different samples.

A model with slightly lower mean recall but much smaller standard error may be preferable because its performance is more predictable.


Part 6: Cross-Validation for Regression

Now we apply the same idea to a regression problem.

We use the mtcars dataset.

The economic question is:

Can we predict a car’s fuel efficiency based on its characteristics?

The outcome variable is mpg, or miles per gallon.

Because mpg is numerical, this is a regression problem.

6.1 Define the Linear Regression Model

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

6.2 Create Folds for the mtcars Dataset

set.seed(465)

folds_mpg <- vfold_cv(mtcars, v = 5)
folds_mpg
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits         id   
  <list>         <chr>
1 <split [25/7]> Fold1
2 <split [25/7]> Fold2
3 <split [26/6]> Fold3
4 <split [26/6]> Fold4
5 <split [26/6]> Fold5

6.3 Evaluate a Regression Model with Cross-Validation

We estimate a regression model using:

  • wt: weight of the car
  • hp: horsepower
  • cyl: number of cylinders
cv_results_lm <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp + cyl,
  resamples = folds_mpg,
  metrics = metric_set(rmse, rsq)
)

collect_metrics(cv_results_lm)
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   2.66      5  0.318  pre0_mod0_post0
2 rsq     standard   0.840     5  0.0218 pre0_mod0_post0

6.4 Regression Metrics

Metric Meaning Better Value
RMSE Root Mean Squared Error; average prediction error in mpg Lower is better
R-squared Proportion of variation in mpg explained by the model Higher is better

If RMSE is lower, the model makes smaller prediction errors.

If R-squared is higher, the model explains more of the variation in mpg.
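To make these definitions concrete, the sketch below computes both metrics by hand for a single linear model fit on the full mtcars data. This is an in-sample calculation, so the numbers will look somewhat better than the cross-validated ones above; it is only meant to show what RMSE and R-squared measure.

# Illustrative only: RMSE and R-squared computed by hand (in-sample)
fit  <- lm(mpg ~ wt + hp + cyl, data = mtcars)
pred <- predict(fit)

# RMSE: square root of the average squared prediction error, in mpg units
sqrt(mean((mtcars$mpg - pred)^2))

# R-squared: squared correlation between actual and predicted mpg
cor(mtcars$mpg, pred)^2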


Part 7: Compare Two Regression Models Using Cross-Validation

Now we compare two regression models.

Model Formula
Model A mpg ~ wt + hp + cyl
Model B mpg ~ wt + hp + cyl + am

The variable am indicates transmission type (0 = automatic, 1 = manual).

# Model A: wt + hp + cyl
cv_modelA <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp + cyl,
  resamples = folds_mpg,
  metrics = metric_set(rmse, rsq)
)

# Model B: wt + hp + cyl + am
cv_modelB <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp + cyl + am,
  resamples = folds_mpg,
  metrics = metric_set(rmse, rsq)
)

# Compare the two models
regression_comparison <- bind_rows(
  collect_metrics(cv_modelA) |> mutate(model = "Model A: wt + hp + cyl"),
  collect_metrics(cv_modelB) |> mutate(model = "Model B: wt + hp + cyl + am")
) |>
  select(model, .metric, mean, std_err)

regression_comparison
# A tibble: 4 × 4
  model                       .metric  mean std_err
  <chr>                       <chr>   <dbl>   <dbl>
1 Model A: wt + hp + cyl      rmse    2.66   0.318 
2 Model A: wt + hp + cyl      rsq     0.840  0.0218
3 Model B: wt + hp + cyl + am rmse    2.58   0.299 
4 Model B: wt + hp + cyl + am rsq     0.832  0.0297

7.1 Which Model Is Better?

For RMSE, lower is better.

For R-squared, higher is better.

If Model B has only a slightly lower RMSE but a larger standard error, the improvement may not be meaningful.

If the two models perform similarly, we may prefer the simpler model, Model A, to avoid unnecessary complexity.
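We can apply the same rough check as in Section 5.3 to the RMSE difference between the two models. The sketch below reuses the regression_comparison table created above; as before, the standard error of the difference is only an approximation.

# Extract cross-validated RMSE for each model
rmse_A <- regression_comparison |>
  filter(model == "Model A: wt + hp + cyl", .metric == "rmse")

rmse_B <- regression_comparison |>
  filter(model == "Model B: wt + hp + cyl + am", .metric == "rmse")

diff_rmse <- rmse_A$mean - rmse_B$mean

# Approximate standard error of the difference
se_diff <- sqrt(rmse_A$std_err^2 + rmse_B$std_err^2)

cat("Difference in RMSE:", round(diff_rmse, 3), "\n")
cat("Difference / SE:", round(diff_rmse / se_diff, 2), "\n")

With the values reported above (RMSE of 2.66 versus 2.58, with standard errors around 0.3), this ratio is far below 2, so the improvement from adding am cannot be distinguished from fold-to-fold noise.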


Part 8: Your Turn – Simple Practice

Task 1: Compare Classification Models

Use 5-fold cross-validation to evaluate a logistic regression model with the following formula:

default ~ balance + student

This model excludes income.

Compare its cross-validated recall to:

  • the full model: default ~ balance + income + student
  • the balance-only model: default ~ balance
# Your code here

Write your observations here:

Which model catches the most defaulters?

Is the difference meaningful when you consider the standard errors?

Task 2: Interpret Regression Results

Using the mtcars dataset, cross-validate the following two models:

mpg ~ wt

and

mpg ~ wt + hp + cyl + disp + am

Compare their RMSE and R-squared.

# Your code here

Write your observations here:

Does the complex model appear to improve prediction?

Does it seem to overfit?

Which model would you recommend and why?


Summary: What We Learned Today

Concept Key Idea
Cross-validation Repeated train/validate splits; more reliable than a single split
k-fold CV Divides the data into k folds and validates each fold once
Comparing models Use CV means and standard errors to see if differences are meaningful
Standard error Measures how much performance varies across folds
Regression metrics RMSE and R-squared
Classification metrics Accuracy, precision, and recall
fit_resamples() Main tidymodels function for cross-validation

Take-Home Message

Cross-validation works for many types of predictive models, including classification and regression.

It gives a more honest estimate of performance than a single train/test split.

Use cross-validation to compare different models and choose the one that generalizes best to new data.

Always look at the standard error. A large standard error tells us that model performance is unstable.


Glossary of Functions Used

Function What it does
vfold_cv(data, v = 5) Creates k-fold cross-validation splits
logistic_reg() Defines a logistic regression model specification
linear_reg() Defines a linear regression model specification
set_engine("glm") Chooses glm() as the computational engine
set_engine("lm") Chooses lm() as the computational engine
set_mode("classification") Sets the problem type as classification
set_mode("regression") Sets the problem type as regression
fit_resamples() Performs cross-validation
metric_set() Selects evaluation metrics
collect_metrics() Extracts cross-validation results into a tidy table