ECON 465 – Week 9 Lab: Cross-Validation & Model Selection
Author: Gül Ertan Özgüzer
Published: May 8, 2025
Lab Objectives
By the end of this lab, you will be able to:
Understand why cross-validation is better than a single train/test split
Implement k-fold cross-validation for classification and regression
Use cross-validation to compare different models
Interpret cross-validation results using mean performance and standard error
Choose the best model based on more reliable evidence
The Economic Questions
We will use two different datasets in this lab.
Classification: We use the Default dataset to predict whether a borrower defaults on a credit card payment.
Regression: We use the mtcars dataset to predict a car’s fuel consumption, measured by miles per gallon, based on its characteristics.
Cross-validation works in a similar way for both types of problems. The main difference is that we use different evaluation metrics.
For classification, we use metrics such as accuracy, precision, and recall.
For regression, we use metrics such as RMSE and R-squared.
Part 1: Why Cross-Validation?
1.1 The Problem with a Single Train/Test Split
Last week, we split the data once into two parts:
80% training data
20% test data
This is useful, but it has an important limitation.
The test-set performance depends on which 20% of the observations we happened to leave out.
A different random split could give a different accuracy, recall, or RMSE.
Therefore, a single train/test split may give us a performance estimate that is too optimistic or too pessimistic.
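To see this concretely, here is a minimal sketch in base R using the built-in mtcars data (the model mpg ~ wt and the two seeds are arbitrary choices for illustration):

# Two different random 80/20 splits can give noticeably different test RMSE
for (seed in c(1, 2)) {
  set.seed(seed)
  idx  <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
  fit  <- lm(mpg ~ wt, data = mtcars[idx, ])
  pred <- predict(fit, newdata = mtcars[-idx, ])
  print(sqrt(mean((mtcars$mpg[-idx] - pred)^2)))  # test-set RMSE for this split
}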
1.2 The Solution: Cross-Validation
Cross-validation repeats the train/test idea many times.
Instead of using only one test set, it creates several validation sets and averages the results.
This gives a more reliable estimate of how well the model will perform on new data.
1.3 How k-Fold Cross-Validation Works
In k-fold cross-validation:
We divide the data into k equal-sized folds.
We train the model on k - 1 folds.
We evaluate the model on the remaining fold.
We repeat this process k times.
We average the k validation scores.
For example, in 5-fold cross-validation, the data is divided into 5 parts.
Each part is used once as a validation set, and the other 4 parts are used as the training set.
The final result is the average performance across the 5 validation folds.
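To make the procedure concrete, here is a minimal hand-rolled 5-fold CV in base R (using the built-in mtcars data and mpg ~ wt as a stand-in model; later in the lab, tidymodels handles all of this for us):

set.seed(465)  # arbitrary seed, for reproducibility
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(mtcars)))  # assign each row to one of k folds

rmse_per_fold <- sapply(1:k, function(i) {
  train <- mtcars[fold_id != i, ]        # train on the other k - 1 folds
  valid <- mtcars[fold_id == i, ]        # validate on fold i
  fit   <- lm(mpg ~ wt, data = train)
  pred  <- predict(fit, newdata = valid)
  sqrt(mean((valid$mpg - pred)^2))       # RMSE on the held-out fold
})

mean(rmse_per_fold)            # the cross-validated estimate
sd(rmse_per_fold) / sqrt(k)    # its standard error across folds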
1.4 Why Is Cross-Validation Useful?
Cross-validation is useful because:
Every observation is used for validation exactly once.
Every observation is used for training several times.
The result is less dependent on one random split.
It gives both an average performance measure and a measure of variability.
Part 2: Prepare the Data for Classification
We first use the Default dataset from the ISLR package.
The economic question is:
Can we predict whether a borrower will default on a credit card payment?
Dataset: Default
Variable   Description
--------   ---------------------------------------------
default    Whether the customer defaulted: Yes or No
student    Whether the customer is a student: Yes or No
balance    Average credit card balance in USD
income     Annual income in USD
# Load packages
library(tidyverse)
library(ISLR)
library(tidymodels)

# Load the Default dataset
data("Default")

# Prepare the data
Default <- Default |>
  mutate(
    default = factor(default, levels = c("No", "Yes")),
    student = factor(student, levels = c("No", "Yes"))
  )

# Look at the data
glimpse(Default)
Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
# How many customers defaulted?
table(Default$default)
No Yes
9667 333
Only about 3.3% of customers defaulted. This means the dataset is imbalanced.
Most customers did not default.
This is important because accuracy can be misleading in imbalanced datasets.
For example, a model that predicts “No default” for everyone would already have very high accuracy, but it would be useless for identifying actual defaulters.
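We can check this directly (assuming Default is prepared as above):

# Accuracy of a "model" that predicts No for every customer
mean(Default$default == "No")   # about 0.967: 96.7% accuracy, yet zero defaulters caught

The cross-validation code below uses two objects created earlier in the lab: a 5-fold CV object (folds) and a logistic regression model specification (logistic_spec). As a sketch, that setup looks roughly like this, reconstructed from the glossary at the end of the lab (the seed is an assumption):

set.seed(465)                        # arbitrary seed; the original lab's seed may differ
folds <- vfold_cv(Default, v = 5)    # 5-fold cross-validation splits

logistic_spec <- logistic_reg() |>   # logistic regression specification
  set_engine("glm") |>
  set_mode("classification")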
cv_results_logistic <- fit_resamples(   # Workhorse function of CV
  logistic_spec,                        # Model definition (logistic regression)
  default ~ balance + income + student, # Formula (outcome ~ predictors)
  resamples = folds,                    # The 5-fold CV object created earlier
  metrics = metric_set(accuracy, precision, recall)  # What to measure
)

collect_metrics(cv_results_logistic)
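At this point the lab switches to the regression problem with mtcars, and the output below comes from cross-validating a linear model for mpg. A sketch of the setup that output assumes, reconstructed from the glossary and the code in Part 7 (the seed is an assumption, so your numbers may differ slightly):

set.seed(465)                         # arbitrary seed
folds_mpg <- vfold_cv(mtcars, v = 5)  # 5-fold CV splits for the regression data

lm_spec <- linear_reg() |>            # linear regression specification
  set_engine("lm") |>
  set_mode("regression")

cv_results_lm <- fit_resamples(       # cross-validate mpg ~ wt + hp + cyl
  lm_spec,
  mpg ~ wt + hp + cyl,
  resamples = folds_mpg,
  metrics = metric_set(rmse, rsq)
)

collect_metrics(cv_results_lm)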
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 2.66 5 0.318 pre0_mod0_post0
2 rsq standard 0.840 5 0.0218 pre0_mod0_post0
6.4 Regression Metrics
Metric      Meaning                                                     Better Value
---------   ---------------------------------------------------------   ---------------
RMSE        Root Mean Squared Error; average prediction error in mpg    Lower is better
R-squared   Proportion of variation in mpg explained by the model       Higher is better
If RMSE is lower, the model makes smaller prediction errors.
If R-squared is higher, the model explains more variation in fuel consumption.
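For reference, RMSE is the square root of the average squared prediction error, so it is expressed in the same units as the outcome (mpg here). A tiny sketch with made-up numbers:

actual    <- c(21.0, 22.8, 18.7)    # hypothetical observed mpg values
predicted <- c(20.1, 23.5, 19.9)    # hypothetical model predictions
sqrt(mean((actual - predicted)^2))  # RMSE, in mpg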
Part 7: Compare Two Regression Models Using Cross-Validation
Now we compare two regression models.
Model     Formula
-------   --------------------------
Model A   mpg ~ wt + hp + cyl
Model B   mpg ~ wt + hp + cyl + am
The variable am indicates transmission type (0 = automatic, 1 = manual).
# Model A: wt + hp + cyl
cv_modelA <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp + cyl,
  resamples = folds_mpg,
  metrics = metric_set(rmse, rsq)
)

# Model B: wt + hp + cyl + am
cv_modelB <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp + cyl + am,
  resamples = folds_mpg,
  metrics = metric_set(rmse, rsq)
)

# Compare the two models
regression_comparison <- bind_rows(
  collect_metrics(cv_modelA) |> mutate(model = "Model A: wt + hp + cyl"),
  collect_metrics(cv_modelB) |> mutate(model = "Model B: wt + hp + cyl + am")
) |>
  select(model, .metric, mean, std_err)

regression_comparison
# A tibble: 4 × 4
model .metric mean std_err
<chr> <chr> <dbl> <dbl>
1 Model A: wt + hp + cyl rmse 2.66 0.318
2 Model A: wt + hp + cyl rsq 0.840 0.0218
3 Model B: wt + hp + cyl + am rmse 2.58 0.299
4 Model B: wt + hp + cyl + am rsq 0.832 0.0297
7.1 Which Model Is Better?
For RMSE, lower is better.
For R-squared, higher is better.
Here Model B's mean RMSE (2.58) is only about 0.08 lower than Model A's (2.66), while the standard errors are around 0.3, so the improvement is not meaningful. Model B's mean R-squared (0.832) is even slightly lower than Model A's (0.840).
Since the two models perform similarly, we prefer the simpler model, Model A, to avoid unnecessary complexity.
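With the numbers above, the check is quick (an informal rule of thumb, not a formal test):

rmse_A <- 2.66; se_A <- 0.318   # Model A: mean RMSE and its standard error
rmse_B <- 2.58; se_B <- 0.299   # Model B
rmse_A - rmse_B                 # improvement of about 0.08 mpg ...
c(se_A, se_B)                   # ... much smaller than either standard error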
Part 8: Your Turn – Simple Practice
Task 1: Compare Classification Models
Use 5-fold cross-validation to evaluate a logistic regression model with the following formula:
default ~ balance + student
This model excludes income.
Compare its cross-validated recall to:
the full model: default ~ balance + income + student
the balance-only model: default ~ balance
# Your code here
Write your observations here:
Which model catches the most defaulters?
Is the difference meaningful when you consider the standard errors?
Task 2: Interpret Regression Results
Using the mtcars dataset, cross-validate the following two models:
mpg ~ wt
and
mpg ~ wt + hp + cyl + disp + am
Compare their RMSE and R-squared.
# Your code here
Write your observations here:
Does the complex model appear to improve prediction?
Does it seem to overfit?
Which model would you recommend and why?
Summary: What We Learned Today
Concept                  Key Idea
----------------------   ----------------------------------------------------------------------
Cross-validation         Repeated train/validate splits; more reliable than a single split
k-fold CV                Divides the data into k folds and validates each fold once
Comparing models         Use CV means and standard errors to see if differences are meaningful
Standard error           Measures how much performance varies across folds
Regression metrics       RMSE and R-squared
Classification metrics   Accuracy, precision, and recall
fit_resamples()          Main tidymodels function for cross-validation
Take-Home Message
Cross-validation works for many types of predictive models, including classification and regression.
It gives a more honest estimate of performance than a single train/test split.
Use cross-validation to compare different models and choose the one that generalizes best to new data.
Always look at the standard error. A large standard error tells us that model performance is unstable.
Glossary of Functions Used
Function                     What it does
--------------------------   ---------------------------------------------------
vfold_cv(data, v = 5)        Creates k-fold cross-validation splits
logistic_reg()               Defines a logistic regression model specification
linear_reg()                 Defines a linear regression model specification
set_engine("glm")            Chooses glm() as the computational engine
set_engine("lm")             Chooses lm() as the computational engine
set_mode("classification")   Sets the problem type as classification
set_mode("regression")       Sets the problem type as regression
fit_resamples()              Performs cross-validation
metric_set()                 Selects evaluation metrics
collect_metrics()            Extracts cross-validation results into a tidy table