In multiple regression, including too many predictors can lead to unstable estimates (multicollinearity) and overfitting. Variable selection aims to find a subset of predictors that provides the best balance between model fit and simplicity (parsimony), often measured using criteria such as Adjusted \(R^2\), AIC, or Mallows' \(C_p\).
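For reference, the three criteria used throughout this lecture can be written as follows (standard definitions, with \(n\) observations, \(p\) predictors, and RSS the residual sum of squares; the AIC form shown is the Gaussian version, up to an additive constant):
\[
\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}, \qquad
\mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2(p + 1), \qquad
C_p = \frac{\mathrm{RSS}}{\hat{\sigma}^2} - n + 2(p + 1),
\]
where \(\hat{\sigma}^2\) is the residual variance estimated from the full model. Smaller AIC and \(C_p\) are better; a larger Adjusted \(R^2\) is better.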
In this lecture, we use the Ames Housing dataset to demonstrate three
classical variable selection methods (Forward, Backward, Stepwise) and
the more comprehensive Best Subset Selection. Our response variable
remains the log of the Sale Price (log_Sale_Price).
We load the necessary libraries, including leaps for
subset selection, and prepare our data by selecting a broader set of
potential predictors and handling missing values.
# Load Necessary Packages
library(tidyverse)
library(modeldata)
library(leaps) # For Best Subset Selection
# Load the Ames Housing data
data(ames)
# Select a broad set of key predictors and handle missing values
full_data <- ames %>%
drop_na() %>%
select(Gr_Liv_Area, Year_Built, Garage_Area, Total_Bsmt_SF, Garage_Cars, Year_Remod_Add,
First_Flr_SF, Full_Bath, Garage_Type, Fireplaces, Second_Flr_SF, Longitude, Foundation,
Lot_Area, Latitude, Heating_QC, TotRms_AbvGrd, Open_Porch_SF, Mas_Vnr_Area, Overall_Cond,
Sale_Price) %>%
mutate(log_Sale_Price = log(Sale_Price)) %>%
select(-Sale_Price)
cat("Predictors available for selection:\n")
Predictors available for selection:
print(names(full_data)[-which(names(full_data) == "log_Sale_Price")])
[1] "Gr_Liv_Area" "Year_Built" "Garage_Area" "Total_Bsmt_SF" "Garage_Cars"
[6] "Year_Remod_Add" "First_Flr_SF" "Full_Bath" "Garage_Type" "Fireplaces"
[11] "Second_Flr_SF" "Longitude" "Foundation" "Lot_Area" "Latitude"
[16] "Heating_QC" "TotRms_AbvGrd" "Open_Porch_SF" "Mas_Vnr_Area" "Overall_Cond"
cat("\nNumber of observations remaining:", nrow(full_data), "\n\n")
Number of observations remaining: 2930
# Define the Full Model (The upper boundary for selection methods)
full_model <- lm(log_Sale_Price ~ ., data = full_data)
# FULL MODEL SUMMARY
summary(full_model)
Call:
lm(formula = log_Sale_Price ~ ., data = full_data)
Residuals:
Min 1Q Median 3Q Max
-2.55316 -0.07719 -0.00017 0.08183 0.66645
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.398e+01 1.548e+01 -4.134 3.67e-05 ***
Gr_Liv_Area 1.962e-04 6.714e-05 2.922 0.003500 **
Year_Built 3.078e-03 2.345e-04 13.125 < 2e-16 ***
Garage_Area 9.995e-05 3.299e-05 3.030 0.002470 **
Total_Bsmt_SF 1.279e-04 1.463e-05 8.745 < 2e-16 ***
Garage_Cars 6.034e-02 1.000e-02 6.033 1.82e-09 ***
Year_Remod_Add 1.301e-03 2.361e-04 5.512 3.87e-08 ***
First_Flr_SF 1.105e-04 6.868e-05 1.608 0.107889
Full_Bath 2.588e-03 8.180e-03 0.316 0.751765
Garage_TypeBasment -5.109e-02 2.850e-02 -1.793 0.073109 .
Garage_TypeBuiltIn -2.182e-02 1.394e-02 -1.565 0.117788
Garage_TypeCarPort -1.669e-01 4.294e-02 -3.887 0.000104 ***
Garage_TypeDetchd -5.378e-02 9.067e-03 -5.931 3.36e-09 ***
Garage_TypeMore_Than_Two_Types -1.229e-01 3.569e-02 -3.442 0.000585 ***
Garage_TypeNo_Garage -5.289e-02 1.800e-02 -2.938 0.003328 **
Fireplaces 7.279e-02 5.629e-03 12.930 < 2e-16 ***
Second_Flr_SF 6.994e-05 6.721e-05 1.041 0.298164
Longitude -3.094e-01 1.353e-01 -2.287 0.022286 *
FoundationCBlock -2.609e-02 1.275e-02 -2.046 0.040809 *
FoundationPConc 2.620e-02 1.517e-02 1.727 0.084351 .
FoundationSlab -1.167e-01 3.089e-02 -3.779 0.000160 ***
FoundationStone 7.776e-03 5.041e-02 0.154 0.877419
FoundationWood -7.088e-03 7.559e-02 -0.094 0.925296
Lot_Area 2.095e-06 4.238e-07 4.945 8.06e-07 ***
Latitude 8.845e-01 1.782e-01 4.964 7.29e-07 ***
Heating_QCFair -5.770e-02 1.894e-02 -3.047 0.002335 **
Heating_QCGood -3.004e-02 9.373e-03 -3.205 0.001364 **
Heating_QCPoor -7.323e-01 9.510e-02 -7.701 1.84e-14 ***
Heating_QCTypical -5.630e-02 8.817e-03 -6.386 1.98e-10 ***
TotRms_AbvGrd -1.885e-03 3.383e-03 -0.557 0.577387
Open_Porch_SF 1.048e-05 4.881e-05 0.215 0.830000
Mas_Vnr_Area 6.314e-05 1.992e-05 3.170 0.001542 **
Overall_CondPoor 2.849e-02 8.083e-02 0.352 0.724565
Overall_CondFair 1.719e-01 6.620e-02 2.596 0.009475 **
Overall_CondBelow_Average 3.258e-01 6.436e-02 5.063 4.39e-07 ***
Overall_CondAverage 4.309e-01 6.290e-02 6.850 8.98e-12 ***
Overall_CondAbove_Average 4.757e-01 6.289e-02 7.565 5.19e-14 ***
Overall_CondGood 5.350e-01 6.304e-02 8.487 < 2e-16 ***
Overall_CondVery_Good 5.696e-01 6.420e-02 8.872 < 2e-16 ***
Overall_CondExcellent 6.544e-01 6.794e-02 9.633 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1627 on 2890 degrees of freedom
Multiple R-squared: 0.8427, Adjusted R-squared: 0.8406
F-statistic: 397 on 39 and 2890 DF, p-value: < 2.2e-16
Best Subset Selection is the most rigorous approach. It fits all possible models (subsets) for each number of predictors (\(p\)) and identifies the single best model of each size based on criteria such as Mallows' \(C_p\) or Adjusted \(R^2\). The regsubsets() function performs this exhaustive search.
# We use regsubsets to find the best model for 1 predictor up to 10 predictors.
best_subset <- regsubsets(log_Sale_Price ~ .,
data = full_data,
nvmax = 10, # Look for models up to 10 variables
method = "exhaustive")
Warning: 1 linear dependencies found
# Plot 1: Mallows' Cp Criterion
# The ideal model minimizes Cp and has Cp approximately equal to p (the number of parameters).
plot(best_subset, scale = "Cp", main = "Mallows' Cp for Best Subset Selection")
# Identify the model that minimizes Mallows' Cp
best_model_summary <- summary(best_subset)
best_model_idx <- which.min(best_model_summary$cp) # Finds the row index (model size) with the smallest Cp
best_model_vars <- names(best_model_summary$which[best_model_idx,])[best_model_summary$which[best_model_idx,]]
best_model_vars <- best_model_vars[best_model_vars != "(Intercept)"]
cat(paste("\n--- BEST SUBSET SELECTION (Cp Criterion) ---\n"))
--- BEST SUBSET SELECTION (Cp Criterion) ---
cat(paste("Optimal Model Size (Min Cp):", best_model_idx, "Predictors\n"))
Optimal Model Size (Min Cp): 10 Predictors
cat("Selected Variables:", paste(best_model_vars, collapse = ", "), "\n")
Selected Variables: Gr_Liv_Area, Year_Built, Total_Bsmt_SF, Garage_Cars, Year_Remod_Add, Fireplaces, Overall_CondPoor, Overall_CondFair, Overall_CondBelow_Average, Overall_CondAverage
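Although not shown in the output above, the coefficients of the \(C_p\)-optimal model can be pulled straight from the regsubsets object with coef(), a leaps method, without refitting via lm(). Note that some of the selected columns are individual dummy levels of a factor such as Overall_Cond rather than whole variables.
# Extract the coefficients of the model with the smallest Cp directly
# from the regsubsets search (some columns are dummy-coded factor levels)
coef(best_subset, best_model_idx)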
Mallows' \(C_p\) balances model fit against model complexity, navigating the bias-variance trade-off. The call
plot(best_subset, scale = "Cp", main = "Mallows' Cp for Best Subset Selection")
produces the first, and often the most important, diagnostic plot for Best Subset Selection, based on Mallows' \(C_p\) statistic. Here is how to interpret this visualization:
What is \(C_p\)? Mallows’ \(C_p\) estimates the mean squared error of prediction for the model. It works by balancing two things: the model’s goodness-of-fit (its \(R^2\)) and the number of predictors (\(p\)) it uses. It is a penalized measure of error, aiming to find the sweet spot between bias (underfitting) and variance (overfitting).
The Goal: The primary objective when viewing this plot is to find the model size (\(p\)) that minimizes the \(C_p\) statistic. A smaller \(C_p\) suggests a more balanced model.
The \(C_p \approx p\) Rule: For a model to be considered a good fit, Mallows argued that the \(C_p\) value should be approximately equal to the number of parameters in the model (\(p\)). Since the number of parameters \(p\) is the number of predictors plus the intercept, you look for a row where the \(C_p\) value is close to its vertical position on the plot.
Therefore, when examining the Mallows’ \(C_p\) plot, you are looking for the point that is closest to the diagonal line \(C_p = p\), or simply the point that hits the minimum value. This will indicate the most optimal, parsimonious model size.
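As a complementary check (a sketch of our own, not part of the leaps output above), the \(C_p\) values stored in best_model_summary can be plotted directly against model size, with the \(C_p = p\) reference line drawn explicitly:
# Plot Cp against the number of predictors and add the Cp = p reference line
# (p counts the intercept, hence the line Cp = predictors + 1)
model_sizes <- seq_along(best_model_summary$cp)
plot(model_sizes, best_model_summary$cp,
     type = "b", pch = 19,
     xlab = "Number of Predictors", ylab = "Mallows' Cp",
     main = "Cp vs. Model Size")
abline(a = 1, b = 1, lty = 2)                       # the Cp = p line
points(which.min(best_model_summary$cp),
       min(best_model_summary$cp), col = "red", pch = 19)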
# Plot 2: Adjusted R-squared Criterion
# The ideal model maximizes Adjusted R-squared.
plot(best_subset, scale = "adjr2", main = "Adjusted R-squared for Best Subset Selection")
# We use the which.max() function on the adjr2 component of the summary object.
adjr2_model_idx <- which.max(best_model_summary$adjr2)
adjr2_max_value <- max(best_model_summary$adjr2)
adjr2_model_vars <- names(best_model_summary$which[adjr2_model_idx,])[best_model_summary$which[adjr2_model_idx,]]
adjr2_model_vars <- adjr2_model_vars[adjr2_model_vars != "(Intercept)"]
cat(paste("\n--- BEST SUBSET SELECTION (Adjusted R-squared Criterion) ---\n"))
--- BEST SUBSET SELECTION (Adjusted R-squared Criterion) ---
cat(paste("Maximum Adjusted R-squared:", round(adjr2_max_value, 4), "\n"))
Maximum Adjusted R-squared: 0.8185
cat(paste("Optimal Model Size (Max Adj R2):", adjr2_model_idx, "Predictors\n"))
Optimal Model Size (Max Adj R2): 10 Predictors
cat("Selected Variables:", paste(adjr2_model_vars, collapse = ", "), "\n")
Selected Variables: Gr_Liv_Area, Year_Built, Total_Bsmt_SF, Garage_Cars, Year_Remod_Add, Fireplaces, Overall_CondPoor, Overall_CondFair, Overall_CondBelow_Average, Overall_CondAverage
The command
plot(best_subset, scale = "adjr2", main = "Adjusted R-squared for Best Subset Selection")
generates a visualization that is critical for choosing the optimal model size and composition after performing Best Subset Selection.
Here is how you interpret this plot:
Structure: The plot is essentially a heatmap or grid in which each row represents the best possible model for a specific number of predictors (\(p\)), and the columns list all the potential explanatory variables from the full dataset.
Model Size (Rows): Each row corresponds to the best model of a given size, ranging from \(1\) predictor up to the maximum you specified (nvmax = 10 in our call), with the row labeled by that model's Adjusted \(R^2\).
Variable Inclusion (Cells): A shaded (black) cell indicates that the variable in that column is included in the model for that row; a blank cell means it is excluded.
Interpretation Goal: Maximizing Adjusted \(R^2\): You want the row with the highest Adjusted \(R^2\), since Adjusted \(R^2\) increases only when an added variable improves the fit by more than would be expected by chance.
Decision Rule: If several rows achieve nearly identical Adjusted \(R^2\) values, prefer the smallest model among them.
In short, you are looking for the model (row) that incorporates the minimum number of variables necessary to achieve the maximum Adjusted \(R^2\).
By examining the plots, the minimum \(C_p\), and the maximum Adjusted \(R^2\), we can objectively choose the most efficient model.
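One way to operationalize the decision rule above (a sketch using an arbitrary, illustrative tolerance of our own choosing) is to pick the smallest model whose Adjusted \(R^2\) falls within a small margin of the maximum:
# Smallest model whose Adjusted R-squared is within a small tolerance of the maximum
adjr2_vals <- best_model_summary$adjr2
tol <- 0.001                                  # tolerance: an illustrative choice
parsimonious_size <- min(which(adjr2_vals >= max(adjr2_vals) - tol))
cat("Smallest model within", tol, "of the maximum Adjusted R-squared:",
    parsimonious_size, "predictors\n")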
The step() function in base R provides fast, automated selection based on the Akaike Information Criterion (AIC), where the goal is to find the model that minimizes AIC.
Forward Selection starts with a model containing only the intercept (the null model) and sequentially adds the single predictor whose inclusion yields the largest drop in AIC, stopping when no further addition lowers the AIC.
# Start with the null model (intercept only)
null_model <- lm(log_Sale_Price ~ 1, data = full_data)
# Perform Forward Selection
model_forward <- step(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "forward",
trace = 0)
cat("--- FORWARD SELECTION (Final Model Formula) ---\n")
--- FORWARD SELECTION (Final Model Formula) ---
print(formula(model_forward))
log_Sale_Price ~ Gr_Liv_Area + Year_Built + Overall_Cond + Total_Bsmt_SF +
Garage_Cars + Fireplaces + Heating_QC + Year_Remod_Add +
Garage_Type + Lot_Area + Foundation + Latitude + Garage_Area +
Mas_Vnr_Area + First_Flr_SF + Longitude
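As a quick sanity check (not part of the output above), the AIC quantity that step() minimizes can be inspected directly with extractAIC(), which returns the equivalent degrees of freedom and the AIC value; the forward-selected model should have a much lower AIC than the null model.
# Compare the AIC used by step() for the null and forward-selected models
extractAIC(null_model)     # c(equivalent df, AIC) for the intercept-only model
extractAIC(model_forward)  # should be substantially lower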
Backward Elimination starts with the full model (containing all predictors) and sequentially removes, at each step, the predictor whose removal most reduces the AIC, stopping when no removal lowers the AIC any further.
# Start with the full model
# Perform Backward Elimination
model_backward <- step(full_model,
direction = "backward",
trace = 0)
cat("--- BACKWARD ELIMINATION (Final Model Formula) ---\n")
--- BACKWARD ELIMINATION (Final Model Formula) ---
print(formula(model_backward))
log_Sale_Price ~ Gr_Liv_Area + Year_Built + Garage_Area + Total_Bsmt_SF +
Garage_Cars + Year_Remod_Add + First_Flr_SF + Garage_Type +
Fireplaces + Longitude + Foundation + Lot_Area + Latitude +
Heating_QC + Mas_Vnr_Area + Overall_Cond
Stepwise Selection is a hybrid approach that iteratively performs both forward and backward steps. At each stage, it may add or remove a predictor based on the AIC change.
# Start with the full model
# Perform Stepwise Selection
model_stepwise <- step(full_model,
direction = "both",
trace = 0)
cat("--- STEPWISE SELECTION (Final Model Formula) ---\n")
--- STEPWISE SELECTION (Final Model Formula) ---
print(formula(model_stepwise))
log_Sale_Price ~ Gr_Liv_Area + Year_Built + Garage_Area + Total_Bsmt_SF +
Garage_Cars + Year_Remod_Add + First_Flr_SF + Garage_Type +
Fireplaces + Longitude + Foundation + Lot_Area + Latitude +
Heating_QC + Mas_Vnr_Area + Overall_Cond
Observation: Notice that for this dataset, all three automated methods (Forward, Backward, and Stepwise) converge on the same set of 16 predictors (listed in different orders, reflecting the order in which they entered), which suggests a robust, well-supported set of predictors.
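This convergence can be verified programmatically (a short check of our own) by comparing the term labels of the three fitted models, ignoring the order in which the variables entered:
# Confirm that the three searches selected the same set of predictors
forward_terms  <- attr(terms(model_forward),  "term.labels")
backward_terms <- attr(terms(model_backward), "term.labels")
stepwise_terms <- attr(terms(model_stepwise), "term.labels")
setequal(forward_terms, backward_terms)   # TRUE if the same variables were chosen
setequal(backward_terms, stepwise_terms)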
Finally, we compare the summary statistics of the original Full Model versus the Selected Model (using the results from Backward Elimination) to highlight the benefit of variable selection.
# We will use the model chosen by Backward Elimination (model_backward) for comparison
final_selected_model <- model_backward
cat("\n\n--- MODEL CONTRAST: Full vs. Selected ---\n")
--- MODEL CONTRAST: Full vs. Selected ---
cat("--- Full Model Metrics ---\n")
--- Full Model Metrics ---
cat("R-squared (Full Model):", summary(full_model)$r.squared, "\n")
R-squared (Full Model): 0.8426886
cat("Adjusted R-squared (Full Model):", summary(full_model)$adj.r.squared, "\n")
Adjusted R-squared (Full Model): 0.8405657
cat("Number of Coefficients (Full Model):", length(coef(full_model)), "\n")
Number of Coefficients (Full Model): 40
cat("\n--- Selected Model Metrics ---\n")
--- Selected Model Metrics ---
cat("R-squared (Selected Model):", summary(final_selected_model)$r.squared, "\n")
R-squared (Selected Model): 0.8426038
cat("Adjusted R-squared (Selected Model):", summary(final_selected_model)$adj.r.squared, "\n")
Adjusted R-squared (Selected Model): 0.8407003
cat("Number of Coefficients (Selected Model):", length(coef(final_selected_model)), "\n")
Number of Coefficients (Selected Model): 36
A successful variable selection process, as demonstrated here, achieves a selected model with:
Similar or Higher Adjusted \(R^2\): This confirms that the removed variables were not contributing meaningfully to the overall predictive power of the model.
Fewer Parameters: The final model is more parsimonious, making the coefficients easier to interpret and potentially leading to better out-of-sample prediction (reduced risk of overfitting).
More Stable Coefficients: By removing redundant or non-significant variables, the remaining coefficients typically have smaller standard errors, increasing the precision of our estimates.
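To illustrate the last point, a brief sketch (our own, not part of the lecture output) compares the standard errors of the coefficients that appear in both models; the shared terms should show similar, and often slightly smaller, standard errors in the selected model:
# Compare standard errors of coefficients common to the full and selected models
se_full     <- summary(full_model)$coefficients[, "Std. Error"]
se_selected <- summary(final_selected_model)$coefficients[, "Std. Error"]
shared      <- intersect(names(se_full), names(se_selected))
round(cbind(Full = se_full[shared], Selected = se_selected[shared]), 5)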