BUA 345 - Lecture 17
More about Model Selection
Penelope Pooler Eisenbies
2026-03-16
Housekeeping
Upcoming Dates
HW 7 is available and is due on Wednesday, 3/18.
Demo videos were posted before Spring break.
HW 8 (Parts 1 and 2) is posted and is due on Monday, 3/23.
Quiz 2 will be on Thursday, 3/26, in the classroom.
Practice Questions for Quiz 2 are posted.
More Housekeeping
This Week’s Plan 📋
💥 Lecture 17 In-class Exercises - Q1 💥
Poll Everywhere - My User Name: penelopepoolereisenbies685
Review Question from Week 8 and HW 7.
If two predictor variables (X variables) in a model have a correlation of 0.85, what do you conclude?
Animals Data
First 10 Rows of Animals Data
Species                 TotalSleep  BodyWt  LNBodyWt  BrainWt  LNBrainWt  LifeSpan  LNLifeSpan  Gestation  PredF  ExposF  DangrF
Africangiantpouchedrat         8.3    1.00      0.00      6.6       1.89       4.5        1.50         42      3       1       3
Americanopossum               19.4    1.70      0.53      6.3       1.84       5.0        1.61         12      2       1       1
ArcticFox                     12.5    3.39      1.22     44.5       3.80      14.0        2.64         60      1       1       1
Baboon                         9.8   10.55      2.36    179.5       5.19      27.0        3.30        180      4       4       4
Bigbrownbat                   19.7    0.02     -3.77      0.3      -1.20      19.0        2.94         35      1       1       1
Braziliantapir                 6.2  160.00      5.08    169.0       5.13      30.4        3.41        392      4       5       4
Cat                           14.5    3.30      1.19     25.6       3.24      28.0        3.33         63      1       2       1
Chimpanzee                     9.7   52.16      3.95    440.0       6.09      50.0        3.91        230      1       1       1
Chinchilla                    12.5    0.43     -0.86      6.4       1.86       7.0        1.95        112      5       4       4
Cow                            3.9  465.00      6.14    423.0       6.05      30.0        3.40        281      5       5       5
Animals Data Dictionary - Description of Variables
Intuitively, there is likely to be redundancy between Predation, Exposure, and Danger.
Variable    Type          Description
Species     Nominal       Name of Species
TotalSleep  Quantitative  Total Sleep
BodyWt      Quantitative  Average Body Weight in kilograms
LNBodyWt    Quantitative  Natural Log of Body Weight
BrainWt     Quantitative  Average Brain Weight in grams
LNBrainWt   Quantitative  Natural Log of Brain Weight
LifeSpan    Quantitative  Maximum Life Span in years
LNLifeSpan  Quantitative  Natural Log of Life Span
Gestation   Quantitative  Gestation Time in days
PredF       Ordinal       Predation Index (1=least likely to be prey)
ExposF      Ordinal       Sleep Exposure Index (1=least exposed)
DangrF      Ordinal       Overall Danger Index (1=least danger from other animals)
Multicollinearity Concerns in Animals Dataset
LNBodyWt and LNBrainWt (R = 0.95):
These two predictors cannot both be in the final model.
LNBrainWt and LNLifeSpan (R = 0.79):
These two predictors ideally should not both be in the final model.
Predation (PredF) and Danger (DangrF) (R = 0.95):
These two predictors cannot both be in the final model.
Exposure (ExposF) and Danger (DangrF) (R = 0.78):
These two predictors ideally should not both be in the final model.
NOTE: Students should know the commands for creating a correlation matrix with rounded values.
See HW 7 and next two slides
Correlation Matrix of Quantitative Animal Variables
animals <- animals |> filter(!is.na(LifeSpan) & !is.na(Gestation)) # exclude missing values
animals |> select(TotalSleep, LNBodyWt, LNBrainWt, LNLifeSpan) |>
  cor() |> round(2) |> kable() |> kable_styling(full_width = F)
            TotalSleep  LNBodyWt  LNBrainWt  LNLifeSpan
TotalSleep        1.00     -0.56      -0.57       -0.37
LNBodyWt         -0.56      1.00       0.95        0.71
LNBrainWt        -0.57      0.95       1.00        0.79
LNLifeSpan       -0.37      0.71       0.79        1.00
Correlation Matrix of Ordinal Variables
animals_ordinal |> cor() |> round(2) |> kable() |> kable_styling(full_width = F)
           Predation  Exposure  Danger
Predation       1.00      0.66    0.95
Exposure        0.66      1.00    0.78
Danger          0.95      0.78    1.00
Backward Elimination - Animals Data Final Model
Model Summary
---------------------------------------------------------------
R 0.857 RMSE 2.329
R-Squared 0.734 MSE 5.423
Adj. R-Squared 0.655 Coef. Var 25.223
Pred R-Squared 0.547 AIC 247.894
MAE 1.857 SBC 272.488
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------
Regression 734.163 11 66.742 9.294 0.0000
Residual 265.708 37 7.181
Total 999.871 48
------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------
(Intercept) 6.751 3.305 2.043 0.048 0.054 13.448
LNBodyWt -0.698 0.244 -0.442 -2.859 0.007 -1.192 -0.203
LNLifeSpan 2.855 1.133 0.591 2.519 0.016 0.559 5.151
Gestation -0.020 0.006 -0.447 -3.285 0.002 -0.032 -0.008
PredF2 13.998 4.132 0.041 3.388 0.002 5.626 22.369
PredF3 11.883 5.514 -0.494 2.155 0.038 0.711 23.056
PredF4 2.654 4.102 0.021 0.647 0.522 -5.658 10.966
PredF5 -0.782 4.262 -0.316 -0.183 0.855 -9.418 7.855
LNLifeSpan:PredF2 -5.367 1.478 -0.471 -3.632 0.001 -8.361 -2.373
LNLifeSpan:PredF3 -7.390 3.141 -0.588 -2.352 0.024 -13.755 -1.025
LNLifeSpan:PredF4 -0.941 1.356 -0.083 -0.694 0.492 -3.689 1.807
LNLifeSpan:PredF5 -1.043 1.446 -0.091 -0.721 0.475 -3.973 1.887
-----------------------------------------------------------------------------------------------
Model Selection Methods
Recall that in Multiple Linear Regression (MLR) the goal is to choose the simplest, most accurate model, i.e. the ‘BEST’ set of independent variables.
How do we decide which variables should be in our model?
There are many methods:
We’ve discussed Backward Elimination, which can also be done manually in any software (not recommended).
Backward Elimination starts with all potential terms (including potential interaction terms) in the model and removes the least significant term at each step.
This is referred to as starting with a full or saturated model.
Forward Selection: By default, this procedure starts with an empty model and adds the most significant term at each step until there are no more useful terms to add.
Forward selection also needs to know what terms are in the full model.
Stepwise Selection: By default, this procedure starts with an empty model and then adds or removes a term at each step.
Common Practice: Try multiple methods to develop a preliminary final model and then tweak as needed.
Steps for model selection using multiple methods are similar to the steps for Backward Elimination (Week 8 Lectures)
Not all steps are ALWAYS required. It depends on how complex the data are.
In the following example, we only need to do part of Step 1 plus Steps 2, 3, and 6.
For Step 1, we only need to examine correlations.
In this case, Step 7 will be apparent.
We can add model estimates to data for future interpretation (Step 8)
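To make the forward procedure concrete, here is a minimal sketch of p-value based Forward Selection written with base R only. It is not the olsrr implementation used in this lecture: the built-in mtcars data, the candidate terms, and the 0.1 entry threshold are all assumptions chosen so the example runs on its own.

```r
# Forward selection by p-value, sketched with base R only.
# mtcars and the candidate terms are stand-ins, not the lecture's data.
candidate_terms <- c("wt", "hp", "qsec", "drat")
scope <- reformulate(candidate_terms)      # ~ wt + hp + qsec + drat (the "full" model terms)
p_enter <- 0.1                             # entry threshold (assumed)

current <- lm(mpg ~ 1, data = mtcars)      # start from the empty (intercept-only) model
added <- character(0)                      # terms in the order they enter

while (length(added) < length(candidate_terms)) {
  trial <- add1(current, scope = scope, test = "F")  # F-test for each unused term
  trial <- trial[-1, , drop = FALSE]                 # drop the "<none>" row
  p_values <- trial[["Pr(>F)"]]
  if (min(p_values) >= p_enter) break                # no remaining term is useful
  best <- rownames(trial)[which.min(p_values)]       # most significant candidate
  current <- update(current, as.formula(paste(". ~ . +", best)))
  added <- c(added, best)
}

added             # order in which terms entered
formula(current)  # model chosen by this sketch
```

The loop also shows why Forward Selection needs the full model: the scope tells add1() which terms it is allowed to try at each step.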
💥 Lecture 17 In-class Exercises - Q2 💥
Which model selection method is characterized by starting with NO (0) terms in the model and then adding terms one by one until none of the remaining terms is significant?
Backward Elimination
Stepwise Selection
Forward Selection
Adjusted \(R^2\)
Steps for Model Selection Using Multiple Methods
1. Examine a matrix of scatterplots and histograms and determine whether any transformations are needed to linearize relationships between the continuous predictors and the response variable.
Also look at the correlation matrix to check whether there are pairs of variables to be concerned about.
2. Create a ‘saturated’ model with all potential predictor variables and interaction terms (Subjective!).
3. Use Backward Elimination, Forward Selection, and Stepwise Selection to find preliminary candidate models. (These are automated procedures!)
4. Carefully examine results to see where these candidate models agree and disagree.
Steps for Model Selection Cont’d
5. Examine predictors in preliminary candidate models to confirm they are not too highly correlated with each other.
If two predictor variables in any model have a correlation of 0.8 or greater, drop one of them.
Rerun model selection methods if a candidate model is substantially changed (not always needed).
6. Compare model fit statistics for the final candidate models from all three methods.
7. Decide on the final candidate and make final modifications, if needed.
8. Interpret the final model and use it for estimation.
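The 0.8 rule for correlated predictors can be checked programmatically instead of by reading the matrix. The sketch below lists every pair of predictors whose absolute correlation is at least 0.8. It uses the built-in mtcars data as a stand-in, and the chosen columns are an assumption for illustration only.

```r
# Flag predictor pairs with |r| >= 0.8 (mtcars is a stand-in dataset).
predictors <- mtcars[, c("wt", "disp", "hp", "qsec", "drat")]
cor_mat <- round(cor(predictors), 2)

# Keep the upper triangle only so each pair is listed once
high <- which(abs(cor_mat) >= 0.8 & upper.tri(cor_mat), arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
flagged
```

Using abs() matters here because a strong negative correlation is as much of a multicollinearity concern as a strong positive one.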
Forward Selection of Animals Data
Full Model:
# full model (subjective)
animals_full <- lm(TotalSleep ~ LNBodyWt + LNBrainWt +
                     LNLifeSpan + Gestation +
                     PredF + ExposF + DangrF +
                     LNBodyWt * Gestation + LNLifeSpan * PredF +
                     LNLifeSpan * ExposF + LNLifeSpan * DangrF, data = animals)
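As a small aside on the formula syntax above, the sketch below shows how R expands the * operator into main effects plus an interaction, and how a five-level factor becomes four indicator columns. The toy data frame is hypothetical, and the sketch assumes PredF is stored as a factor with level 1 as the reference, which is what coefficient names such as PredF2 through PredF5 suggest.

```r
# How `*` expands in a model formula (no data needed for this step)
full_formula <- TotalSleep ~ LNBodyWt * Gestation + LNLifeSpan * PredF
term_labels <- attr(terms(full_formula), "term.labels")
term_labels   # main effects plus the two interaction terms

# How a 5-level factor is coded: one indicator per non-reference level
toy <- data.frame(PredF = factor(1:5))   # hypothetical toy data
dummy_columns <- colnames(model.matrix(~ PredF, data = toy))
dummy_columns
```

Because a * b already includes a and b, listing the main effects separately in the full model is harmless but not required.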
Forward Model Selection
(animals_FS <- ols_step_forward_p(animals_full, p_val = 0.1, progress = F))
Stepwise Summary
-------------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
-------------------------------------------------------------------------------
0 Base Model 290.830 294.614 147.782 0.00000 0.00000
1 Gestation 268.856 274.532 123.817 0.38693 0.37388
2 DangrF 251.692 264.935 98.670 0.63316 0.59050
3 LNBrainWt 248.061 263.196 93.052 0.67298 0.62626
4 PredF 241.628 264.330 78.645 0.75641 0.69231
5 LNLifeSpan 233.996 258.589 69.041 0.79989 0.74039
6 LNLifeSpan:PredF 228.314 260.475 55.409 0.84864 0.77984
7 LNBodyWt 228.450 262.503 53.568 0.85429 0.78143
8 ExposF 229.245 270.865 46.411 0.87421 0.78437
-------------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
---------------------------------------------------------------
R 0.935 RMSE 1.602
R-Squared 0.874 MSE 2.567
Adj. R-Squared 0.784 Coef. Var 19.948
Pred R-Squared -Inf AIC 229.245
MAE 1.168 SBC 270.865
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------
Regression 874.101 20 43.705 9.73 0.0000
Residual 125.769 28 4.492
Total 999.871 48
------------------------------------------------------------------
Parameter Estimates
------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
------------------------------------------------------------------------------------------------
(Intercept) 6.184 2.897 2.135 0.042 0.250 12.118
Gestation -0.020 0.007 -0.454 -2.789 0.009 -0.035 -0.005
DangrF2 -6.733 1.877 -0.641 -3.587 0.001 -10.578 -2.888
DangrF3 -8.462 3.579 -0.655 -2.364 0.025 -15.793 -1.130
DangrF4 -8.780 4.650 -0.718 -1.888 0.069 -18.305 0.745
DangrF5 -20.146 6.095 -1.561 -3.305 0.003 -32.632 -7.661
LNBrainWt -0.180 0.684 -0.092 -0.264 0.794 -1.582 1.221
PredF2 14.954 3.672 1.462 4.072 0.000 7.431 22.477
PredF3 16.956 5.583 1.230 3.037 0.005 5.520 28.393
PredF4 11.583 5.230 0.897 2.215 0.035 0.871 22.295
PredF5 0.598 6.292 0.055 0.095 0.925 -12.290 13.486
LNLifeSpan 3.218 0.937 0.666 3.433 0.002 1.298 5.138
LNBodyWt -0.803 0.511 -0.508 -1.572 0.127 -1.848 0.243
ExposF2 -0.082 1.180 -0.008 -0.070 0.945 -2.499 2.335
ExposF3 0.481 1.723 0.029 0.279 0.782 -3.049 4.011
ExposF4 3.183 1.854 0.213 1.716 0.097 -0.615 6.981
ExposF5 4.951 4.042 0.405 1.225 0.231 -3.328 13.231
PredF2:LNLifeSpan -3.401 1.455 -0.810 -2.337 0.027 -6.381 -0.420
PredF3:LNLifeSpan -5.334 4.249 -0.603 -1.255 0.220 -14.037 3.370
PredF4:LNLifeSpan -1.707 1.767 -0.373 -0.966 0.342 -5.327 1.913
PredF5:LNLifeSpan 3.238 2.070 0.856 1.565 0.129 -1.002 7.478
------------------------------------------------------------------------------------------------
Final Forward (and Stepwise) Selection Model
Drop DangrF due to multicollinearity with PredF
Drop LNBrainWt due to multicollinearity with LNBodyWt
Leave in ExposF(?) and compare to Backward Elimination Model
Stepwise Selection arrived at same model as Forward Selection.
Model Summary
---------------------------------------------------------------
R 0.882 RMSE 2.131
R-Squared 0.777 MSE 4.543
Adj. R-Squared 0.676 Coef. Var 24.445
Pred R-Squared 0.407 AIC 247.220
MAE 1.729 SBC 279.381
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------
Regression 777.272 15 51.818 7.682 0.0000
Residual 222.599 33 6.745
Total 999.871 48
------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------
(Intercept) 7.151 3.217 2.223 0.033 0.605 13.696
LNBodyWt -0.796 0.251 -0.504 -3.167 0.003 -1.307 -0.285
LNLifeSpan 2.604 1.109 0.539 2.349 0.025 0.348 4.860
Gestation -0.014 0.007 -0.313 -2.102 0.043 -0.027 0.000
ExposF2 -2.416 1.272 -0.223 -1.899 0.066 -5.004 0.173
ExposF3 -1.237 1.905 -0.075 -0.649 0.521 -5.113 2.640
ExposF4 1.096 2.009 0.073 0.545 0.589 -2.991 5.182
ExposF5 -2.379 2.864 -0.195 -0.831 0.412 -8.206 3.448
PredF2 12.917 4.095 0.173 3.154 0.003 4.585 21.249
PredF3 14.428 5.566 -0.597 2.592 0.014 3.103 25.752
PredF4 1.813 4.012 -0.019 0.452 0.654 -6.349 9.974
PredF5 -1.068 4.590 -0.206 -0.233 0.817 -10.407 8.270
LNLifeSpan:PredF2 -4.405 1.530 -0.387 -2.880 0.007 -7.518 -1.293
LNLifeSpan:PredF3 -8.959 3.193 -0.712 -2.806 0.008 -15.454 -2.463
LNLifeSpan:PredF4 -0.814 1.418 -0.072 -0.574 0.570 -3.699 2.071
LNLifeSpan:PredF5 -0.458 1.795 -0.040 -0.255 0.800 -4.110 3.195
-----------------------------------------------------------------------------------------------
Comparing Model Results
Method                      Adj. R-Squared    C(p)      AIC     RMSE
Backward Elimination                 0.655  23.527  247.894    2.329
Forward/Stepwise Selection           0.676  23.654  247.220    2.131
Model Validation Plot (R = 0.88)
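The R in the plot title appears to be the correlation between observed and model-estimated values, which matches R = 0.882 in the final model summary. Under that assumption, the sketch below adds fitted values to a data set and confirms that this correlation equals the square root of R-Squared. The mtcars data and the model are stand-ins, not the animals model.

```r
# Observed vs. fitted correlation equals sqrt(R-squared) for a linear model
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model

results <- mtcars
results$fitted_mpg <- fitted(fit)          # add model estimates to the data

r_obs_fit <- cor(results$mpg, results$fitted_mpg)
c(R = r_obs_fit, sqrt_R2 = sqrt(summary(fit)$r.squared))
```

Plotting the observed response against the fitted column gives a plot of the same kind as the validation plot: points close to a straight line indicate estimates that track the data well.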
Wine Data - Model Selection Example
Can we determine what factors affect wine quality even if we KNOW NOTHING about wine cultivation and chemistry?
Maybe!
Import Wine Data
Notice that all variables are numeric (<dbl> stands for double, i.e. a decimal value).
wine <- read_csv("data/wine.csv", show_col_types = F)
head(wine) |> kable() |> kable_styling(full_width = F)
5   9.3  0.48  0.29  2.1  0.127   6  16  3.22  0.72  11.2
6   9.1  0.22  0.24  2.1  0.078   1  28  3.41  0.87  10.3
7   7.9  0.34  0.36  1.9  0.065   5  10  3.27  0.54  11.2
5   7.2  1.00  0.00  3.0  0.102   7  16  3.43  0.46  10.0
7  11.9  0.43  0.66  3.1  0.109  10  23  3.15  0.85  10.4
5   7.2  0.49  0.24  2.2  0.070   5  36  3.33  0.48   9.4
Examine Correlation Matrix for Multicollinearity
# correlation matrix
(cor_wine <- wine |> cor() |> round(2))
max(cor_wine[cor_wine < 1])
Model Selection
wine_full <- lm(Wine_Quality ~ ., data = wine)            # specify full model
wine_BE <- ols_step_backward_p(wine_full, progress = F)   # backward elimination
wine_FS <- ols_step_forward_p(wine_full, progress = F)    # forward selection
wine_SS <- ols_step_both_p(wine_full, progress = F)       # stepwise selection
Comparing Model Results
Look at the LAST step for each method to determine which method results in the best fit.
Comparison Measures:
Adj. \(R^2\) : Higher value indicates better model fit
C(p): Lower value indicates better model fit (also referred to as Mallows’ C(p)).
AIC: Lower value indicates better model fit (Akaike Information Criterion).
RMSE: Lower value indicates better model fit (Root Mean Square Error).
By comparing these measures and accounting for our understanding of these procedures, we can determine that TWO of these methods arrived at the same model.
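For reference, the four comparison measures can be computed by hand in base R. The sketch below does this for three stand-in models fit to the built-in mtcars data, so the models and names are assumptions. Mallows' C(p) is computed as SSE_p / MSE_full - n + 2p, where p counts the coefficients including the intercept. RMSE divides by n, which appears to match the model summaries in this lecture (for the Backward Elimination model, 265.708 / 49 = 5.423 is the reported MSE).

```r
# Adj. R-squared, Mallows' C(p), AIC, and RMSE for candidate models
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)  # stand-in full model
candidates <- list(
  small  = lm(mpg ~ wt, data = mtcars),
  medium = lm(mpg ~ wt + hp, data = mtcars),
  full   = full
)

n <- nrow(mtcars)
mse_full <- sum(residuals(full)^2) / df.residual(full)  # error variance from full model

fit_stats <- function(model) {
  sse <- sum(residuals(model)^2)
  p <- length(coef(model))                 # coefficients, including intercept
  c(adj_r2 = summary(model)$adj.r.squared,
    cp     = sse / mse_full - n + 2 * p,   # Mallows' C(p)
    aic    = AIC(model),
    rmse   = sqrt(mean(residuals(model)^2)))
}

comparison <- round(t(sapply(candidates, fit_stats)), 3)
comparison
```

By construction, C(p) for the full model equals its number of coefficients, so C(p) is only informative for comparing smaller candidates against the full model.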
💥 Lecture 17 In-class Exercises - Q3 💥
Which two model selection methods arrived at the same model for the wine data?
On the next few slides I will show pairs of stepwise summaries so you can compare them.
Backward Elimination and Forward Selection
Backward Elimination
Forward Selection
Backward Elimination and Stepwise Selection
Backward Elimination
Stepwise Selection
Forward Selection and Stepwise Selection
Forward Selection
Stepwise Selection
Wine Model Validation Plot (R = 0.58)
Best Subsets
Another model selection method is ‘Best Subsets’
Output shows ‘Best’ one variable model, ‘Best’ two variable model, ‘Best’ three variable model, etc.
Each ‘Best’ model is determined by multiple Fit Statistics.
This method then examines which of these candidates is the overall best by comparing their fit statistics.
If we are fortunate, the optimal choice from Best Subsets matches a model already selected by Backward Elimination, or Forward or Stepwise Selection.
In this case (and HW 8) we are fortunate.
NOTE: The ols_step_best_subset command is VERY slow. You do not need to rerun it. Output is provided.
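To see what Best Subsets does and why it is slow, here is a brute-force sketch in base R. It is not the olsrr implementation: it simply fits every possible subset of four stand-in predictors from mtcars (an assumption for illustration), keeps the best model of each size, and then compares those winners. With k candidate predictors there are 2^k - 1 models to fit, so the work doubles with every added predictor.

```r
# Brute-force best subsets on a small stand-in problem
predictors <- c("wt", "hp", "qsec", "drat")   # 2^4 - 1 = 15 possible models

best_by_size <- lapply(seq_along(predictors), function(k) {
  subsets <- combn(predictors, k, simplify = FALSE)   # all subsets of size k
  fits <- lapply(subsets, function(vars) {
    lm(reformulate(vars, response = "mpg"), data = mtcars)
  })
  # Within one size, the highest R-squared model is also best on the other measures
  r2 <- sapply(fits, function(f) summary(f)$r.squared)
  best <- which.max(r2)
  data.frame(n_vars     = k,
             predictors = paste(subsets[[best]], collapse = " + "),
             adj_r2     = summary(fits[[best]])$adj.r.squared,
             aic        = AIC(fits[[best]]))
})

best_subsets <- do.call(rbind, best_by_size)
best_subsets                                          # 'Best' model of each size
best_subsets[which.max(best_subsets$adj_r2), ]        # overall best by Adj. R-squared
```

The first table corresponds to the list of 'Best' models by size, and the final line mirrors the step of comparing fit statistics across those models.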
Some of the Best Subsets Plots
Reading Best Subsets Output
Tabular Output
Bottom table shows which model performs best, based on all of the fit statistics.
For example, if Model 3 (the three-variable model) were best, it would have the HIGHEST Adjusted \(R^2\), Lowest C(p), and Lowest AIC.
We can see from the bottom table that Model 3 is not the best.
Model 7 IS the best because it does have the HIGHEST Adjusted \(R^2\), Lowest C(p), and Lowest AIC.
Top table lists the variables in each of the ‘Best’ models.
Preview of HW 8 - Part 1
Review model comparisons for Animal Data from first part of lecture.
Compare the optimal best subset model (Model 7) to the model found by both Backward Elimination and Forward Selection.
The goal is to determine to what extent they agree.
Spoiler: They are in complete agreement, which indicates that we have consensus on the model for these data.
Reminder of Upcoming Dates
Today’s Lecture (3/17) is the third and final lecture on model and variable selection.
HW 7 is due tomorrow, Wed., 3/18.
HW 8 is now posted and is due Monday, 3/23.
Quiz 2 is on Thursday, March 26th, in the classroom
Practice Questions for Quiz 2 are Posted
Key Points from this Week
To submit an Engagement Question or Comment about material from Lecture 17: Submit it by midnight today (day of lecture).