BUA 345 - Lecture 17

More about Model Selection

Penelope Pooler Eisenbies

2026-03-16

Housekeeping

Upcoming Dates

HW 7 is available and is due on Wednesday, 3/18.

Demo videos were posted before Spring break.

HW 8 (Parts 1 and 2) is posted and is due on Monday, 3/23.
- Part 1 of HW 8 can be completed after today’s lecture.
- Part 2 of HW 8 pertains to Thursday’s lecture on Logistic Regression
Quiz 2 will be on Thursday, 3/26, in the classroom.
Practice Questions for Quiz 2 are Posted

More Housekeeping

This Week’s Plan 📋

Today (Tuesday 3/17)
- Quick review of model selection concepts
- Best Subsets method
- Measures of model fit
Thursday 3/19
- Logistic Regression of binary response data

💥 Lecture 17 In-class Exercises - Q1 💥

Poll Everywhere - My User Name: penelopepoolereisenbies685

Review Question from Week 8 and HW 7.

If two predictor variables (X variables) in a model have a correlation of 0.85, what do you conclude?

Review of Animals Data

Question: What factors affect a mammal’s sleep duration?
Animals Data Notes:
- Population was limited to animals under 1000 pounds (two elephant species excluded).
- Natural log (LN) transformed variables were added to original data.
- Observations with missing values are removed below
- Working dataset has 49 observations (49 different species)

Animals Data

First 10 Rows of Animals Data
Species	TotalSleep	BodyWt	LNBodyWt	BrainWt	LNBrainWt	LifeSpan	LNLifeSpan	Gestation	PredF	ExposF	DangrF
Africangiantpouchedrat	8.3	1.00	0.00	6.6	1.89	4.5	1.50	42	3	1	3
Americanopossum	19.4	1.70	0.53	6.3	1.84	5.0	1.61	12	2	1	1
ArcticFox	12.5	3.39	1.22	44.5	3.80	14.0	2.64	60	1	1	1
Baboon	9.8	10.55	2.36	179.5	5.19	27.0	3.30	180	4	4	4
Bigbrownbat	19.7	0.02	-3.77	0.3	-1.20	19.0	2.94	35	1	1	1
Braziliantapir	6.2	160.00	5.08	169.0	5.13	30.4	3.41	392	4	5	4
Cat	14.5	3.30	1.19	25.6	3.24	28.0	3.33	63	1	2	1
Chimpanzee	9.7	52.16	3.95	440.0	6.09	50.0	3.91	230	1	1	1
Chinchilla	12.5	0.43	-0.86	6.4	1.86	7.0	1.95	112	5	4	4
Cow	3.9	465.00	6.14	423.0	6.05	30.0	3.40	281	5	5	5

Animals Data Dictionary - Description of Variables

Intuitively, there is likely to be redundancy between Predation, Exposure, and Danger.

Variable	Type	Description
Species	Nominal	Name of Species
TotalSleep	Quantitative	Total Sleep
BodyWt	Quantitative	Average Body Weight in kilograms
LNBodyWt	Quantitative	Natural Log of Body Weight
BrainWt	Quantitative	Average Brain Weight in grams
LNBrainWt	Quantitative	Natural Log of Brain Weight
LifeSpan	Quantitative	Maximum Life Span in years
LNLifeSpan	Quantitative	Natural Log of Life Span
Gestation	Quantitative	Gestation Time in days
PredF	Ordinal	Predation Index (1=least likely to be prey)
ExposF	Ordinal	Sleep Exposure Index (1=least exposed)
DangrF	Ordinal	Overall Danger Index (1=least danger from other animals)

Multicollinearity Concerns in Animals Dataset

LNBodyWt and LNBrainWt (R = 0.95):
- These two predictors cannot both be in the final model.
LNBrainWt and LNLifeSpan (R = 0.79):
- These two predictors ideally should not both be in the final model.
Predation (PredF) and Danger (DangrF) (R = 0.95):
- These two predictors cannot both be in the final model.
Exposure (ExposF) and Danger (DangrF) (R = 0.78):
- These two predictors ideally should not both be in the final model.
NOTE: Students should know the commands for creating a correlation matrix with rounded values.
- See HW 7 and next two slides

Correlation Matrix of Quantitative Animal Variables

animals <- animals |> filter(!is.na(LifeSpan) & !is.na(Gestation)) # exclude missing values
animals |> select(TotalSleep, LNBodyWt, LNBrainWt, LNLifeSpan) |> cor() |> round(2) |> kable() |> kable_styling(full_width = F)

	TotalSleep	LNBodyWt	LNBrainWt	LNLifeSpan
TotalSleep	1.00	-0.56	-0.57	-0.37
LNBodyWt	-0.56	1.00	0.95	0.71
LNBrainWt	-0.57	0.95	1.00	0.79
LNLifeSpan	-0.37	0.71	0.79	1.00
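
A minimal base-R sketch of how the predictor pairs listed under Multicollinearity Concerns could be flagged automatically. It re-enters the rounded correlations printed above instead of recomputing them from the data. The 0.8 cutoff is the rule of thumb used in this lecture, and the names `cor_quant`, `cutoff`, and `flagged` are illustrative only.

```r
# Rounded correlations copied from the matrix above (not recomputed from raw data)
vars <- c("TotalSleep", "LNBodyWt", "LNBrainWt", "LNLifeSpan")
cor_quant <- matrix(
  c( 1.00, -0.56, -0.57, -0.37,
    -0.56,  1.00,  0.95,  0.71,
    -0.57,  0.95,  1.00,  0.79,
    -0.37,  0.71,  0.79,  1.00),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep only the predictors (drop the response) and look at each pair once
pred   <- cor_quant[-1, -1]
cutoff <- 0.8   # assumed rule-of-thumb threshold
idx    <- which(abs(pred) >= cutoff & upper.tri(pred), arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(pred)[idx[, "row"]],
  var2 = colnames(pred)[idx[, "col"]],
  r    = pred[idx],
  stringsAsFactors = FALSE
)
print(flagged)
```

With this cutoff only LNBodyWt and LNBrainWt are flagged. The LNBrainWt and LNLifeSpan pair (0.79) falls just under it, which is why that pair is described as one that "ideally" should not appear together.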

Correlation Matrix of ordinal Variables

animals_ordinal |> cor() |> round(2) |> kable() |> kable_styling(full_width = F)

	Predation	Exposure	Danger
Predation	1.00	0.66	0.95
Exposure	0.66	1.00	0.78
Danger	0.95	0.78	1.00

Backward Elimination - Animal Data Final Model

                         Model Summary                          
---------------------------------------------------------------
R                       0.857       RMSE                 2.329 
R-Squared               0.734       MSE                  5.423 
Adj. R-Squared          0.655       Coef. Var           25.223 
Pred R-Squared          0.547       AIC                247.894 
MAE                     1.857       SBC                272.488 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                              ANOVA                                
------------------------------------------------------------------
               Sum of                                             
              Squares        DF    Mean Square      F        Sig. 
------------------------------------------------------------------
Regression    734.163        11         66.742    9.294    0.0000 
Residual      265.708        37          7.181                    
Total         999.871        48                                   
------------------------------------------------------------------

                                      Parameter Estimates                                       
-----------------------------------------------------------------------------------------------
            model      Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
-----------------------------------------------------------------------------------------------
      (Intercept)     6.751         3.305                  2.043    0.048      0.054    13.448 
         LNBodyWt    -0.698         0.244       -0.442    -2.859    0.007     -1.192    -0.203 
       LNLifeSpan     2.855         1.133        0.591     2.519    0.016      0.559     5.151 
        Gestation    -0.020         0.006       -0.447    -3.285    0.002     -0.032    -0.008 
           PredF2    13.998         4.132        0.041     3.388    0.002      5.626    22.369 
           PredF3    11.883         5.514       -0.494     2.155    0.038      0.711    23.056 
           PredF4     2.654         4.102        0.021     0.647    0.522     -5.658    10.966 
           PredF5    -0.782         4.262       -0.316    -0.183    0.855     -9.418     7.855 
LNLifeSpan:PredF2    -5.367         1.478       -0.471    -3.632    0.001     -8.361    -2.373 
LNLifeSpan:PredF3    -7.390         3.141       -0.588    -2.352    0.024    -13.755    -1.025 
LNLifeSpan:PredF4    -0.941         1.356       -0.083    -0.694    0.492     -3.689     1.807 
LNLifeSpan:PredF5    -1.043         1.446       -0.091    -0.721    0.475     -3.973     1.887 
-----------------------------------------------------------------------------------------------
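
As a worked check, the Model Summary values can be reproduced from the ANOVA table. The sketch below assumes n = 49 (the working dataset size stated earlier) and 11 regression degrees of freedom, and it uses the standard Gaussian log-likelihood form of AIC and SBC, which counts the error variance as one extra parameter. Variable names are illustrative.

```r
# Values copied from the ANOVA table above
ss_reg   <- 734.163
ss_res   <- 265.708
ss_total <- 999.871
n <- 49   # observations (species) in the working dataset
k <- 11   # regression degrees of freedom (terms other than the intercept)

r2     <- ss_reg / ss_total
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat <- (ss_reg / k) / (ss_res / (n - k - 1))

# The Model Summary's MSE and RMSE match SSE / n here (not SSE / residual df)
mse  <- ss_res / n
rmse <- sqrt(mse)

# Gaussian log-likelihood form of AIC and SBC (BIC);
# parameter count = (k + 1) coefficients + 1 error variance
n_par       <- k + 2
neg2_loglik <- n * (log(2 * pi) + 1 + log(ss_res / n))
aic <- neg2_loglik + 2 * n_par
sbc <- neg2_loglik + log(n) * n_par

round(c(R2 = r2, AdjR2 = adj_r2, F = f_stat, MSE = mse,
        RMSE = rmse, AIC = aic, SBC = sbc), 3)
```

Note that the residual Mean Square in the ANOVA table (7.181) divides by the 37 residual degrees of freedom, whereas the MSE in the Model Summary (5.423) divides by n.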

Model Selection Methods

Recall that in Multiple Linear Regression (MLR) the goal is to choose the simplest model that is still accurate, i.e., the ‘BEST’ set of independent variables.
How do we decide which variables should be in our model?
There are many methods:
We’ve discussed Backward Elimination, which can also be done manually in any software (not recommended).
Backward Elimination starts with all potential terms (including potential interaction terms) in the model and removes the least significant term at each step.
- This is referred to as starting with a full or saturated model.
Forward Selection: By default, this procedure starts with an empty model and adds the most significant term at each step until there are no more useful terms to add.
- Forward selection also needs to know what terms are in the full model.
Stepwise Selection: By default, this procedure starts with an empty model and then adds or removes a term at each step.
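
The three procedures can be sketched with base R's `step()` function. This is an illustration under stated assumptions, not the method used in this lecture: `step()` selects terms by AIC, whereas the `ols_step_*_p` functions used here select by p-value, and the built-in `mtcars` data stands in for the lecture data.

```r
# Built-in mtcars data stands in for the lecture data (assumption for illustration)
full  <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)   # full model
empty <- lm(mpg ~ 1, data = mtcars)                        # intercept-only model

# Backward elimination: start from the full model and drop terms
be <- step(full, direction = "backward", trace = 0)

# Forward selection: start empty; `scope` tells step() which terms it may add
fs <- step(empty, scope = formula(full), direction = "forward", trace = 0)

# Stepwise selection: start empty; terms may be added or removed at each step
ss <- step(empty, scope = formula(full), direction = "both", trace = 0)

# Terms retained by each procedure
lapply(list(backward = be, forward = fs, stepwise = ss),
       function(m) attr(terms(m), "term.labels"))
```

The `scope` argument is the reason forward and stepwise selection need the full model: without it, the procedure has no list of candidate terms to add.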

Comments about Model Selection Methods

Common Practice: Try multiple methods to develop a preliminary final model and then tweak as needed.
Steps for model selection using multiple methods are similar to the steps for Backward Elimination (Week 8 Lectures).
Not all steps are ALWAYS required. It depends on how complex the data are.
In the following example, we only need to do part of Step 1 plus Steps 2, 3, and 6.
- For Step 1, we only need to examine correlations.
- In this case, Step 7 will be apparent.
- We can add model estimates to the data for future interpretation (Step 8).

💥 Lecture 17 In-class Exercises - Q2 💥

Poll Everywhere - My User Name: penelopepoolereisenbies685

Which model selection method is characterized by starting with NO (0) terms in the model and then adding terms one by one until none of the remaining terms is significant when added?

Backward Elimination
Stepwise Selection
Forward Selection
Adjusted \(R^2\)

Steps for Model Selection Using Multiple Methods

1. Examine Matrix of Scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and response variable.
  - Also look at correlation matrix to check if there are pairs of variables to be concerned about.
2. Create a ‘saturated’ model with all potential predictor variables and interaction terms (Subjective!).
3. Use Backward Elimination, Forward Selection, and Stepwise Selection to find preliminary candidate models. (These are automated procedures!)
  - Carefully examine results to see where these candidate models agree and disagree.

Steps for Model Selection Cont’d

4. Examine predictors in preliminary candidate models to confirm they are not too highly correlated with each other.
  - If two predictor variables in any model have a correlation of 0.8 or greater, drop one of them.
5. Rerun model selection methods if a candidate model is substantially changed (not always needed).
6. Compare model fit statistics for the final candidate models from all three methods.
7. Decide on final candidate and make final modifications, if needed.
8. Interpret final model and use for estimation.

Forward Selection of Animals Data

Full Model:

# full model (subjective)
animals_full <- lm(TotalSleep ~ LNBodyWt + LNBrainWt + 
                     LNLifeSpan + Gestation + 
                     PredF + ExposF + DangrF + 
                     LNBodyWt*Gestation + LNLifeSpan*PredF + 
                     LNLifeSpan*ExposF + LNLifeSpan*DangrF, data=animals)
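
The output tables list terms such as PredF2 and LNLifeSpan:PredF2 because of two R formula conventions: `a*b` expands to `a + b + a:b`, and a factor is coded as one indicator column per level after the first. The sketch below shows this on a small made-up data frame (`toy`, `x`, and `g` are hypothetical names) and assumes the ordinal indices are stored as factors, which is consistent with the output shown in this lecture.

```r
# Hypothetical toy data: one numeric predictor and one 3-level factor
toy <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  x = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
  g = factor(c(1, 2, 3, 1, 2, 3))
)

# x * g expands to x + g + x:g; level 1 of g is the reference category
mm <- model.matrix(y ~ x * g, data = toy)
colnames(mm)
```

The column names are `(Intercept)`, `x`, `g2`, `g3`, `x:g2`, and `x:g3`, mirroring how a five-level factor such as PredF contributes four main-effect rows and four interaction rows.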

Forward Model Selection

(animals_FS <- ols_step_forward_p(animals_full, p_val = 0.1, progress = F))


                                Stepwise Summary                                 
-------------------------------------------------------------------------------
Step    Variable              AIC        SBC       SBIC        R2       Adj. R2 
-------------------------------------------------------------------------------
 0      Base Model          290.830    294.614    147.782    0.00000    0.00000 
 1      Gestation           268.856    274.532    123.817    0.38693    0.37388 
 2      DangrF              251.692    264.935     98.670    0.63316    0.59050 
 3      LNBrainWt           248.061    263.196     93.052    0.67298    0.62626 
 4      PredF               241.628    264.330     78.645    0.75641    0.69231 
 5      LNLifeSpan          233.996    258.589     69.041    0.79989    0.74039 
 6      LNLifeSpan:PredF    228.314    260.475     55.409    0.84864    0.77984 
 7      LNBodyWt            228.450    262.503     53.568    0.85429    0.78143 
 8      ExposF              229.245    270.865     46.411    0.87421    0.78437 
-------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                          
---------------------------------------------------------------
R                       0.935       RMSE                 1.602 
R-Squared               0.874       MSE                  2.567 
Adj. R-Squared          0.784       Coef. Var           19.948 
Pred R-Squared           -Inf       AIC                229.245 
MAE                     1.168       SBC                270.865 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                              ANOVA                                
------------------------------------------------------------------
               Sum of                                             
              Squares        DF    Mean Square      F        Sig. 
------------------------------------------------------------------
Regression    874.101        20         43.705     9.73    0.0000 
Residual      125.769        28          4.492                    
Total         999.871        48                                   
------------------------------------------------------------------

                                      Parameter Estimates                                        
------------------------------------------------------------------------------------------------
            model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
------------------------------------------------------------------------------------------------
      (Intercept)      6.184         2.897                  2.135    0.042      0.250    12.118 
        Gestation     -0.020         0.007       -0.454    -2.789    0.009     -0.035    -0.005 
          DangrF2     -6.733         1.877       -0.641    -3.587    0.001    -10.578    -2.888 
          DangrF3     -8.462         3.579       -0.655    -2.364    0.025    -15.793    -1.130 
          DangrF4     -8.780         4.650       -0.718    -1.888    0.069    -18.305     0.745 
          DangrF5    -20.146         6.095       -1.561    -3.305    0.003    -32.632    -7.661 
        LNBrainWt     -0.180         0.684       -0.092    -0.264    0.794     -1.582     1.221 
           PredF2     14.954         3.672        1.462     4.072    0.000      7.431    22.477 
           PredF3     16.956         5.583        1.230     3.037    0.005      5.520    28.393 
           PredF4     11.583         5.230        0.897     2.215    0.035      0.871    22.295 
           PredF5      0.598         6.292        0.055     0.095    0.925    -12.290    13.486 
       LNLifeSpan      3.218         0.937        0.666     3.433    0.002      1.298     5.138 
         LNBodyWt     -0.803         0.511       -0.508    -1.572    0.127     -1.848     0.243 
          ExposF2     -0.082         1.180       -0.008    -0.070    0.945     -2.499     2.335 
          ExposF3      0.481         1.723        0.029     0.279    0.782     -3.049     4.011 
          ExposF4      3.183         1.854        0.213     1.716    0.097     -0.615     6.981 
          ExposF5      4.951         4.042        0.405     1.225    0.231     -3.328    13.231 
PredF2:LNLifeSpan     -3.401         1.455       -0.810    -2.337    0.027     -6.381    -0.420 
PredF3:LNLifeSpan     -5.334         4.249       -0.603    -1.255    0.220    -14.037     3.370 
PredF4:LNLifeSpan     -1.707         1.767       -0.373    -0.966    0.342     -5.327     1.913 
PredF5:LNLifeSpan      3.238         2.070        0.856     1.565    0.129     -1.002     7.478 
------------------------------------------------------------------------------------------------

Final Forward (and Stepwise) Selection Model

Drop DangrF due to multicollinearity with PredF
Drop LNBrainWt due to multicollinearity with LNBodyWt
Leave in ExposF(?) and compare to Backward Elimination Model
Stepwise Selection arrived at same model as Forward Selection.

                         Model Summary                          
---------------------------------------------------------------
R                       0.882       RMSE                 2.131 
R-Squared               0.777       MSE                  4.543 
Adj. R-Squared          0.676       Coef. Var           24.445 
Pred R-Squared          0.407       AIC                247.220 
MAE                     1.729       SBC                279.381 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                              ANOVA                                
------------------------------------------------------------------
               Sum of                                             
              Squares        DF    Mean Square      F        Sig. 
------------------------------------------------------------------
Regression    777.272        15         51.818    7.682    0.0000 
Residual      222.599        33          6.745                    
Total         999.871        48                                   
------------------------------------------------------------------

                                      Parameter Estimates                                       
-----------------------------------------------------------------------------------------------
            model      Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
-----------------------------------------------------------------------------------------------
      (Intercept)     7.151         3.217                  2.223    0.033      0.605    13.696 
         LNBodyWt    -0.796         0.251       -0.504    -3.167    0.003     -1.307    -0.285 
       LNLifeSpan     2.604         1.109        0.539     2.349    0.025      0.348     4.860 
        Gestation    -0.014         0.007       -0.313    -2.102    0.043     -0.027     0.000 
          ExposF2    -2.416         1.272       -0.223    -1.899    0.066     -5.004     0.173 
          ExposF3    -1.237         1.905       -0.075    -0.649    0.521     -5.113     2.640 
          ExposF4     1.096         2.009        0.073     0.545    0.589     -2.991     5.182 
          ExposF5    -2.379         2.864       -0.195    -0.831    0.412     -8.206     3.448 
           PredF2    12.917         4.095        0.173     3.154    0.003      4.585    21.249 
           PredF3    14.428         5.566       -0.597     2.592    0.014      3.103    25.752 
           PredF4     1.813         4.012       -0.019     0.452    0.654     -6.349     9.974 
           PredF5    -1.068         4.590       -0.206    -0.233    0.817    -10.407     8.270 
LNLifeSpan:PredF2    -4.405         1.530       -0.387    -2.880    0.007     -7.518    -1.293 
LNLifeSpan:PredF3    -8.959         3.193       -0.712    -2.806    0.008    -15.454    -2.463 
LNLifeSpan:PredF4    -0.814         1.418       -0.072    -0.574    0.570     -3.699     2.071 
LNLifeSpan:PredF5    -0.458         1.795       -0.040    -0.255    0.800     -4.110     3.195 
-----------------------------------------------------------------------------------------------

Comparing Model Results

Comparison Measures:
- Adj. \(R^2\): Higher value indicates better model fit
- C(p): Lower value indicates better model fit (also referred to as Mallows’ C(p)).
- AIC: Lower value indicates better model fit (Akaike Information Criterion).
- RMSE: Lower value indicates better model fit (Root Mean Square Error).
Decision is debatable but it seems worthwhile to include ExposF (Exposure).
Same data and models are covered in HW 8 - Part 1.

Method	Adjusted_R2	Mallows_Cp	AIC	RMSE
Backward Elimination	0.655	23.527	247.894	2.329
Forward/Stepwise Selection	0.676	23.654	247.220	2.131
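
A short sketch of reading the comparison table programmatically: for each measure, find which method it favors. The values are copied from the table above, and `fit` and `best` are illustrative names.

```r
# Values copied from the comparison table above
fit <- data.frame(
  Method      = c("Backward Elimination", "Forward/Stepwise Selection"),
  Adjusted_R2 = c(0.655, 0.676),
  Mallows_Cp  = c(23.527, 23.654),
  AIC         = c(247.894, 247.220),
  RMSE        = c(2.329, 2.131),
  stringsAsFactors = FALSE
)

# Higher is better for adjusted R-squared; lower is better for the rest
best <- c(
  Adjusted_R2 = fit$Method[which.max(fit$Adjusted_R2)],
  Mallows_Cp  = fit$Method[which.min(fit$Mallows_Cp)],
  AIC         = fit$Method[which.min(fit$AIC)],
  RMSE        = fit$Method[which.min(fit$RMSE)]
)
print(best)
```

Three of the four measures favor the Forward/Stepwise model, while C(p) slightly favors Backward Elimination, which is why the decision is described as debatable.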

Model Validation

How good is our model?
There are many ways to examine model fit.
Here are two straightforward ways:
- Check correlation between observed and estimated values
- Plot a scatterplot of observed and estimated values
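
Both checks can be sketched in a few lines of base R. The built-in `mtcars` data and the model `mpg ~ wt + hp` are stand-ins chosen for illustration, not the lecture's animals model.

```r
# Built-in mtcars data stands in for the lecture data (assumption for illustration)
m <- lm(mpg ~ wt + hp, data = mtcars)

observed  <- mtcars$mpg
estimated <- fitted(m)

# Correlation between observed and estimated values
r <- cor(observed, estimated)
round(r, 2)

# For a linear model with an intercept, r^2 equals the model's R-squared
all.equal(r^2, summary(m)$r.squared)

# Scatterplot of observed vs. estimated values (drawn only in interactive sessions)
if (interactive()) {
  plot(estimated, observed, xlab = "Estimated mpg", ylab = "Observed mpg")
  abline(0, 1, lty = 2)   # points near this line indicate close agreement
}
```

This relationship is why the correlation reported for the animals validation plot (0.88) agrees with the R value of 0.882 in the Model Summary of the final Forward/Stepwise model.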

Model Validation Plot (R = 0.88)

Wine Data - Model Selection Example

Can we determine what factors affect wine quality even if we KNOW NOTHING about wine cultivation and chemistry?

Maybe!

Since we have no prior knowledge, we start with a straightforward full model with all available predictors and no interactions.
- In practice, a consultant would be working with a wine expert to carefully determine a saturated model that includes all possible interactions.

Import Wine Data

Notice that all variables are numeric (<dbl> stands for double, a numeric type that can hold decimal values).

wine <- read_csv("data/wine.csv", show_col_types = F) 
head(wine) |> kable() |> kable_styling(full_width = F)

Wine_Quality	Fixed_Acidity	Volatile_Acidity	Citric_Acidity	Residual_Sugar	Chlorides	Free_Sulphur_Dioxide	Total_Sulphur_Dioxide	Ph	Sulfate	Alcohol
5	9.3	0.48	0.29	2.1	0.127	6	16	3.22	0.72	11.2
6	9.1	0.22	0.24	2.1	0.078	1	28	3.41	0.87	10.3
7	7.9	0.34	0.36	1.9	0.065	5	10	3.27	0.54	11.2
5	7.2	1.00	0.00	3.0	0.102	7	16	3.43	0.46	10.0
7	11.9	0.43	0.66	3.1	0.109	10	23	3.15	0.85	10.4
5	7.2	0.49	0.24	2.2	0.070	5	36	3.33	0.48	9.4

Examine Correlation matrix for Multicollinearity

#  correlation matrix 
(cor_wine <- wine |> cor() |> round(2))

max(cor_wine[cor_wine < 1])

[1] 0.68

min(cor_wine)

[1] -0.7

Model Selection

We specify a full model using an easy shortcut:
- If all variables are included, you can use . instead of listing them all.
- This model specification is also used in HW 7.
Then we run three model selection procedures:
- Backward Elimination (BE)
- Forward Selection (FS)
- Stepwise Selection (SS)

wine_full <- lm(Wine_Quality ~ ., data = wine)                 # specify full model
wine_BE <- ols_step_backward_p(wine_full, progress=F)          # backward elimination  
wine_FS <- ols_step_forward_p(wine_full, progress=F)           # forward selection
wine_SS <- ols_step_both_p(wine_full, progress=F)              # stepwise selection
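
A small sketch of what the `.` shortcut does, using a subset of the built-in `mtcars` data as a stand-in (the names `cars`, `m_dot`, and `m_listed` are illustrative).

```r
# Small subset of built-in mtcars (assumption for illustration)
cars <- mtcars[, c("mpg", "wt", "hp", "qsec")]

m_dot    <- lm(mpg ~ ., data = cars)               # shortcut: all other columns
m_listed <- lm(mpg ~ wt + hp + qsec, data = cars)  # same predictors, written out

# Both specifications give the same coefficients
all.equal(coef(m_dot), coef(m_listed))
```

Because `.` picks up every column other than the response, any column that should not be a predictor (an ID or name column, for example) has to be dropped from the data first.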

Comparing Model Results

Look at the LAST step for each method to determine which method results in the best fit.
Comparison Measures:
- Adj. \(R^2\): Higher value indicates better model fit
- C(p): Lower value indicates better model fit (also referred to as Mallows’ C(p)).
- AIC: Lower value indicates better model fit (Akaike Information Criterion).
- RMSE: Lower value indicates better model fit (Root Mean Square Error).
By comparing these measures and accounting for our understanding of these procedures, we can determine that TWO of these methods arrived at the same model.

💥 Lecture 17 In-class Exercises - Q3 💥

Poll Everywhere - My User Name: penelopepoolereisenbies685

Which two model selection methods arrived at the same model for the wine data?

On the next few slides I will show pairs of stepwise summaries so you can compare them.

Backward Elimination and Forward Selection

[Stepwise summaries compared: Backward Elimination, Forward Selection]

Backward Elimination and Stepwise Selection

[Stepwise summaries compared: Backward Elimination, Stepwise Selection]

Forward Selection and Stepwise Selection

[Stepwise summaries compared: Forward Selection, Stepwise Selection]

Wine Model Validation Plot (R = 0.58)

Best Subsets

Another model selection method is ‘Best Subsets’
- Output shows ‘Best’ one variable model, ‘Best’ two variable model, ‘Best’ three variable model, etc.
Each ‘Best’ model is determined by multiple Fit Statistics.
This method then examines which of these candidates is the overall best by comparing their fit statistics.
If we are fortunate, the optimal choice from Best Subsets matches a model already selected by Backward Elimination, or Forward or Stepwise Selection.
- In this case (and HW 8) we are fortunate.
NOTE: ols_step_best_subset command is VERY slow. You do not need to rerun it. Output is provided.

Some of the Best Subsets Plots

Reading Best Subsets Output

Tabular Output

Bottom table shows which model performs best, based on all of the fit statistics.
- For example, if model 3 (Three variable model) was best, it would have the HIGHEST Adjusted \(R^2\), Lowest C(p), and Lowest AIC.
  - We can see from bottom table that Model 3 is not the best.
- Model 7 IS the best because it does have the HIGHEST Adjusted \(R^2\), Lowest C(p), and Lowest AIC.
Top table lists the variables in each of the ‘Best’ models.
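
The idea behind the two tables can be sketched in base R by fitting every subset of predictors, keeping the best model of each size, and then comparing those candidates. This is a simplified illustration, not how `ols_step_best_subset` is implemented. The built-in `mtcars` data, the four predictors, and the helper `fit_subset` are assumptions made for the example.

```r
# Built-in mtcars data stands in for the wine data (assumption for illustration)
dat        <- mtcars
response   <- "mpg"
predictors <- c("wt", "hp", "disp", "qsec")

full     <- lm(reformulate(predictors, response), data = dat)
n        <- nrow(dat)
mse_full <- sum(resid(full)^2) / df.residual(full)

# Hypothetical helper: fit one subset of predictors and return its fit statistics
fit_subset <- function(vars) {
  m   <- lm(reformulate(vars, response), data = dat)
  sse <- sum(resid(m)^2)
  p   <- length(coef(m))                    # parameters, including the intercept
  data.frame(
    size   = length(vars),
    vars   = paste(vars, collapse = " + "),
    adj_r2 = summary(m)$adj.r.squared,
    cp     = sse / mse_full - n + 2 * p,    # Mallows' C(p)
    aic    = AIC(m),
    stringsAsFactors = FALSE
  )
}

# Enumerate every subset of every size
all_fits <- do.call(rbind, lapply(seq_along(predictors), function(k) {
  do.call(rbind, combn(predictors, k, FUN = fit_subset, simplify = FALSE))
}))

# 'Best' model of each size: highest adjusted R-squared within that size
best_by_size <- do.call(rbind, lapply(split(all_fits, all_fits$size), function(d) {
  d[which.max(d$adj_r2), ]
}))
print(best_by_size, row.names = FALSE)

# Overall candidate under each criterion
best_by_size[which.max(best_by_size$adj_r2), "vars"]
best_by_size[which.min(best_by_size$cp), "vars"]
best_by_size[which.min(best_by_size$aic), "vars"]
```

With 4 predictors there are only 15 subsets, but the count doubles with each added predictor, which helps explain why best subsets is slow on larger models.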

Wine Best Subset Output

Preview of HW 8 - Part 1

Review model comparisons for Animal Data from first part of lecture.
Compare the optimal best subset model (Model 7) to the model found by both Backward Elimination and Forward Selection.
The goal is to determine to what extent they agree.
- Spoiler: They are in complete agreement, which indicates that we have consensus on the model for these data.

Reminder of Upcoming Dates

Today’s Lecture (3/17) is the third and final lecture on model and variable selection.
HW 7 is due tomorrow, Wed., 3/18.
HW 8 is now posted and is due Monday, 3/23
- Part 1 pertains to Lectures 15-17
- Part 2 pertains to Lecture 18
Quiz 2 is on Thursday, March 26th, in the classroom
Practice Questions for Quiz 2 are Posted

Key Points from this Week

Regression modeling can be overwhelming because of all of the possible options.
- Automating part of the variable selection process is helpful.
- Trying different methods and comparing results is strongly recommended.
- Results from Automated processes are preliminary models that can (and should) be tinkered with.
- Once we have a final model we can add regression estimates and residuals to the dataset.
- Methods Covered: Backward Elimination, Forward Selection, Stepwise Selection, Best Subsets
  - Compare results from multiple methods
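
Adding regression estimates and residuals to the dataset could look like the sketch below. The built-in `mtcars` data and the column names `estimate` and `residual` are illustrative assumptions.

```r
# Built-in mtcars data stands in for the lecture data (assumption for illustration)
cars  <- mtcars[, c("mpg", "wt", "hp")]
final <- lm(mpg ~ wt + hp, data = cars)

# Add regression estimates and residuals as new columns
cars$estimate <- fitted(final)
cars$residual <- resid(final)

head(cars)
```

Each observed value equals its estimate plus its residual. This direct assignment assumes the model was fit on every row of the data, so rows with missing values should be removed first, as was done for the animals data.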

To submit an Engagement Question or Comment about material from Lecture 17: Submit it by midnight today (day of lecture).