BUA 345 - Lecture 16

Introduction to Model Selection Continued

Author

Penelope Pooler Eisenbies

Published

March 5, 2025

Housekeeping

HW 6 was due 3/5/2025 - 2 day grace period

Demo videos were posted on Sunday morning

HW 7 is available and is due on Wednesday, 3/19.

Quiz 2 will be on 4/1/2025 - Date has changed and syllabus has been updated.

Today’s plan

Implementing partially automated model selection.
- Backward Elimination for Model Selection
- HW 7 Demo
- Model Selection using Multiple Methods

In-class Polling (Session ID: bua345s25)

Animals Data

Species	TotalSleep	BodyWt	LNBodyWt	BrainWt	LNBrainWt	LifeSpan	LNLifeSpan	Gestation	Predation	Exposure	Danger
Africangiantpouchedrat	8.3	1.00	0.00	6.6	1.89	4.5	1.50	42	3	1	3
Americanopossum	19.4	1.70	0.53	6.3	1.84	5.0	1.61	12	2	1	1
ArcticFox	12.5	3.39	1.22	44.5	3.80	14.0	2.64	60	1	1	1
Baboon	9.8	10.55	2.36	179.5	5.19	27.0	3.30	180	4	4	4
Bigbrownbat	19.7	0.02	-3.77	0.3	-1.20	19.0	2.94	35	1	1	1
Braziliantapir	6.2	160.00	5.08	169.0	5.13	30.4	3.41	392	4	5	4

Animals Data Dictionary - Description of Variables

Variable	Type	Description
Species	Nominal	Name of Species
TotalSleep	Quantitative	Total Sleep
BodyWt	Quantitative	Average Body Weight in kilograms
LNBodyWt	Quantitative	Natural Log of Body Weight
BrainWt	Quantitative	Average Brain Weight in grams
LNBrainWt	Quantitative	Natural Log of Brain Weight
LifeSpan	Quantitative	Maximum Life Span in years
LNLifeSpan	Quantitative	Natural Log of Life Span
Gestation	Quantitative	Gestation Time in days
Predation	Ordinal	Predation Index (1=least likely to be prey)
Exposure	Ordinal	Sleep Exposure Index (1=least exposed)
Danger	Ordinal	Overall Danger Index (1=least danger from other animals)

Lecture 16 In-class Exercises - Q1

Session ID: bua345s25

Which two ordinal categorical predictor variables appear to be multicollinear, i.e., highly correlated?

```{r matrix 3, echo=T}
animal_mat3 <- animals |> select(TotalSleep, 
                                 Predation, 
                                 Exposure, 
                                 Danger)

animal_mat3 |> cor() |> round(2)
```

           TotalSleep Predation Exposure Danger
TotalSleep       1.00     -0.48    -0.63  -0.63
Predation       -0.48      1.00     0.66   0.95
Exposure        -0.63      0.66     1.00   0.78
Danger          -0.63      0.95     0.78   1.00

Scatterplot Matrix

Visual Representation of Correlations

Backward Elimination

1. Data examination and transformations completed.
2. Create a full ‘saturated’ model with all potential predictor variables and interaction terms (this is subjective).

```{r animals full model, echo=T}
# convert ordinal variables to factors
animals <- animals |>       
  mutate(PredF = factor(Predation), 
         ExposF = factor(Exposure), 
         DangrF=factor(Danger))

# full model (subjective)
animals_full <- lm(TotalSleep ~ LNBodyWt + LNBrainWt + 
                     LNLifeSpan + Gestation + 
                     PredF + ExposF + DangrF + 
                     LNBodyWt*Gestation + LNLifeSpan*PredF + 
                     LNLifeSpan*ExposF + LNLifeSpan*DangrF, data=animals)
```

Backward Elimination Cont’d

3. Use ‘Backward Elimination’ to pare the full model down to a preliminary model.
- We cast a wide net by specifying that terms will remain in model if p-value < 0.1.

```{r animals model backward elim and output, warning=F, echo=T}
(animals_BE <- ols_step_backward_p(animals_full, p_val = 0.1, progress = F))
```

Note: model has aliased coefficients
      sums of squares computed by model comparison


                                 Stepwise Summary                                 
--------------------------------------------------------------------------------
Step    Variable                AIC        SBC       SBIC       R2       Adj. R2 
--------------------------------------------------------------------------------
 0      Full Model            240.461    299.107    39.729    0.89048    0.73714 
 1      LNLifeSpan:DangrF     232.882    283.961    40.124    0.88953    0.76946 
 2      LNBrainWt             231.276    280.464    40.494    0.88864    0.77728 
 3      LNLifeSpan:ExposF     229.366    270.986    46.531    0.87390    0.78383 
 4      LNBodyWt:Gestation    227.366    267.095    46.512    0.87390    0.79129 
 5      ExposF                227.508    259.669    54.605    0.85111    0.78343 
--------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                          
---------------------------------------------------------------
R                       0.923       RMSE                 1.743 
R-Squared               0.851       MSE                  3.038 
Adj. R-Squared          0.783       Coef. Var           19.991 
Pred R-Squared          0.660       AIC                227.508 
MAE                     1.283       SBC                259.669 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                
-------------------------------------------------------------------
               Sum of                                              
              Squares        DF    Mean Square      F         Sig. 
-------------------------------------------------------------------
Regression    851.000        15         56.733    12.576    0.0000 
Residual      148.871        33          4.511                     
Total         999.871        48                                    
-------------------------------------------------------------------

                                      Parameter Estimates                                        
------------------------------------------------------------------------------------------------
            model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
------------------------------------------------------------------------------------------------
      (Intercept)      6.324         2.635                  2.401    0.022      0.964    11.684 
         LNBodyWt     -0.813         0.202       -0.515    -4.027    0.000     -1.224    -0.402 
       LNLifeSpan      3.009         0.909        0.622     3.311    0.002      1.160     4.858 
        Gestation     -0.019         0.005       -0.424    -3.736    0.001     -0.029    -0.009 
           PredF2     14.639         3.291        1.431     4.448    0.000      7.944    21.335 
           PredF3     17.053         5.383        1.237     3.168    0.003      6.101    28.005 
           PredF4     11.414         4.830        0.884     2.363    0.024      1.587    21.241 
           PredF5      0.722         6.052        0.067     0.119    0.906    -11.592    13.035 
          DangrF2     -6.810         1.746       -0.648    -3.900    0.000    -10.363    -3.258 
          DangrF3     -8.701         3.444       -0.674    -2.527    0.016    -15.708    -1.695 
          DangrF4     -7.957         4.344       -0.651    -1.832    0.076    -16.794     0.881 
          DangrF5    -16.325         4.456       -1.265    -3.664    0.001    -25.390    -7.259 
LNLifeSpan:PredF2     -3.334         1.299       -0.794    -2.567    0.015     -5.976    -0.692 
LNLifeSpan:PredF3     -5.444         3.940       -0.615    -1.382    0.176    -13.459     2.571 
LNLifeSpan:PredF4     -1.160         1.537       -0.253    -0.755    0.456     -4.286     1.967 
LNLifeSpan:PredF5      3.357         1.868        0.887     1.797    0.081     -0.443     7.157 
------------------------------------------------------------------------------------------------
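
The Model Summary statistics above can be traced back to the ANOVA table. The sketch below recomputes them in base R from the printed sums of squares and degrees of freedom. The variable names are illustrative. The observation that the reported MSE and RMSE match dividing the residual sum of squares by n = 49, rather than by the residual DF, is inferred from the printed numbers and is an assumption about how the output was computed.

```r
# Values copied from the ANOVA table of the preliminary model
ss_reg <- 851.000; df_reg <- 15   # Regression
ss_res <- 148.871; df_res <- 33   # Residual
ss_tot <- 999.871; df_tot <- 48   # Total
n <- df_tot + 1                   # number of animals used in the fit

r2     <- ss_reg / ss_tot                                # R-Squared
adj_r2 <- 1 - (ss_res / df_res) / (ss_tot / df_tot)      # Adj. R-Squared
f_stat <- (ss_reg / df_reg) / (ss_res / df_res)          # F in the ANOVA table
mse    <- ss_res / n      # assumption: matches the MSE in the Model Summary
rmse   <- sqrt(mse)

round(c(R2 = r2, AdjR2 = adj_r2, F = f_stat, MSE = mse, RMSE = rmse), 3)
```

Note that adjusted \(R^2\) penalizes the 15 model terms, which is why it is noticeably lower than \(R^2\) here.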

Backward Elimination - Preliminary Model

Note that each category of each factor variable is shown separately, which makes the model look more complex than it is.

Backward Elimination - Next Steps

4. Examine predictors in the preliminary model to confirm they are not too highly correlated with each other.
- If the correlation between two predictors is strong, \(|R_{XY}| \geq 0.8\), then one of them should be excluded.
- Variables in preliminary model: LNBodyWt, LNLifeSpan, Gestation, PredF, DangrF, LNLifeSpan*PredF
- Recall that PredF (Predation) and DangrF (Danger) are highly correlated.
- PredF is included in an interaction term so exclude DangrF.
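
The correlation check above can be sketched in base R. This example retypes the correlation matrix printed earlier for the Animals data and flags predictor pairs at or above the 0.8 cutoff. Using the absolute value (so that strong negative correlations are also flagged) and the object names `cor_mat`, `pred`, and `high_pairs` are assumptions made for illustration.

```r
# Correlations copied from the matrix printed for the Animals data
vars <- c("TotalSleep", "Predation", "Exposure", "Danger")
cor_mat <- matrix(c( 1.00, -0.48, -0.63, -0.63,
                    -0.48,  1.00,  0.66,  0.95,
                    -0.63,  0.66,  1.00,  0.78,
                    -0.63,  0.95,  0.78,  1.00),
                  nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Drop the response (TotalSleep) so only predictors are compared
pred <- cor_mat[-1, -1]

# Look at each pair once (upper triangle) and keep pairs with |r| >= 0.8
idx <- which(abs(pred) >= 0.8 & upper.tri(pred), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(pred)[idx[, "row"]],
                         var2 = colnames(pred)[idx[, "col"]],
                         r    = pred[idx])
high_pairs
```

Only the Predation and Danger pair (r = 0.95) meets the cutoff, which is why one of the two is dropped.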

Backward Elimination - Next Steps - Cont’d

5. If the model was modified in Step 4, rerun the model through Backward Elimination (not always needed).
6. Interpret final model.
- Adjusted \(R^2\) = 0.655
- Model (next slide) looks complicated, but each animal is in only one Predation Category.
- Baseline Predation Category = 1

Backward Elimination - Animal Data Final Model

                         Model Summary                          
---------------------------------------------------------------
R                       0.857       RMSE                 2.329 
R-Squared               0.734       MSE                  5.423 
Adj. R-Squared          0.655       Coef. Var           25.223 
Pred R-Squared          0.547       AIC                247.894 
MAE                     1.857       SBC                272.488 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                              ANOVA                                
------------------------------------------------------------------
               Sum of                                             
              Squares        DF    Mean Square      F        Sig. 
------------------------------------------------------------------
Regression    734.163        11         66.742    9.294    0.0000 
Residual      265.708        37          7.181                    
Total         999.871        48                                   
------------------------------------------------------------------

                                      Parameter Estimates                                       
-----------------------------------------------------------------------------------------------
            model      Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
-----------------------------------------------------------------------------------------------
      (Intercept)     6.751         3.305                  2.043    0.048      0.054    13.448 
         LNBodyWt    -0.698         0.244       -0.442    -2.859    0.007     -1.192    -0.203 
       LNLifeSpan     2.855         1.133        0.591     2.519    0.016      0.559     5.151 
        Gestation    -0.020         0.006       -0.447    -3.285    0.002     -0.032    -0.008 
           PredF2    13.998         4.132        0.041     3.388    0.002      5.626    22.369 
           PredF3    11.883         5.514       -0.494     2.155    0.038      0.711    23.056 
           PredF4     2.654         4.102        0.021     0.647    0.522     -5.658    10.966 
           PredF5    -0.782         4.262       -0.316    -0.183    0.855     -9.418     7.855 
LNLifeSpan:PredF2    -5.367         1.478       -0.471    -3.632    0.001     -8.361    -2.373 
LNLifeSpan:PredF3    -7.390         3.141       -0.588    -2.352    0.024    -13.755    -1.025 
LNLifeSpan:PredF4    -0.941         1.356       -0.083    -0.694    0.492     -3.689     1.807 
LNLifeSpan:PredF5    -1.043         1.446       -0.091    -0.721    0.475     -3.973     1.887 
-----------------------------------------------------------------------------------------------
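
Reading the table one Predation category at a time can make the model easier to interpret. As a sketch using the rounded coefficients printed above (so the last digit may differ slightly from full-precision values), the fitted equation for the baseline category, Predation = 1, uses only the first four rows:

\[
\widehat{\text{TotalSleep}} = 6.751 - 0.698\,\text{LNBodyWt} + 2.855\,\text{LNLifeSpan} - 0.020\,\text{Gestation}
\]

For Predation = 2, the PredF2 coefficient shifts the intercept and the LNLifeSpan:PredF2 coefficient shifts the LNLifeSpan slope:

\[
\begin{aligned}
\widehat{\text{TotalSleep}} &= (6.751 + 13.998) - 0.698\,\text{LNBodyWt} + (2.855 - 5.367)\,\text{LNLifeSpan} - 0.020\,\text{Gestation} \\
&= 20.749 - 0.698\,\text{LNBodyWt} - 2.512\,\text{LNLifeSpan} - 0.020\,\text{Gestation}
\end{aligned}
\]

The other categories follow the same pattern with their own PredF and LNLifeSpan:PredF rows.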

Using Model to Find Estimates

Exporting Model and Data to Excel

This model can be used to find model estimates and residuals for all animals.
We will ALSO do these calculations in an Excel Spreadsheet to clarify how each model component contributes to the estimate.
We export the data for three species to examine how the model works.

Species	TotalSleep	LNBodyWt	LNLifeSpan	Gestation	PredF
ArcticFox	12.5	1.22	2.64	60	1
Baboon	9.8	2.36	3.30	180	4
Donkey	3.1	5.23	3.69	365	5

Using a Model to Find Estimates

Model coefficients for calculations can be extracted and exported to Excel.
We create a two-column dataset listing each model component and its beta coefficient.
That dataset is exported as a .csv file for an in-class exercise.

model_term	beta
(Intercept)	6.7512
LNBodyWt	-0.6976
LNLifeSpan	2.8550
Gestation	-0.0198
PredF2	13.9979
PredF3	11.8834
PredF4	2.6536
PredF5	-0.7817
LNLifeSpan:PredF2	-5.3668
LNLifeSpan:PredF3	-7.3900
LNLifeSpan:PredF4	-0.9409
LNLifeSpan:PredF5	-1.0427

Lecture 16 In-class Exercises - Q2-Q3

Session ID: bua345s25

Use the provided worksheet to answer these questions:

Question 2. What is the regression estimate of total sleep for ‘Donkey’?

Question 3. What is the regression estimate of total sleep for ‘Arctic Fox’ (ArcticFox)?

At Home Practice:

Complete the worksheet for ‘Baboon’ at home.
At least one question on Quiz 2 may include an Excel Worksheet like this where you have to correctly do the calculation using the model and x values from the data.
You can use R, but code to add estimates to dataset will not be provided.
This exercise is about understanding the model estimation process.

Using a Model to Find Estimates in R

Model estimates can be calculated in R.
Excel Worksheet is used to demonstrate how those estimates are calculated.
You may see an estimate question based on a complex model on Quiz 2.

Species	TotalSleep	Est_TotalSleep	Resid	LNBodyWt	LNLifeSpan	Gestation	PredF
Africangiantpouchedrat	8.3	11.00	-2.70	0.00	1.50	42	3
Americanopossum	19.4	16.10	3.30	0.53	1.61	12	2
ArcticFox	12.5	12.25	0.25	1.22	2.64	60	1
Baboon	9.8	10.51	-0.71	2.36	3.30	180	4
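
The estimates in the table above can be reproduced term by term. The sketch below does this in base R for two species that are not part of the in-class or at-home worksheet, using the exported (rounded) coefficients and the rounded x values shown in the table. The vector name `beta` and the result names are illustrative, and results are only expected to agree with the table to about two decimals because of rounding.

```r
# Coefficients copied from the exported model_term / beta table
beta <- c("(Intercept)" = 6.7512, LNBodyWt = -0.6976, LNLifeSpan = 2.8550,
          Gestation = -0.0198,
          PredF2 = 13.9979, "LNLifeSpan:PredF2" = -5.3668,
          PredF3 = 11.8834, "LNLifeSpan:PredF3" = -7.3900)

# Part of the estimate shared by every Predation category
base_est <- function(ln_body, ln_life, gest) {
  beta[["(Intercept)"]] + beta[["LNBodyWt"]] * ln_body +
    beta[["LNLifeSpan"]] * ln_life + beta[["Gestation"]] * gest
}

# American opossum: Predation = 2, so add PredF2 and the LNLifeSpan:PredF2 term
opossum <- base_est(0.53, 1.61, 12) +
  beta[["PredF2"]] + beta[["LNLifeSpan:PredF2"]] * 1.61

# African giant pouched rat: Predation = 3
rat <- base_est(0.00, 1.50, 42) +
  beta[["PredF3"]] + beta[["LNLifeSpan:PredF3"]] * 1.50

est   <- round(c(Americanopossum = opossum, Africangiantpouchedrat = rat), 2)
resid <- c(19.4, 8.3) - est   # residual = observed TotalSleep - estimate
rbind(est, resid)
```

An animal in the baseline category (Predation = 1) uses `base_est()` alone, with no PredF terms added.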

Model Validation

How good is our model?
There are many ways to examine model fit.
Here are two straightforward ways:
- Check correlation between observed and estimated values
- Plot a scatterplot of observed and estimated values
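
Both checks can be sketched in a few lines of base R. The example below uses only the four species shown in the estimates table, so it illustrates the mechanics and will not reproduce the correlation for the full dataset. The vector names `obs` and `est` are illustrative.

```r
# Observed and estimated TotalSleep for the four species in the table
obs <- c(8.3, 19.4, 12.5, 9.8)
est <- c(11.00, 16.10, 12.25, 10.51)

# Check 1: correlation between observed and estimated values
r <- cor(obs, est)
r

# Check 2: scatterplot of observed vs. estimated values.
# Points close to the dashed 45-degree line are well estimated.
plot(est, obs,
     xlab = "Estimated Total Sleep", ylab = "Observed Total Sleep",
     main = "Observed vs. Estimated (4 species only)")
abline(a = 0, b = 1, lty = 2)
```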

Model Validation Plot (R = 0.86)

HW 7 Demo - Questions 1 - 11

Demo videos will be posted over break.
Read instructions in R project which correspond to Blackboard HW Assignment 7.
Run the Setup and import and examine the data.
Examine the correlation matrix of the X variables.
- Remove # symbols before incomplete R code and replace blanks (____) with correct commands to calculate correlation matrix with values rounded to 2 decimal places.
- Run the line or the whole code chunk to view the correlation matrix, which is large.
  - Helpful tip: On the ‘Visual’ tab of the R Markdown options, change Editor content width to 1500.
- Remove # from the two lines of code at the bottom and run these lines to find largest positive and negative correlations in the matrix.
Answer Questions 1 - 2 based on the correlation matrix and min/max output.

HW 7 Demo - Questions 3 - 6

Run the next chunk of code to specify the full model and do Backward Elimination:
- Create the full model with all variables and no interactions.
- Run the Backward Elimination.
Answer Questions 3 - 6 based on the Backward Elimination model output.

HW 7 Demo - Questions 7 - 11

Run the next code chunk to save the final model as final_ames_model.
Complete the code in the following chunk, Import New Data and Add Predictions, then run it to add model estimates and residuals to a small new dataset of two houses.
- It is helpful to run the lines in this code block one at a time.
- Run the first command that begins new_houses <- read_csv(... to import a small new dataset with 2 observations.
- Run the command that begins (new_houses <- new_houses |> mutate(Est_Price... to add Est_Price, the regression estimates, to this dataset.

HW 7 Demo - Questions 7 - 11 Continued

Remove # before the following three lines to complete them:
- #(new_houses <- new_houses |>
- # mutate(Resid = ____ - ____ |> round()) |>
- # relocate(Est_Price, Resid, .after=Price))
In the line with the blanks you are calculating residuals as
- Price minus Estimated Price (Resid = Price - Est_Price)
- The next line relocates Est_Price and Resid to the left side of the dataset, after Price.
Answer Questions 7 - 11 based on this output.

Model Selection Methods

Recall that in Multiple Linear Regression (MLR) the goal is to choose the simplest, most accurate model, i.e., the ‘BEST’ set of independent variables.
How do we decide which variables should be in our model?
There are many methods:
We’ve discussed Backward Elimination which can also be done manually in any software (not recommended).

Description of Other Model Selection Methods

Backward Elimination: This procedure starts with all potential terms (including potential interaction terms) in the model and removes the least significant term at each step.
- This is referred to as starting with a full or saturated model.
Forward Selection: By default, this procedure starts with an empty model and adds the most significant term at each step until there are no more useful terms to add.
- Forward selection also needs to know what terms are in the full model.
Stepwise Selection: By default, this procedure starts with an empty model and then adds or removes a term at each step.
Common Practice: Try multiple methods to develop preliminary final model and then tweak as needed.
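
To make the Backward Elimination description concrete, here is a minimal base R sketch of the idea. It is not the olsrr implementation used in this lecture. The function name `backward_p`, the argument `p_remove`, and the use of R's built-in `mtcars` data are all assumptions chosen so the example runs on its own.

```r
# Backward elimination by p-value: repeatedly drop the least significant
# term until every remaining term has a p-value below p_remove.
backward_p <- function(model, p_remove = 0.1) {
  repeat {
    # Stop if only the intercept is left
    if (length(attr(terms(model), "term.labels")) == 0) break

    tests <- drop1(model, test = "F")    # F-test for dropping each term
    pvals <- tests[["Pr(>F)"]][-1]       # first row is "<none>"
    labs  <- rownames(tests)[-1]

    # All remaining terms are significant enough: done
    if (max(pvals) < p_remove) break

    # Otherwise remove the term with the largest p-value and refit
    worst <- labs[which.max(pvals)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Illustrative full model on a built-in dataset (not course data)
full  <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
final <- backward_p(full, p_remove = 0.1)
formula(final)
```

Forward Selection reverses the loop: start from the intercept-only model and add the most significant candidate term at each step.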

Notes about Model Selection

Using Multiple Methods

The steps for other methods are similar to the steps for Backward Elimination.
Not all steps are ALWAYS required. It depends on how complex the data are.
In the following example, we only need to do part of Step 1 plus Steps 2, 3, and 6.
- For Step 1, we only need to examine correlations.
- In this case, Step 7 will be apparent.
- We can add model estimates to data for future interpretation (Step 8)

Steps for Model Selection Using Multiple Methods

1. Examine Matrix of Scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and response variable.

Also look at correlation matrix to check if there are pairs of variables to be concerned about.

2. Create a ‘saturated’ model with all potential predictor variables and interaction terms (Subjective!).
3. Use Backward Elimination, Forward Selection, and Stepwise Selection to find preliminary candidate models. (These are automated procedures!)

Carefully examine results to see where these candidate models agree and disagree.

Steps for Model Selection Cont’d

4. Examine predictors in preliminary candidate models to confirm they are not too highly correlated with each other.

If two predictor variables in any model have a correlation of 0.8 or greater in absolute value, drop one of them.

5. Rerun model selection methods if a candidate model is substantially changed (not always needed).
6. Compare model fit statistics for the final candidate models from all three methods.
7. Decide on final candidate and make final modifications, if needed.
8. Interpret final model.

Wine Data - Model Selection Example

Can we determine what factors affect wine quality even if we KNOW NOTHING about wine cultivation and chemistry?

Maybe!

Since we have no prior knowledge, we start with a straightforward full model with all available predictors and no interactions.
- In practice, a consultant would be working with a wine expert to carefully determine a saturated model that includes all possible interactions.

Import Wine Data

Notice that all variables are numeric (<dbl> stands for double, R's type for decimal values).

```{r import and examine data, echo=T}
wine <- read_csv("data/wine.csv", show_col_types = F) 
head(wine) |> kable()
```

Wine_Quality	Fixed_Acidity	Volatile_Acidity	Citric_Acidity	Residual_Sugar	Chlorides	Free_Sulphur_Dioxide	Total_Sulphur_Dioxide	Ph	Sulfate	Alcohol
5	9.3	0.48	0.29	2.1	0.127	6	16	3.22	0.72	11.2
6	9.1	0.22	0.24	2.1	0.078	1	28	3.41	0.87	10.3
7	7.9	0.34	0.36	1.9	0.065	5	10	3.27	0.54	11.2
5	7.2	1.00	0.00	3.0	0.102	7	16	3.43	0.46	10.0
7	11.9	0.43	0.66	3.1	0.109	10	23	3.15	0.85	10.4
5	7.2	0.49	0.24	2.2	0.070	5	36	3.33	0.48	9.4

Examine Correlation matrix for Multicollinearity

```{r examine wine correlation matrix, echo=T}
(cor_wine <- wine |> cor() |> round(2)) #  correlation matrix 

max(cor_wine[cor_wine < 1])
min(cor_wine)
```

                      Wine_Quality Fixed_Acidity Volatile_Acidity
Wine_Quality                  1.00          0.11            -0.39
Fixed_Acidity                 0.11          1.00            -0.23
Volatile_Acidity             -0.39         -0.23             1.00
Citric_Acidity                0.22          0.68            -0.52
Residual_Sugar                0.04          0.20            -0.01
Chlorides                    -0.10          0.12             0.04
Free_Sulphur_Dioxide          0.01         -0.18            -0.05
Total_Sulphur_Dioxide        -0.08         -0.13             0.05
Ph                           -0.06         -0.70             0.19
Sulfate                       0.21          0.19            -0.24
Alcohol                       0.45         -0.08            -0.17
                      Citric_Acidity Residual_Sugar Chlorides
Wine_Quality                    0.22           0.04     -0.10
Fixed_Acidity                   0.68           0.20      0.12
Volatile_Acidity               -0.52          -0.01      0.04
Citric_Acidity                  1.00           0.16      0.21
Residual_Sugar                  0.16           1.00      0.05
Chlorides                       0.21           0.05      1.00
Free_Sulphur_Dioxide           -0.07           0.18     -0.04
Total_Sulphur_Dioxide           0.06           0.18      0.00
Ph                             -0.55          -0.14     -0.26
Sulfate                         0.27          -0.01      0.35
Alcohol                         0.10           0.07     -0.21
                      Free_Sulphur_Dioxide Total_Sulphur_Dioxide    Ph Sulfate
Wine_Quality                          0.01                 -0.08 -0.06    0.21
Fixed_Acidity                        -0.18                 -0.13 -0.70    0.19
Volatile_Acidity                     -0.05                  0.05  0.19   -0.24
Citric_Acidity                       -0.07                  0.06 -0.55    0.27
Residual_Sugar                        0.18                  0.18 -0.14   -0.01
Chlorides                            -0.04                  0.00 -0.26    0.35
Free_Sulphur_Dioxide                  1.00                  0.65  0.08    0.00
Total_Sulphur_Dioxide                 0.65                  1.00 -0.07    0.08
Ph                                    0.08                 -0.07  1.00   -0.24
Sulfate                               0.00                  0.08 -0.24    1.00
Alcohol                              -0.03                 -0.08  0.21    0.05
                      Alcohol
Wine_Quality             0.45
Fixed_Acidity           -0.08
Volatile_Acidity        -0.17
Citric_Acidity           0.10
Residual_Sugar           0.07
Chlorides               -0.21
Free_Sulphur_Dioxide    -0.03
Total_Sulphur_Dioxide   -0.08
Ph                       0.21
Sulfate                  0.05
Alcohol                  1.00
[1] 0.68
[1] -0.7

Model Selection

We specify a full model using an easy shortcut:
- If all variables are included, you can use . instead of listing them all.
- This model specification is also used in HW 7.
Then we do three model selection procedures:
- Backward Elimination (BE)
- Forward Selection (FS)
- Stepwise Selection (SS)

```{r specify full model, echo=T}
wine_full <- lm(Wine_Quality ~ ., data = wine)                            # specify full model
wine_BE <- ols_step_backward_p(wine_full, progress=F, p_val=0.1)          # backward elimination  
wine_FS <- ols_step_forward_p(wine_full, progress=F, p_val=0.1)           # forward selection
wine_SS <- ols_step_both_p(wine_full, progress=F, p_val=0.1)              # stepwise selection
```
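
As a quick illustration of the `.` shortcut used above, the sketch below fits the same model two ways on a tiny made-up data frame and confirms the coefficients match. The data frame `toy` and its columns `y`, `a`, and `b` are hypothetical and exist only for this example.

```r
# Hypothetical toy data: one response (y) and two predictors (a, b)
toy <- data.frame(y = c(1, 2, 3, 4, 5),
                  a = c(2, 1, 4, 3, 5),
                  b = c(5, 3, 4, 1, 2))

m_dot <- lm(y ~ ., data = toy)       # '.' = every column except the response
m_all <- lm(y ~ a + b, data = toy)   # same model, predictors listed by name

attr(terms(m_dot), "term.labels")    # terms that '.' expanded to
all.equal(coef(m_dot), coef(m_all))  # TRUE: the two fits are identical
```

Because `.` uses every other column, any identifier or text column must be removed from the data first.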

Comparing Model Results

Look at the LAST step for each method to determine which method results in the best fit.
Comparison Measures:
- Adj. \(R^2\): Higher value indicates better model fit
- C(p): Lower value indicates better model fit (also referred to as Mallows’ C(p)).
- AIC: Lower value indicates better model fit (Akaike Information Criterion).
- RMSE: Lower value indicates better model fit (Root Mean Square Error).
By comparing these measures and accounting for our understanding of these procedures, we can determine that TWO of these methods arrived at the same model.
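
The four comparison measures can also be computed by hand for any pair of candidate models. The base R sketch below is an illustration under stated assumptions: the helper `fit_stats`, the built-in `mtcars` data, and the two example models are not from the lecture, the C(p) line uses the standard formula SSE / (full-model mean square error) - n + 2p, and RMSE divides by n. Values from `AIC()` here may not match every package's definition.

```r
# Fit statistics for a candidate model, relative to a full model
fit_stats <- function(model, full) {
  n   <- nobs(model)
  p   <- length(coef(model))                  # parameters, incl. intercept
  sse <- sum(residuals(model)^2)              # residual sum of squares
  s2  <- sum(residuals(full)^2) / df.residual(full)  # full-model error variance
  c(adj_r2 = summary(model)$adj.r.squared,    # higher is better
    cp     = sse / s2 - n + 2 * p,            # lower is better
    aic    = AIC(model),                      # lower is better
    rmse   = sqrt(sse / n))                   # lower is better
}

# Illustrative models on a built-in dataset (not the wine data)
m_full  <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
m_small <- lm(mpg ~ wt + hp, data = mtcars)

round(rbind(full  = fit_stats(m_full,  m_full),
            small = fit_stats(m_small, m_full)), 3)
```

For the full model, C(p) always equals its own number of parameters, so the measure is only informative for the smaller candidates.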

Lecture 16 In-class Exercises - Q4

Session ID: bua345s25

Which two model selection methods arrived at the same model for the wine data?

On the next few slides I will show pairs of stepwise summaries so you can compare them.

Backward Elimination and Forward Selection

Backward Elimination

Forward Selection

Backward Elimination and Stepwise Selection

Backward Elimination

Stepwise Selection

Forward Selection and Stepwise Selection

Forward Selection

Stepwise Selection

Wine Model Validation Plot (R = 0.58)

Key Points from this Week

Regression modeling can be overwhelming
- Automating part of the variable selection process is helpful.
- Try different methods and compare results.
- Results from automated processes are preliminary.
- Model estimates and residuals can be added to dataset.
  - Demonstrated in HW 7.
HW 6 due on Wed. 3/5 (Grace Period extended until 3/7).
HW 7 is posted and is due on Wed. 3/19
Date of Quiz 2 has been changed to Tuesday, 4/1.

To submit an Engagement Question or Comment about material from Lecture 16: Submit it by midnight today (day of lecture).