Species | TotalSleep | BodyWt | LNBodyWt | BrainWt | LNBrainWt | LifeSpan | LNLifeSpan | Gestation | Predation | Exposure | Danger |
---|---|---|---|---|---|---|---|---|---|---|---|
Africangiantpouchedrat | 8.3 | 1.00 | 0.00 | 6.6 | 1.89 | 4.5 | 1.50 | 42 | 3 | 1 | 3 |
Americanopossum | 19.4 | 1.70 | 0.53 | 6.3 | 1.84 | 5.0 | 1.61 | 12 | 2 | 1 | 1 |
ArcticFox | 12.5 | 3.39 | 1.22 | 44.5 | 3.80 | 14.0 | 2.64 | 60 | 1 | 1 | 1 |
Baboon | 9.8 | 10.55 | 2.36 | 179.5 | 5.19 | 27.0 | 3.30 | 180 | 4 | 4 | 4 |
Bigbrownbat | 19.7 | 0.02 | -3.77 | 0.3 | -1.20 | 19.0 | 2.94 | 35 | 1 | 1 | 1 |
Braziliantapir | 6.2 | 160.00 | 5.08 | 169.0 | 5.13 | 30.4 | 3.41 | 392 | 4 | 5 | 4 |
BUA 345 - Lecture 16
Introduction to Model Selection Continued
Housekeeping
HW 6 was due 3/5/2025 - 2 day grace period
- Demo videos were posted on Sunday morning
HW 7 is available and is due on Wednesday, 3/19.
Quiz 2 will be on 4/1/2025 - Date has changed and syllabus has been updated.
Today’s plan
Implementing partially automated model selection.
Backward Elimination for Model Selection
HW 7 Demo
Model Selection using Multiple Methods
In-class Polling (Session ID: bua345s25)
Animals Data
Animals Data Dictionary - Description of Variables
Variable | Type | Description |
---|---|---|
Species | Nominal | Name of Species |
TotalSleep | Quantitative | Total Sleep |
BodyWt | Quantitative | Average Body Weight in kilograms |
LNBodyWt | Quantitative | Natural Log of Body Weight |
BrainWt | Quantitative | Average Brain Weight in grams |
LNBrainWt | Quantitative | Natural Log of Brain Weight |
LifeSpan | Quantitative | Maximum Life Span in years |
LNLifeSpan | Quantitative | Natural Log of Life Span |
Gestation | Quantitative | Gestation Time in days |
Predation | Ordinal | Predation Index (1=least likely to be prey) |
Exposure | Ordinal | Sleep Exposure Index (1=least exposed) |
Danger | Ordinal | Overall Danger Index (1=least danger from other animals) |
Lecture 16 In-class Exercises - Q1
Session ID: bua345s25
Which two ordinal categorical predictor variables appear to be multicollinear, i.e., highly correlated?
Scatterplot Matrix
Visual Representation of Correlations
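One way to check the ordinal predictors numerically is a rank correlation matrix. The sketch below uses only the six preview rows shown at the top of these notes, and Spearman correlation; both are choices made here for illustration, so the values will differ from those computed on the full dataset.

```r
# Ordinal indices for the six preview species (copied from the data preview)
ordinal <- data.frame(
  Predation = c(3, 2, 1, 4, 1, 4),
  Exposure  = c(1, 1, 1, 4, 1, 5),
  Danger    = c(3, 1, 1, 4, 1, 4)
)

# Spearman (rank) correlation is an assumption here; it suits ordered categories
ord_cor <- round(cor(ordinal, method = "spearman"), 2)
ord_cor
```

The pair with the largest off-diagonal value is the one to watch for multicollinearity.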
Backward Elimination
Data examination and transformations completed
Create a full ‘saturated’ model with all potential predictor variables and interaction terms (This is subjective).
```{r animals full model, echo=T}
library(dplyr)  # provides mutate(); assumes `animals` is already imported

# convert ordinal variables to factors
animals <- animals |>
  mutate(PredF  = factor(Predation),
         ExposF = factor(Exposure),
         DangrF = factor(Danger))

# full model (subjective); in a formula, a*b expands to a + b + a:b
animals_full <- lm(TotalSleep ~ LNBodyWt + LNBrainWt +
                     LNLifeSpan + Gestation +
                     PredF + ExposF + DangrF +
                     LNBodyWt*Gestation + LNLifeSpan*PredF +
                     LNLifeSpan*ExposF + LNLifeSpan*DangrF, data = animals)
```
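To see which terms a formula with `*` actually contains, base R's `terms()` can list them without fitting anything. This is a small sketch on a shortened version of the formula above; `short_formula` and `term_labels` are names chosen here for illustration.

```r
# A shortened formula with two `*` terms (no data needed to inspect it)
short_formula <- TotalSleep ~ LNBodyWt * Gestation + LNLifeSpan * PredF

# Each a*b contributes the main effects a and b plus the interaction a:b
term_labels <- attr(terms(short_formula), "term.labels")
term_labels
```

Because `*` already adds the main effects, listing a variable both on its own and inside a `*` term does not duplicate it in the model.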
Backward Elimination Cont’d
Use ‘Backward Elimination’ to pare full model down to a preliminary model.
- We cast a wide net by specifying that terms will remain in model if p-value < 0.1.
Note: model has aliased coefficients
sums of squares computed by model comparison
Stepwise Summary
--------------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
--------------------------------------------------------------------------------
0 Full Model 240.461 299.107 39.729 0.89048 0.73714
1 LNLifeSpan:DangrF 232.882 283.961 40.124 0.88953 0.76946
2 LNBrainWt 231.276 280.464 40.494 0.88864 0.77728
3 LNLifeSpan:ExposF 229.366 270.986 46.531 0.87390 0.78383
4 LNBodyWt:Gestation 227.366 267.095 46.512 0.87390 0.79129
5 ExposF 227.508 259.669 54.605 0.85111 0.78343
--------------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
---------------------------------------------------------------
R 0.923 RMSE 1.743
R-Squared 0.851 MSE 3.038
Adj. R-Squared 0.783 Coef. Var 19.991
Pred R-Squared 0.660 AIC 227.508
MAE 1.283 SBC 259.669
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
-------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-------------------------------------------------------------------
Regression 851.000 15 56.733 12.576 0.0000
Residual 148.871 33 4.511
Total 999.871 48
-------------------------------------------------------------------
Parameter Estimates
------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
------------------------------------------------------------------------------------------------
(Intercept) 6.324 2.635 2.401 0.022 0.964 11.684
LNBodyWt -0.813 0.202 -0.515 -4.027 0.000 -1.224 -0.402
LNLifeSpan 3.009 0.909 0.622 3.311 0.002 1.160 4.858
Gestation -0.019 0.005 -0.424 -3.736 0.001 -0.029 -0.009
PredF2 14.639 3.291 1.431 4.448 0.000 7.944 21.335
PredF3 17.053 5.383 1.237 3.168 0.003 6.101 28.005
PredF4 11.414 4.830 0.884 2.363 0.024 1.587 21.241
PredF5 0.722 6.052 0.067 0.119 0.906 -11.592 13.035
DangrF2 -6.810 1.746 -0.648 -3.900 0.000 -10.363 -3.258
DangrF3 -8.701 3.444 -0.674 -2.527 0.016 -15.708 -1.695
DangrF4 -7.957 4.344 -0.651 -1.832 0.076 -16.794 0.881
DangrF5 -16.325 4.456 -1.265 -3.664 0.001 -25.390 -7.259
LNLifeSpan:PredF2 -3.334 1.299 -0.794 -2.567 0.015 -5.976 -0.692
LNLifeSpan:PredF3 -5.444 3.940 -0.615 -1.382 0.176 -13.459 2.571
LNLifeSpan:PredF4 -1.160 1.537 -0.253 -0.755 0.456 -4.286 1.967
LNLifeSpan:PredF5 3.357 1.868 0.887 1.797 0.081 -0.443 7.157
------------------------------------------------------------------------------------------------
Backward Elimination - Preliminary Model
- Note that each category of each factor variable is shown, making the model look more complex than it is.
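The logic of backward elimination can be sketched in a few lines of base R. This is not the procedure that produced the output above; it is a simplified stand-in that uses `drop1()` F-tests on the built-in `mtcars` data, and the function name `backward_p` and its `p_remove` argument are hypothetical.

```r
# Backward elimination by p-value, sketched with base R only
backward_p <- function(model, p_remove = 0.1) {
  repeat {
    # stop if only the intercept is left
    if (length(attr(terms(model), "term.labels")) == 0) break
    tests <- drop1(model, test = "F")      # F-test for removing each droppable term
    pvals <- tests[["Pr(>F)"]][-1]         # first row is "<none>", so skip it
    candidates <- rownames(tests)[-1]
    if (max(pvals) < p_remove) break       # every remaining term meets the cutoff
    worst <- candidates[which.max(pvals)]  # least significant term
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Stand-in example: mtcars replaces the animals data
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- backward_p(full, p_remove = 0.1)
formula(reduced)
```

`drop1()` only offers terms that can be removed without breaking an interaction, which is why a main effect stays in the model while its interaction term is still present.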
Backward Elimination - Next Steps
Examine predictors in preliminary model to confirm they are not too highly correlated with each other.
If correlation for two variables, \(R_{XY} \geq 0.8\), then one variable should be excluded.
Variables in preliminary model: `LNBodyWt`, `LNLifeSpan`, `Gestation`, `PredF`, `DangrF`, `LNLifeSpan*PredF`
Recall that `PredF` (Predation) and `DangrF` (Danger) are highly correlated. `PredF` is included in an interaction term, so exclude `DangrF`.
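The 0.8 screening rule can be applied to a whole set of predictors at once. The sketch below is one possible helper, not code from the course: `high_cor_pairs` and the toy data frame are hypothetical, and comparing the absolute value of the correlation to the cutoff (so that strong negative correlations are also flagged) is an assumption.

```r
# List predictor pairs whose correlation magnitude meets a cutoff
high_cor_pairs <- function(df, cutoff = 0.8) {
  r <- cor(df)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(r) >= cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(var1 = rownames(r)[idx[, "row"]],
             var2 = colnames(r)[idx[, "col"]],
             r    = round(r[idx], 2))
}

# Toy data: y is nearly a multiple of x, z is only weakly related to either
toy <- data.frame(
  x = 1:10,
  y = 2 * (1:10) + rep(c(0.5, -0.5), 5),
  z = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)
flagged <- high_cor_pairs(toy, cutoff = 0.8)
flagged
```

For each flagged pair, one of the two variables would be dropped before rerunning the model.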
Backward Elimination - Next Steps - Cont’d
If model was modified in Step 4, rerun model through Backward Elimination (not always needed).
Interpret final model.
- Adjusted \(R^2\) = 0.655
- Model (next slide) looks complicated, but each animal is in only one Predation Category.
- Baseline Predation Category = 1
Backward Elimination - Animal Data Final Model
Model Summary
---------------------------------------------------------------
R 0.857 RMSE 2.329
R-Squared 0.734 MSE 5.423
Adj. R-Squared 0.655 Coef. Var 25.223
Pred R-Squared 0.547 AIC 247.894
MAE 1.857 SBC 272.488
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------
Regression 734.163 11 66.742 9.294 0.0000
Residual 265.708 37 7.181
Total 999.871 48
------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------
(Intercept) 6.751 3.305 2.043 0.048 0.054 13.448
LNBodyWt -0.698 0.244 -0.442 -2.859 0.007 -1.192 -0.203
LNLifeSpan 2.855 1.133 0.591 2.519 0.016 0.559 5.151
Gestation -0.020 0.006 -0.447 -3.285 0.002 -0.032 -0.008
PredF2 13.998 4.132 0.041 3.388 0.002 5.626 22.369
PredF3 11.883 5.514 -0.494 2.155 0.038 0.711 23.056
PredF4 2.654 4.102 0.021 0.647 0.522 -5.658 10.966
PredF5 -0.782 4.262 -0.316 -0.183 0.855 -9.418 7.855
LNLifeSpan:PredF2 -5.367 1.478 -0.471 -3.632 0.001 -8.361 -2.373
LNLifeSpan:PredF3 -7.390 3.141 -0.588 -2.352 0.024 -13.755 -1.025
LNLifeSpan:PredF4 -0.941 1.356 -0.083 -0.694 0.492 -3.689 1.807
LNLifeSpan:PredF5 -1.043 1.446 -0.091 -0.721 0.475 -3.973 1.887
-----------------------------------------------------------------------------------------------
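Because `PredF` appears both as a main effect and in the interaction with `LNLifeSpan`, each Predation category has its own intercept and its own `LNLifeSpan` slope. As a sketch using the rounded coefficients in the table above (the hat notation is introduced here and is not from the slides), the baseline category and category 2 reduce to:

\[
\begin{aligned}
\text{Predation}=1:\quad \widehat{\text{TotalSleep}} &= 6.751 - 0.698\,\text{LNBodyWt} + 2.855\,\text{LNLifeSpan} - 0.020\,\text{Gestation} \\
\text{Predation}=2:\quad \widehat{\text{TotalSleep}} &= (6.751 + 13.998) - 0.698\,\text{LNBodyWt} + (2.855 - 5.367)\,\text{LNLifeSpan} - 0.020\,\text{Gestation} \\
&= 20.749 - 0.698\,\text{LNBodyWt} - 2.512\,\text{LNLifeSpan} - 0.020\,\text{Gestation}
\end{aligned}
\]

The same substitution with the `PredF3`, `PredF4`, or `PredF5` rows gives the equation for each of the other categories.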
Using Model to Find Estimates
Exporting Model and Data to Excel
This model can be used to find model estimates and residuals for all animals.
We will ALSO do these calculations in an Excel Spreadsheet to clarify each model component in estimate.
We export the data for three species to examine how the model works:
Species | TotalSleep | LNBodyWt | LNLifeSpan | Gestation | PredF |
---|---|---|---|---|---|
ArcticFox | 12.5 | 1.22 | 2.64 | 60 | 1 |
Baboon | 9.8 | 2.36 | 3.30 | 180 | 4 |
Donkey | 3.1 | 5.23 | 3.69 | 365 | 5 |
Using a Model to Find Estimates
Model coefficients for calculations can be extracted and exported to Excel.
We create a two-column dataset listing each model component and its beta coefficient.
That dataset is exported as a .csv file for an in-class exercise.
model_term | beta |
---|---|
(Intercept) | 6.7512 |
LNBodyWt | -0.6976 |
LNLifeSpan | 2.8550 |
Gestation | -0.0198 |
PredF2 | 13.9979 |
PredF3 | 11.8834 |
PredF4 | 2.6536 |
PredF5 | -0.7817 |
LNLifeSpan:PredF2 | -5.3668 |
LNLifeSpan:PredF3 | -7.3900 |
LNLifeSpan:PredF4 | -0.9409 |
LNLifeSpan:PredF5 | -1.0427 |
Lecture 16 In-class Exercises - Q2-Q3
Session ID: bua345s25
Use the provided worksheet to answer these questions:
Question 2. What is the regression estimate of total sleep for ‘Donkey’?
Question 3. What is the regression estimate of total sleep for ‘Arctic Fox’ (`ArcticFox`)?
At Home Practice:
Complete the worksheet for ‘Baboon’ at home.
At least one question on Quiz 2 may include an Excel Worksheet like this where you have to correctly do the calculation using the model and x values from the data.
You can use R, but code to add estimates to dataset will not be provided.
This exercise is about understanding the model estimation process.
Using a Model to Find Estimates in R
Model estimates can be calculated in R.
Excel Worksheet is used to demonstrate how those estimates are calculated.
You may see an estimate question based on a complex model on Quiz 2.
Species | TotalSleep | Est_TotalSleep | Resid | LNBodyWt | LNLifeSpan | Gestation | PredF |
---|---|---|---|---|---|---|---|
Africangiantpouchedrat | 8.3 | 11.00 | -2.70 | 0.00 | 1.50 | 42 | 3 |
Americanopossum | 19.4 | 16.10 | 3.30 | 0.53 | 1.61 | 12 | 2 |
ArcticFox | 12.5 | 12.25 | 0.25 | 1.22 | 2.64 | 60 | 1 |
Baboon | 9.8 | 10.51 | -0.71 | 2.36 | 3.30 | 180 | 4 |
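The estimates in this table can be reproduced by hand from the exported beta coefficients. The following is a minimal sketch of that arithmetic; the helper `estimate_sleep` and its argument names are hypothetical, and the inputs are the rounded values shown in the tables above.

```r
# Beta coefficients copied from the exported model_term / beta table
beta <- c("(Intercept)" = 6.7512, "LNBodyWt" = -0.6976,
          "LNLifeSpan" = 2.8550, "Gestation" = -0.0198,
          "PredF2" = 13.9979, "PredF3" = 11.8834,
          "PredF4" = 2.6536, "PredF5" = -0.7817,
          "LNLifeSpan:PredF2" = -5.3668, "LNLifeSpan:PredF3" = -7.3900,
          "LNLifeSpan:PredF4" = -0.9409, "LNLifeSpan:PredF5" = -1.0427)

estimate_sleep <- function(LNBodyWt, LNLifeSpan, Gestation, Pred) {
  # terms shared by every animal
  est <- beta[["(Intercept)"]] +
    beta[["LNBodyWt"]] * LNBodyWt +
    beta[["LNLifeSpan"]] * LNLifeSpan +
    beta[["Gestation"]] * Gestation
  # Predation category 1 is the baseline, so it adds nothing;
  # other categories add an intercept shift and a LNLifeSpan slope shift
  if (Pred > 1) {
    est <- est + beta[[paste0("PredF", Pred)]] +
      beta[[paste0("LNLifeSpan:PredF", Pred)]] * LNLifeSpan
  }
  est
}

rat_est    <- round(estimate_sleep(0.00, 1.50, 42, 3), 2)   # 11.00, as in the table
baboon_est <- round(estimate_sleep(2.36, 3.30, 180, 4), 2)  # 10.51, as in the table
baboon_resid <- round(9.8 - baboon_est, 2)                  # observed minus estimate: -0.71
```

Only one `PredF` term and one interaction term are ever added for a given animal, which is why the long coefficient list is simpler to use than it looks.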
Model Validation
How good is our model?
There are many ways to examine model fit.
Here are two straightforward ways:
- Check correlation between observed and estimated values
- Plot a scatterplot of observed and estimated values
Model Validation Plot (R = 0.86)
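Both checks take only a few lines in R. The sketch below uses the built-in `mtcars` data and a small model as a stand-in for the animals model; the object names are chosen here for illustration.

```r
# Stand-in model: mtcars replaces the animals data
fit <- lm(mpg ~ wt + hp, data = mtcars)

observed  <- mtcars$mpg
estimated <- fitted(fit)

# Check 1: correlation between observed and estimated values
r_obs_est <- cor(observed, estimated)

# Check 2: scatterplot; points near the 45-degree line indicate good estimates
plot(estimated, observed,
     xlab = "Estimated value", ylab = "Observed value")
abline(a = 0, b = 1)
```

For a linear model with an intercept, this correlation squared equals the model's R-squared, which is why the plot's R = 0.86 matches the R of 0.857 in the final model summary.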
HW 7 Demo - Questions 1 - 11
Demo videos will be posted over break.
Read instructions in R project which correspond to Blackboard HW Assignment 7.
Run `Setup`, then import and examine the data.
Examine the correlation matrix of the X variables.
- Remove the `#` symbols before the incomplete R code and replace the blanks (`____`) with the correct commands to calculate the correlation matrix with values rounded to 2 decimal places.
- Run the line or the whole code chunk to view the correlation matrix, which is large.
- Helpful tip: On the ‘Visual’ tab of the R Markdown options, change Editor content width to 1500.
Remove `#` from the two lines of code at the bottom and run these lines to find the largest positive and negative correlations in the matrix.
Answer Questions 1 - 2 based on the correlation matrix and min/max output.
HW 7 Demo - Questions 3 - 6
Run the next chunk of code to `Specify full model and Do Backward Elimination`:
- Create the full model with all variables and no interactions.
- Run the Backward Elimination.
Answer Questions 3 - 6 based on the Backward Elimination model output.
HW 7 Demo - Questions 7 - 11
Run the next code chunk to `Save the Final Model` as `final_ames_model`.
Complete the code in the next chunk to `Import New Data and Add Predictions`, then run the code to add model estimates and residuals to a new small dataset of two new houses.
- It is helpful to run the lines in this code block one at a time.
- Run the first command, which begins `new_houses <- read_csv(...`, to import a new small dataset with 2 observations.
- Run the command that begins `(new_houses <- new_houses |> mutate(Est_Price...` to add `Est_Price`, the regression estimates, to this dataset.
HW 7 Demo - Questions 7 - 11 Continued
Remove `#` before the following three lines to complete them:
#(new_houses <- new_houses |>
# mutate(Resid = ____ - ____ |> round()) |>
# relocate(Est_Price, Resid, .after=Price))
In the line with the blanks, you are calculating residuals as Price minus Estimated Price (`Resid = Price - Est_Price`).
The next line relocates `Est_Price` and `Resid` to the left side of the dataset, after `Price`.
Answer Questions 7 - 11 based on this output.
Model Selection Methods
Recall that in Multiple Linear Regression (MLR) the goal is to choose the simplest, most accurate model, i.e., the ‘BEST’ set of independent variables.
How do we decide which variables should be in our model?
There are many methods:
We’ve discussed Backward Elimination which can also be done manually in any software (not recommended).
Description of Other Model Selection Methods
Backward Elimination starts with all potential terms (including potential interaction terms) in the model and removes the least significant term for each step.
- This is referred to as starting with a full or saturated model.
Forward Selection: By default, this procedure starts with an empty model and adds the most significant term at each step until there are no more useful terms to add.
- Forward selection also needs to know what terms are in the full model.
Stepwise Selection: By default, this procedure starts with an empty model and then adds or removes a term for each step.
Common Practice: Try multiple methods to develop preliminary final model and then tweak as needed.
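Forward selection mirrors backward elimination, starting from the empty model and adding terms. The following base-R sketch illustrates the idea with `add1()` F-tests on the built-in `mtcars` data; it is not the automated procedure used in this course, and `forward_p` and `p_enter` are hypothetical names.

```r
# Forward selection by p-value, sketched with base R only
forward_p <- function(full, p_enter = 0.1) {
  candidates <- attr(terms(full), "term.labels")  # terms allowed to enter
  model <- update(full, . ~ 1)                    # start from the empty model
  repeat {
    remaining <- setdiff(candidates, attr(terms(model), "term.labels"))
    if (length(remaining) == 0) break             # nothing left to add
    tests <- add1(model, scope = formula(full), test = "F")
    pvals <- tests[["Pr(>F)"]][-1]                # first row is "<none>", so skip it
    if (min(pvals) >= p_enter) break              # no useful term left
    best <- rownames(tests)[-1][which.min(pvals)] # most significant term
    model <- update(model, as.formula(paste(". ~ . +", best)))
  }
  model
}

# Stand-in example: the full model defines which terms may be added
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
selected <- forward_p(full, p_enter = 0.1)
formula(selected)
```

The full model is passed in only to define the pool of candidate terms, which is the sense in which forward selection "needs to know" the full model.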
Notes about Model Selection
Using Multiple Methods
The steps for other methods are similar to the steps for Backward Elimination.
Not all steps are ALWAYS required. It depends on how complex the data are.
In the following example, we only need to do part of Step 1 plus Steps 2, 3, and 6.
For Step 1, we only need to examine correlations.
In this case, Step 7 will be apparent.
We can add model estimates to data for future interpretation (Step 8)
Steps for Model Selection Using Multiple Methods
- Examine Matrix of Scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and response variable.
- Also look at correlation matrix to check if there are pairs of variables to be concerned about.
Create a ‘saturated’ model with all potential predictor variables and interaction terms (Subjective!).
Use Backward Elimination, Forward Selection, and Stepwise Selection to find preliminary candidate models. (These are automated procedures!)
- Carefully examine results to see where these candidate models agree and disagree.
Steps for Model Selection Cont’d
- Examine predictors in preliminary candidate models to confirm they are not too highly correlated with each other.
- If two predictor variables in any model have a correlation of 0.8 or greater, drop one of them.
Rerun model selection methods, if a candidate model is substantially changed (not always needed).
Compare model fit statistics from final candidate model from all three methods.
Decide on final candidate and make final modifications, if needed.
Interpret final model.
Wine Data - Model Selection Example
Can we determine what factors affect wine quality even if we KNOW NOTHING about wine cultivation and chemistry?
Maybe!
Since we have no prior knowledge, we start with a straightforward full model with all available predictors and no interactions.
- In practice, a consultant would be working with a wine expert to carefully determine a saturated model that includes all possible interactions.
Import Wine Data
Notice that all variables are numeric (`<dbl>` stands for double, i.e., a decimal value).
Wine_Quality | Fixed_Acidity | Volatile_Acidity | Citric_Acidity | Residual_Sugar | Chlorides | Free_Sulphur_Dioxide | Total_Sulphur_Dioxide | Ph | Sulfate | Alcohol |
---|---|---|---|---|---|---|---|---|---|---|
5 | 9.3 | 0.48 | 0.29 | 2.1 | 0.127 | 6 | 16 | 3.22 | 0.72 | 11.2 |
6 | 9.1 | 0.22 | 0.24 | 2.1 | 0.078 | 1 | 28 | 3.41 | 0.87 | 10.3 |
7 | 7.9 | 0.34 | 0.36 | 1.9 | 0.065 | 5 | 10 | 3.27 | 0.54 | 11.2 |
5 | 7.2 | 1.00 | 0.00 | 3.0 | 0.102 | 7 | 16 | 3.43 | 0.46 | 10.0 |
7 | 11.9 | 0.43 | 0.66 | 3.1 | 0.109 | 10 | 23 | 3.15 | 0.85 | 10.4 |
5 | 7.2 | 0.49 | 0.24 | 2.2 | 0.070 | 5 | 36 | 3.33 | 0.48 | 9.4 |
Examine Correlation matrix for Multicollinearity
Wine_Quality Fixed_Acidity Volatile_Acidity
Wine_Quality 1.00 0.11 -0.39
Fixed_Acidity 0.11 1.00 -0.23
Volatile_Acidity -0.39 -0.23 1.00
Citric_Acidity 0.22 0.68 -0.52
Residual_Sugar 0.04 0.20 -0.01
Chlorides -0.10 0.12 0.04
Free_Sulphur_Dioxide 0.01 -0.18 -0.05
Total_Sulphur_Dioxide -0.08 -0.13 0.05
Ph -0.06 -0.70 0.19
Sulfate 0.21 0.19 -0.24
Alcohol 0.45 -0.08 -0.17
Citric_Acidity Residual_Sugar Chlorides
Wine_Quality 0.22 0.04 -0.10
Fixed_Acidity 0.68 0.20 0.12
Volatile_Acidity -0.52 -0.01 0.04
Citric_Acidity 1.00 0.16 0.21
Residual_Sugar 0.16 1.00 0.05
Chlorides 0.21 0.05 1.00
Free_Sulphur_Dioxide -0.07 0.18 -0.04
Total_Sulphur_Dioxide 0.06 0.18 0.00
Ph -0.55 -0.14 -0.26
Sulfate 0.27 -0.01 0.35
Alcohol 0.10 0.07 -0.21
Free_Sulphur_Dioxide Total_Sulphur_Dioxide Ph Sulfate
Wine_Quality 0.01 -0.08 -0.06 0.21
Fixed_Acidity -0.18 -0.13 -0.70 0.19
Volatile_Acidity -0.05 0.05 0.19 -0.24
Citric_Acidity -0.07 0.06 -0.55 0.27
Residual_Sugar 0.18 0.18 -0.14 -0.01
Chlorides -0.04 0.00 -0.26 0.35
Free_Sulphur_Dioxide 1.00 0.65 0.08 0.00
Total_Sulphur_Dioxide 0.65 1.00 -0.07 0.08
Ph 0.08 -0.07 1.00 -0.24
Sulfate 0.00 0.08 -0.24 1.00
Alcohol -0.03 -0.08 0.21 0.05
Alcohol
Wine_Quality 0.45
Fixed_Acidity -0.08
Volatile_Acidity -0.17
Citric_Acidity 0.10
Residual_Sugar 0.07
Chlorides -0.21
Free_Sulphur_Dioxide -0.03
Total_Sulphur_Dioxide -0.08
Ph 0.21
Sulfate 0.05
Alcohol 1.00
[1] 0.68
[1] -0.7
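The two values printed last are the largest positive and largest negative correlations once the diagonal of 1s is ignored. A minimal sketch of that step, using a 3 x 3 excerpt copied from the matrix above (the object names are chosen here for illustration):

```r
# Excerpt of the wine correlation matrix shown above
vars <- c("Fixed_Acidity", "Citric_Acidity", "Ph")
wine_cor_sub <- matrix(c( 1.00,  0.68, -0.70,
                          0.68,  1.00, -0.55,
                         -0.70, -0.55,  1.00),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(vars, vars))

# Blank out the diagonal so a variable's correlation with itself is ignored
off_diag <- wine_cor_sub
diag(off_diag) <- NA

max_r <- max(off_diag, na.rm = TRUE)  # 0.68 (Fixed_Acidity and Citric_Acidity)
min_r <- min(off_diag, na.rm = TRUE)  # -0.7 (Fixed_Acidity and Ph)
```

Neither value reaches 0.8 in magnitude, so no predictor has to be dropped at this stage.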
Model Selection
We specify a full model using an easy shortcut:
If all variables are included, you can use `.` instead of listing them all.
This model specification is also used in HW 7.
Then we run three model selection procedures:
- Backward Elimination (BE)
- Forward Selection (FS)
- Stepwise Selection (SS)
```{r specify full model, echo=T}
library(olsrr)  # provides the ols_step_*_p() selection functions

wine_full <- lm(Wine_Quality ~ ., data = wine)  # full model: `.` means all other columns
wine_BE <- ols_step_backward_p(wine_full, progress = FALSE, p_val = 0.1)  # backward elimination
wine_FS <- ols_step_forward_p(wine_full, progress = FALSE, p_val = 0.1)   # forward selection
wine_SS <- ols_step_both_p(wine_full, progress = FALSE, p_val = 0.1)      # stepwise selection
```
Comparing Model Results
Look at the LAST step for each method to determine which method results in the best fit.
Comparison Measures:
Adj. \(R^2\): Higher value indicates better model fit
C(p): Lower value indicates better model fit (also referred to as Mallows’ C(p)).
AIC: Lower value indicates better model fit (Akaike Information Criterion).
RMSE: Lower value indicates better model fit (Root Mean Square Error).
By comparing these measures and accounting for our understanding of these procedures, we can determine that TWO of these methods arrived at the same model.
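These four measures can also be computed directly for any candidate model. The sketch below is a base-R illustration on the built-in `mtcars` data; `fit_stats` is a hypothetical helper, and dividing the residual sum of squares by n for RMSE is an assumption chosen to match the model summaries shown earlier in these notes.

```r
# Fit statistics for a candidate model, relative to the full model
fit_stats <- function(model, full) {
  n   <- nobs(model)
  p   <- length(coef(model))                       # parameters, including intercept
  sse <- sum(resid(model)^2)                       # residual sum of squares
  mse_full <- sum(resid(full)^2) / df.residual(full)
  c(adj_r2 = summary(model)$adj.r.squared,         # higher is better
    cp     = sse / mse_full - n + 2 * p,           # Mallows' C(p): lower is better
    aic    = AIC(model),                           # lower is better
    rmse   = sqrt(sse / n))                        # lower is better
}

# Stand-in models: mtcars replaces the wine data
full      <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
candidate <- lm(mpg ~ wt + hp, data = mtcars)

comparison <- rbind(full      = fit_stats(full, full),
                    candidate = fit_stats(candidate, full))
round(comparison, 3)
```

By construction, C(p) for the full model equals its number of parameters, so a smaller candidate is attractive when its C(p) is close to or below its own parameter count.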
Lecture 16 In-class Exercises - Q4
Session ID: bua345s25
Which two model selection methods arrived at the same model for the wine data?
- On the next few slides I will show pairs of stepwise summaries so you can compare them.
Backward Elimination and Forward Selection
Backward Elimination
Forward Selection
Backward Elimination and Stepwise Selection
Backward Elimination
Stepwise Selection
Forward Selection and Stepwise Selection
Forward Selection
Stepwise Selection
Wine Model Validation Plot (R = 0.58)
Key Points from this Week
Regression modeling can be overwhelming
Automating part of the variable selection process is helpful.
Try different methods and compare results.
Results from automated processes are preliminary.
Model estimates and residuals can be added to dataset.
- Demonstrated in HW 7.
HW 6 due on Wed. 3/5 (Grace Period extended until 3/7).
HW 7 is posted and is due on Wed. 3/19
Date of Quiz 2 has been changed to Tuesday, 4/1.
To submit an Engagement Question or Comment about material from Lecture 16: Submit it by midnight today (day of lecture).