Recall that in Multiple Linear Regression (MLR), the goal is to choose the simplest, most accurate model, i.e., the ‘BEST’ set of independent variables.
How do we decide which variables should be in our model?
There are many methods:
We’ve discussed Backward Elimination, which can also be done manually in any statistical software (not recommended).
Backward Elimination starts with all potential terms (including potential interaction terms) in the model and removes the least significant term at each step.
This is referred to as starting with a full or saturated model.
Forward Selection: By default, this procedure starts with an empty model and adds the most significant term at each step until there are no more useful terms to add.
Forward selection also needs to know what terms are in the full model.
Stepwise Selection: By default, this procedure starts with an empty model and then adds or removes a term at each step. A sketch of all three procedures in R follows.
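As a rough illustration of how these three automated procedures can be run, here is a minimal R sketch using the olsrr package; the data frame `wine`, the response `rating`, and the formula are assumptions, not the course's exact code:

```r
library(olsrr)

# 'Saturated' model with all potential predictor terms (formula illustrative)
full_model <- lm(rating ~ ., data = wine)

# Backward Elimination: start full; drop the least significant term each step
ols_step_backward_p(full_model)

# Forward Selection: start empty; add the most significant term each step
ols_step_forward_p(full_model)

# Stepwise Selection: add or remove one term at each step
ols_step_both_p(full_model)
```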
Comments about Model Selection Methods
Common Practice: Try multiple methods to develop a preliminary final model and then tweak as needed.
Steps for model selection using multiple methods are similar to the steps for Backward Elimination (Week 8 Lectures)
Not all steps are ALWAYS required. It depends on how complex the data are.
In the following example, we only need to do part of Step 1 plus Steps 2, 3, and 6.
For Step 1, we only need to examine correlations.
In this case, Step 7 will be apparent.
We can add model estimates to the data for future interpretation (Step 8).
💥 Lecture 17 In-class Exercises - Q2 💥
Which model selection method is characterized by starting with NO (0) terms in the model and then adding terms one by one until there are no more significant terms to add?
A Backward Elimination
B Stepwise Selection
C Forward Selection
D Adjusted \(R^2\)
Steps for Model Selection Using Multiple Methods
Examine the matrix of scatterplots and histograms and determine whether any transformations are needed to linearize relationships between continuous predictors and the response variable.
Also look at the correlation matrix to check whether there are pairs of variables to be concerned about (a code sketch follows this list).
Create a ‘saturated’ model with all potential predictor variables and interaction terms (deciding which terms to include is subjective!).
Use Backward Elimination, Forward Selection, and Stepwise Selection to find preliminary candidate models. (These are automated procedures!)
Carefully examine results to see where these candidate models agree and disagree.
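A minimal sketch of Step 1 in base R, assuming a data frame `wine` whose columns are numeric (the name is an assumption):

```r
# Step 1 sketch: matrix of scatterplots to spot nonlinear relationships
# between continuous predictors and the response
pairs(wine)

# Correlation matrix (rounded) to flag highly correlated predictor pairs
round(cor(wine, use = "complete.obs"), 2)
```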
Steps for Model Selection Using Multiple Methods Cont’d
Examine predictors in the preliminary candidate models to confirm they are not too highly correlated with each other.
If two predictor variables in any model have an absolute correlation of 0.8 or greater, drop one of them (see the sketch after this list).
Rerun the model selection methods if a candidate model has changed substantially (not always needed).
Compare model fit statistics for the final candidate models from all three methods.
Decide on a final candidate and make final modifications, if needed.
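For the correlation check described above, a minimal sketch in R (the data frame `wine` and the variable names are hypothetical):

```r
# Check the correlation between two predictors that appear together in a
# candidate model (data frame and variable names are hypothetical)
cor(wine$acidity, wine$ph)   # if |r| >= 0.8, drop one of the pair
```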
Look at the LAST step for each method to determine which method results in the best fit.
Comparison Measures:
Adj. \(R^2\): Higher value indicates better model fit.
C(p): Lower value indicates better model fit (also referred to as Mallows’ C(p)).
AIC: Lower value indicates better model fit (Akaike Information Criterion).
RMSE: Lower value indicates better model fit (Root Mean Square Error).
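For reference, one common parameterization of these measures (my addition, not from the slides; software may differ by an additive constant for AIC), where \(n\) is the number of observations, \(p\) is the number of predictors in the candidate model, \(SSE\) is its error sum of squares, and \(MSE_{full}\) is the mean squared error of the full model:

\[
\text{Adj. } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}, \qquad
C_p = \frac{SSE}{MSE_{full}} - \bigl(n - 2(p + 1)\bigr),
\]
\[
AIC = n \ln\!\left(\frac{SSE}{n}\right) + 2(p + 1) + \text{constant}, \qquad
RMSE = \sqrt{\frac{SSE}{n - p - 1}}.
\]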
By comparing these measures and accounting for our understanding of these procedures, we can determine that TWO of these methods arrived at the same model.
💥 Lecture 17 In-class Exercises - Q3 💥
Session ID: bua345s24
Which two model selection methods arrived at the same model for the wine data?
On the next few slides I will show pairs of stepwise summaries so you can compare them.
Backward Elimination and Forward Selection
[Side-by-side stepwise summaries: Backward Elimination vs. Forward Selection.]
Backward Elimination and Stepwise Selection
[Side-by-side stepwise summaries: Backward Elimination vs. Stepwise Selection.]
Forward Selection and Stepwise Selection
[Side-by-side stepwise summaries: Forward Selection vs. Stepwise Selection.]
Wine Model Validation Plot (R = 0.58)
Best Subsets
Another model selection method is ‘Best Subsets’
The output shows the ‘Best’ one-variable model, the ‘Best’ two-variable model, the ‘Best’ three-variable model, etc.
Each ‘Best’ model is determined by multiple Fit Statistics.
This method then examines which of these candidates is the overall best by comparing their fit statistics.
If we are fortunate, the optimal choice from Best Subsets matches a model above.
In this case (and in HW 8), we are fortunate.
NOTE: The ols_step_best_subset command is VERY slow. You do not need to rerun it; the output is provided.
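For completeness, the call looks roughly like this (a sketch; `full_model` is the assumed saturated model from earlier):

```r
# Best Subsets via olsrr (slow with many predictors; output is provided)
best <- ols_step_best_subset(full_model)
best        # tables: variables in each 'Best' model plus fit statistics
plot(best)  # panels of fit statistics vs. number of predictors
```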
Some of the Best Subsets Plots
Reading Best Subsets Output
Tabular Output
The bottom table shows which model performs best, based on all of the fit statistics.
For example, if Model 3 (the three-variable model) were best, it would have the HIGHEST Adjusted \(R^2\), Lowest C(p), and Lowest AIC.
We can see from the bottom table that Model 3 is not the best.
Model 7 IS the best because it does have the HIGHEST Adjusted \(R^2\), Lowest C(p), and Lowest AIC.
The top table lists the variables in each of the ‘Best’ models.
Wine Best Subset Output
Preview of HW 8 - Part 1
Review the model comparisons for the Animal Data from the first part of the lecture.
Compare the optimal best subset model (Model 7) to the model found by both Backward Elimination and Forward Selection.
The goal is to determine to what extent they agree.
Spoiler: They are in complete agreement, which indicates that we have consensus on the model for these data.
Reminder of Upcoming Dates
Today’s Lecture (3/18) is the third and final lecture on model and variable selection.
HW 7 is due tomorrow, Wed., 3/19.
HW 8 is now posted and is due Wednesday, 3/26.
Part 1 pertains to Lectures 15-17.
Part 2 pertains to Lecture 18.
Quiz 2 is on Tuesday, April 1st, in the classroom.
Practice Questions will be posted this weekend.
Key Points from this Week
Regression modeling can be overwhelming because of all of the possible options.
Automating part of the variable selection process is helpful.
Trying different methods and comparing results is strongly recommended.
Results from automated procedures are preliminary models that can (and should) be tinkered with.
Once we have a final model, we can add regression estimates and residuals to the dataset (see the sketch after this list).
Methods Covered: Backward Elimination, Forward Selection, Stepwise Selection, Best Subsets.
Compare results from multiple methods.
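A minimal sketch of that last step in base R (the object names `final_model` and `wine` are assumptions):

```r
# Append regression estimates (fitted values) and residuals from the
# final model to the dataset
wine$estimate <- fitted(final_model)
wine$residual <- residuals(final_model)
head(wine)   # inspect the augmented data
```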
Engagement Questions or Comments about material from Lecture 17 must be submitted by midnight today (the day of lecture).