In Lecture 14, categorical terms and interactions had a simple interpretation:
Each category has a unique SLR model:
The intercepts for different categories may or may not differ from the baseline category (check P-values).
The slopes for different categories may or may not differ from the baseline category (check P-values).
There are other kinds of interaction terms.
The first one we will discuss is an interaction between two QUANTITATIVE variables.
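In general, such a model adds the product of the two quantitative predictors as an extra term. For predictors \(X_1\) and \(X_2\):

\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon\)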
Example MLR with Quantitative Interaction Term
One POSSIBLE model for these data (there are many):
# save and print MLR model output
(insure_mlr_quant1 <- ols_regress(ln_Charges ~ Age + BMI + Children + Age*Children, data = insure))
Model Summary
----------------------------------------------------------------
R 0.555 RMSE 0.765
R-Squared 0.307 MSE 0.585
Adj. R-Squared 0.305 Coef. Var 8.423
Pred R-Squared 0.302 AIC 3091.957
MAE 0.626 SBC 3123.150
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
-----------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-----------------------------------------------------------------------
Regression 347.605 4 86.901 147.968 0.0000
Residual 782.869 1333 0.587
Total 1130.474 1337
-----------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------
(Intercept) 7.195 0.127 56.858 0.000 6.947 7.443
Age 0.037 0.002 0.496 20.009 0.000 0.033 0.040
BMI 0.012 0.003 0.078 3.383 0.001 0.005 0.018
Children 0.251 0.055 0.139 4.563 0.000 0.143 0.358
Age:Children -0.004 0.001 -0.066 -2.782 0.005 -0.006 -0.001
-----------------------------------------------------------------------------------------
insure_model1 <- insure_mlr_quant1$model # save model parameters to use in calculations
Interpreting Quantitative Interactions
Two CORRECT interpretations of this interaction:
The effect of age on insurance charges differs depending on how many children you have.
The effect of number of children on insurance charges differs depending on your age.
Which interpretation the analyst emphasizes depends on the question being addressed.
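For example, using the rounded estimates from the output above, the fitted slope for Age depends on the number of children: \(0.037 - 0.004 \times Children\). For someone with 2 children, each additional year of age is associated with an increase of about \(0.037 - 0.004 \times 2 = 0.029\) in LN(Charges).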
Two Questions about Evaluating Interaction Terms:
How do we decide if ANY interaction term should stay in the model?
How do we obtain estimates from a model with a quantitative interaction?
Example: If a person is 48, has a BMI of 26, and has 3 children, what is the estimate of their insurance charges in dollars (NOT the LN of their charges)?
💥 Lecture 15 In-class Exercises - Q2💥
Session ID: bua345s25
Based on the R MLR output shown, is the interaction between Age and Number of Children useful in explaining differences in Insurance Charges?
Abridged Output
💥 Lecture 15 In-class Exercises - Q3💥
Session ID: bua345s25
Using this model, what is the estimated insurance charge for a 45-year-old with a BMI of 26 and 2 children? Round to the closest whole dollar.
Calculation can be done in R or by hand.
Age = 45
BMI = 26
Children = 2
Age*Children = 45*2 = 90
On the next slide I demonstrate how to do this in R using the saved model.
💥 Lecture 15 In-class Exercises - Q3💥
# specify values using variable names in model
Age <- 45
BMI <- 26
Children <- 2
(new_obs <- tibble(Age, BMI, Children))  # new_obs is a 1-row dataset
# A tibble: 1 × 3
Age BMI Children
<dbl> <dbl> <dbl>
1 45 26 2
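To finish the calculation, pass new_obs to the saved model. A minimal sketch, assuming insure_model1 is the fitted model object saved earlier (insure_mlr_quant1$model):

```r
# predict LN(Charges) for the new observation;
# the Age:Children interaction is computed automatically from the model formula
ln_pred <- predict(insure_model1, newdata = new_obs)

# back-transform from the LN scale to dollars and round to the nearest dollar
round(exp(ln_pred))
```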
In the previous model, all included terms appear to be useful. Is the interaction between Age and BMI also useful to the model?
Examine the model output to answer this question.
# save and print MLR model output
(insure_mlr_quant2 <- ols_regress(ln_Charges ~ Age + BMI + Children + Age*Children + Age*BMI, data = insure))
Model Summary
----------------------------------------------------------------
R 0.555 RMSE 0.764
R-Squared 0.309 MSE 0.584
Adj. R-Squared 0.306 Coef. Var 8.419
Pred R-Squared 0.302 AIC 3091.922
MAE 0.626 SBC 3128.315
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
-----------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-----------------------------------------------------------------------
Regression 348.795 5 69.759 118.871 0.0000
Residual 781.679 1332 0.587
Total 1130.474 1337
-----------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------
(Intercept) 6.785 0.315 21.567 0.000 6.168 7.402
Age 0.047 0.008 0.498 6.088 0.000 0.032 0.062
BMI 0.025 0.010 0.076 2.504 0.012 0.005 0.045
Children 0.249 0.055 0.139 4.540 0.000 0.142 0.357
Age:Children -0.004 0.001 -0.065 -2.750 0.006 -0.006 -0.001
Age:BMI 0.000 0.000 -0.033 -1.424 0.155 -0.001 0.000
-----------------------------------------------------------------------------------------
Goodness of Fit - Adjusted \(R^2\)
Previous slides show two possible models for these data. There are 63 possible models with these X variables and all two-way interactions.
Today we will discuss Adjusted \(R^2\) as one option to compare different models (we will cover other model comparison measures soon).
Adjusted \(R^2\) adjusts \(R^2\) DOWNWARD by adding a penalty for additional predictor variables.
\(R^2\) (unadjusted) should NOT be used to compare MLR models.
Adding predictors will always increase \(R^2\), even if predictors are not useful.
Instead we adjust: We penalize model \(R^2\) for each additional variable added.
Adjusted \(R^2\) only increases if model fit improvement exceeds penalty for adding terms.
More about Goodness of Fit - Adjusted \(R^2\)
P-values for each term and the change in Adjusted \(R^2\) often agree (but not always).
As P, the number of predictors, increases, the penalty increases.
Adjusted \(R^2 = 1 - \frac{(1-R^2)(n-1)}{n-P-1}\)
Students are not required to memorize this equation but you should understand what it is doing.
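We can check this formula against the output above. For the model with Age, BMI, Children, and Age:Children, the ANOVA table gives Total DF = 1337 (so n = 1338) and there are P = 4 predictor terms:

```r
# verify Adjusted R^2 for the 4-predictor model by hand
R2 <- 0.3075   # unadjusted R-squared (see table below)
n  <- 1338     # sample size (Total DF = 1337)
P  <- 4        # number of predictor terms

1 - (1 - R2) * (n - 1) / (n - P - 1)   # about 0.3054, matching the table
```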
All Possible Models Sorted by Number of X variables
\(R^2\) ALWAYS increases as the number of X variables increases.
Adjusted \(R^2\) ONLY increases if the added X variable is useful to the model.
| No. of Predictors | Predictors | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1 | Age | 0.2786 | 0.2781 |
| 1 | Children | 0.0260 | 0.0253 |
| 1 | BMI | 0.0176 | 0.0169 |
| 2 | Age Children | 0.2979 | 0.2969 |
| 2 | Age BMI | 0.2843 | 0.2832 |
| 3 | Age BMI Children | 0.3035 | 0.3019 |
| 4 | Age BMI Children Age:Children | 0.3075 | 0.3054 |
| 4 | Age BMI Children Age:BMI | 0.3046 | 0.3025 |
| 4 | Age BMI Children BMI:Children | 0.3036 | 0.3015 |
All Possible Models Sorted by Adj. \(R^2\)
\(R^2\) ALWAYS increases as the number of X variables increases.
Adjusted \(R^2\) ONLY increases if the added X variable is useful to the model.
| No. of Predictors | Predictors | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4 | Age BMI Children Age:Children | 0.3075 | 0.3054 |
| 4 | Age BMI Children Age:BMI | 0.3046 | 0.3025 |
| 3 | Age BMI Children | 0.3035 | 0.3019 |
| 4 | Age BMI Children BMI:Children | 0.3036 | 0.3015 |
| 2 | Age Children | 0.2979 | 0.2969 |
| 2 | Age BMI | 0.2843 | 0.2832 |
| 1 | Age | 0.2786 | 0.2781 |
| 1 | Children | 0.0260 | 0.0253 |
| 1 | BMI | 0.0176 | 0.0169 |
Introduction to Model Selection
AKA Variable Selection
Adjusted \(R^2\) is good for comparing a few models.
In this case we knew that only 9 of the 63 possible models were reasonable.
If there are many possible reasonable models, we automate part of the selection process.
In MLR, the goal is to choose the simplest, most accurate model, i.e., the 'BEST' set of independent variables.
How do we decide which variables should be in our model?
There are many methods:
A popular method, Backward Elimination, can also be done manually in any software:
Start with all potential terms (including potential interaction terms) in the model and remove the least significant term one at a time (a sketch of the automated version follows below).
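A minimal sketch of automated backward elimination using the olsrr package (already used above for ols_regress). The saturated formula here, with all three two-way interactions, is one reasonable choice for these data, not a prescribed model:

```r
library(olsrr)

# saturated model: all main effects plus all two-way interactions
insure_full <- lm(ln_Charges ~ Age + BMI + Children +
                    Age:Children + Age:BMI + BMI:Children,
                  data = insure)

# backward elimination by p-value: repeatedly removes the least
# significant term until all remaining terms meet the threshold
ols_step_backward_p(insure_full)
```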
Next Topics in Model Selection
Looking ahead, we’ll also cover:
Forward Selection
Stepwise Selection
‘All Possible’ models - compared using additional measures
Common Practice: Try multiple methods to develop a preliminary final model and then tweak as needed.
Steps for Backward Elimination
1. Examine the matrix of scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and the response variable.
Optional at this stage: also examine the correlation matrix to determine if some pairs of variables will be a concern.
New term, Multicollinearity: if two predictors (X variables) in the model have a correlation of 0.8 or higher, they cannot both stay in the model because they are multicollinear and make the model unstable.
2. Create a 'saturated' model with all potential predictor variables and interaction terms.
This is subjective.
Be as transparent as possible about how you decide on your full model.
3. Use Backward Elimination to pare the model down to a preliminary model.
Steps for Backward Elimination (continued)
4. Examine predictors in the preliminary model to confirm they are not too highly correlated with each other (a correlation check is sketched below).
If two predictor variables have a correlation of 0.8 or greater, drop one of them (see above).
5. If the model was modified in step 4, rerun the model through Backward Elimination (not always needed).
6. Interpret the final model.
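A minimal sketch of the correlation check used in steps 1 and 4, assuming the quantitative predictors are named as in the insurance models above:

```r
# pairwise correlations among the quantitative predictors;
# any pair with |r| >= 0.8 flags a multicollinearity concern
round(cor(insure[, c("Age", "BMI", "Children")]), 2)
```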
Plan for Thursday and HW 7
In HW 7, you will examine the correlation matrix and then do simple versions of steps 3 and 6 of the model selection process.
Thursday, we will look at a couple of interesting model selection examples.
Example 1: Animals Data
Question: What factors affect a mammal's sleep duration?
Animals Data Notes:
Population was limited to animals under 1000 pounds (two elephant species excluded).
Natural log (LN) transformed variables were added to original data.
Observations with missing values are removed (see the sketch below).
Working dataset has 49 observations (49 different species)
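The removal of missing values is not shown here; a minimal sketch, assuming the raw data are in a data frame named animals and using tidyr's drop_na():

```r
library(tidyverse)

# drop species with any missing values, leaving the 49 complete observations
animals <- animals |> drop_na()
```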
Preview of Lecture 16 Animals Data
| Species | TotalSleep | BodyWt | LNBodyWt | BrainWt | LNBrainWt | LifeSpan | LNLifeSpan | Gestation | Predation | Exposure | Danger |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Africangiantpouchedrat | 8.3 | 1.00 | 0.00 | 6.6 | 1.89 | 4.5 | 1.50 | 42 | 3 | 1 | 3 |
| Americanopossum | 19.4 | 1.70 | 0.53 | 6.3 | 1.84 | 5.0 | 1.61 | 12 | 2 | 1 | 1 |
| ArcticFox | 12.5 | 3.39 | 1.22 | 44.5 | 3.80 | 14.0 | 2.64 | 60 | 1 | 1 | 1 |
| Baboon | 9.8 | 10.55 | 2.36 | 179.5 | 5.19 | 27.0 | 3.30 | 180 | 4 | 4 | 4 |
| Bigbrownbat | 19.7 | 0.02 | -3.77 | 0.3 | -1.20 | 19.0 | 2.94 | 35 | 1 | 1 | 1 |
| Braziliantapir | 6.2 | 160.00 | 5.08 | 169.0 | 5.13 | 30.4 | 3.41 | 392 | 4 | 5 | 4 |
Animals Data Dictionary - Description of Variables
| Variable | Type | Description |
|---|---|---|
| Species | Nominal | Name of Species |
| TotalSleep | Quantitative | Total Sleep |
| BodyWt | Quantitative | Average Body Weight in kilograms |
| LNBodyWt | Quantitative | Natural Log of Body Weight |
| BrainWt | Quantitative | Average Brain Weight in grams |
| LNBrainWt | Quantitative | Natural Log of Brain Weight |
| LifeSpan | Quantitative | Maximum Life Span in years |
| LNLifeSpan | Quantitative | Natural Log of Life Span |
| Gestation | Quantitative | Gestation Time in days |
| Predation | Ordinal | Predation Index (1 = least likely to be prey) |
| Exposure | Ordinal | Sleep Exposure Index (1 = least exposed) |
| Danger | Ordinal | Overall Danger Index (1 = least danger from other animals) |
Key Points from Today
Regression modeling can be overwhelming.
Automating part of the variable selection process is helpful.
Today we introduced Backward Elimination.
Thursday we will look at a couple of other model selection methods.
Try different methods and compare results.
Results from automated processes are preliminary.
HW 6 due on Wed. 3/5 (Grace Period extended until 3/7).
HW 7 will be posted by 3/7 and is due on Wed. 3/19.
Date of Quiz 2 has been changed to Tuesday, 4/1.
Engagement Questions or Comments about material from Lecture 15 must be submitted by midnight today (the day of lecture).