In Lecture 14, categorical terms and interactions had a simple interpretation:
Each category has a unique SLR model:
The intercepts for different categories may or may not differ from the baseline category (check P-values)
The slopes for different categories may or may not differ from the baseline category (check P-values)
There are other kinds of interaction terms.
The first one we will discuss is an interaction between two QUANTITATIVE variables.
Example MLR with Quantitative Interaction Term
One POSSIBLE model for these data (there are many):
```{r insurance model with quantitative interaction, echo=T}
# save and print mlr model output
(insure_mlr_quant1 <- ols_regress(ln_Charges ~ Age + BMI + Children + Age*Children,
                                  data=insure))

# save model parameters to use in calculations
insure_model1 <- insure_mlr_quant1$model
```
Model Summary
----------------------------------------------------------------
R 0.555 RMSE 0.765
R-Squared 0.307 MSE 0.585
Adj. R-Squared 0.305 Coef. Var 8.423
Pred R-Squared 0.302 AIC 3091.957
MAE 0.626 SBC 3123.150
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
-----------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-----------------------------------------------------------------------
Regression 347.605 4 86.901 147.968 0.0000
Residual 782.869 1333 0.587
Total 1130.474 1337
-----------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------
(Intercept) 7.195 0.127 56.858 0.000 6.947 7.443
Age 0.037 0.002 0.496 20.009 0.000 0.033 0.040
BMI 0.012 0.003 0.078 3.383 0.001 0.005 0.018
Children 0.251 0.055 0.139 4.563 0.000 0.143 0.358
Age:Children -0.004 0.001 -0.066 -2.782 0.005 -0.006 -0.001
-----------------------------------------------------------------------------------------
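Reading the rounded estimates off the table above, the fitted model is:

\[\widehat{\ln(\text{Charges})} = 7.195 + 0.037\,\text{Age} + 0.012\,\text{BMI} + 0.251\,\text{Children} - 0.004\,(\text{Age} \times \text{Children})\]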
Interpreting Quantitative Interactions
Two CORRECT interpretations of this interaction:
1. The effect of age on insurance charges differs depending on how many children you have.
2. The effect of number of children on insurance charges differs depending on your age.
Which interpretation the analyst emphasizes depends on the question being addressed.
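For example, using the rounded coefficients above, the slope for Age is \(0.037 - 0.004 \times \text{Children}\): about 0.037 per year of age for someone with no children, but \(0.037 - 0.004 \times 3 = 0.025\) for someone with three children.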
Two Questions about Evaluating Interaction Terms:
How do we decide if ANY interaction term should stay in the model?
How do we obtain estimates from a model with a quantitative interaction?
Example: If a person is 48, has a BMI of 26, and has 3 children, what is the estimate of their insurance charges in dollars (NOT the LN of their charges)?
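A worked check of that example using the rounded coefficients (the exact answer from the saved model will differ slightly):

\[\widehat{\ln(\text{Charges})} = 7.195 + 0.037(48) + 0.012(26) + 0.251(3) - 0.004(48 \times 3) = 9.460\]

Back-transforming gives an estimated charge of roughly \(e^{9.460} \approx \$12{,}836\).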
Lecture 15 In-class Exercises - Q2
Session ID: bua345s25
Based on the R MLR output shown, is the interaction between Age and Number of Children useful in explaining differences in Insurance Charges?
Abridged Output
Lecture 15 In-class Exercises - Q3
Session ID: bua345s25
Using this model, what is the estimated insurance charge for a 45-year-old with a BMI of 26 and 2 children? Round to the closest whole dollar.
Calculation can be done in R or by hand.
Age = 45
BMI = 26
Children = 2
Age*Children = 45*2 = 90
On the next slide I demonstrate how to do this in R using the saved model.
Lecture 15 In-class Exercises - Q3
```{r create a dataset with 1 new observation, echo=T}
Age <- 45          # specify values using variable names in model
BMI <- 26
Children <- 2

(new_obs <- tibble(Age, BMI, Children))    # new_obs is a 1-row dataset

(new_obs <- new_obs |>                     # add regression estimate
    mutate(est_ln_Charges = lm(insure_model1) |> predict(new_obs)))

# (new_obs <- new_obs |>                   # back-transform estimate
#    mutate(est_Charges = ____(_____)))
```
# A tibble: 1 × 3
Age BMI Children
<dbl> <dbl> <dbl>
1 45 26 2
# A tibble: 1 × 4
Age BMI Children est_ln_Charges
<dbl> <dbl> <dbl> <dbl>
1 45 26 2 9.31
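As a check, the same estimate by hand with the rounded coefficients: \(7.195 + 0.037(45) + 0.012(26) + 0.251(2) - 0.004(90) = 9.314\), which agrees with the R estimate of 9.31. The blanked-out final step in the chunk back-transforms this natural-log estimate to dollars.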
Lecture 15 In-class Exercises - Q4
In the previous model, all included terms appear to be useful to the model. Is the interaction between Age and BMI also useful to the model?
Examine the model output to answer this question.
```{r insure model with two quant interactions, echo=T}
# save and print mlr model output
(insure_mlr_quant2 <- ols_regress(ln_Charges ~ Age + BMI + Children +
                                    Age*Children + Age*BMI, data=insure))
```
Model Summary
----------------------------------------------------------------
R 0.555 RMSE 0.764
R-Squared 0.309 MSE 0.584
Adj. R-Squared 0.306 Coef. Var 8.419
Pred R-Squared 0.302 AIC 3091.922
MAE 0.626 SBC 3128.315
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
-----------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-----------------------------------------------------------------------
Regression 348.795 5 69.759 118.871 0.0000
Residual 781.679 1332 0.587
Total 1130.474 1337
-----------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------
(Intercept) 6.785 0.315 21.567 0.000 6.168 7.402
Age 0.047 0.008 0.498 6.088 0.000 0.032 0.062
BMI 0.025 0.010 0.076 2.504 0.012 0.005 0.045
Children 0.249 0.055 0.139 4.540 0.000 0.142 0.357
Age:Children -0.004 0.001 -0.065 -2.750 0.006 -0.006 -0.001
Age:BMI 0.000 0.000 -0.033 -1.424 0.155 -0.001 0.000
-----------------------------------------------------------------------------------------
Goodness of Fit - Adjusted \(R^2\)
Previous slides show two possible models for these data. There are 63 possible models with these X variables and all two-way interactions.
Today we will discuss Adjusted \(R^2\) as one option to compare different models (we will cover other model comparison measures soon).
Adjusted \(R^2\) adjusts \(R^2\) DOWNWARD by adding a penalty for additional predictor variables.
\(R^2\) (unadjusted) should NOT be used to compare MLR models.
Adding predictors will always increase \(R^2\), even if predictors are not useful.
Instead we adjust: We penalize model \(R^2\) for each additional variable added.
Adjusted \(R^2\) only increases if model fit improvement exceeds penalty for adding terms.
More about Goodness of Fit - Adjusted \(R^2\)
P-values for each term and the change in Adjusted \(R^2\) often agree (but not always).
As P, the number of predictors, increases, the penalty increases.
Adjusted \(R^2 = 1 - \frac{(1-R^2)(n-1)}{n-P-1}\)
Students are not required to memorize this equation but you should understand what it is doing.
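As a quick check of what the formula does, here is the calculation for the first model, using the rounded values from its output (\(R^2 = 0.307\), total DF of 1337 so \(n = 1338\), and \(P = 4\) predictor terms):

```{r adjusted r2 check, echo=T}
# verify Adjusted R^2 from the formula using rounded output values
r2 <- 0.307                        # R-Squared from the model summary
n  <- 1338                         # Total DF (1337) + 1
P  <- 4                            # predictor terms, incl. the interaction
1 - (1 - r2)*(n - 1)/(n - P - 1)   # ~0.305, matching Adj. R-Squared
```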
All Possible Models Sorted by Number of X variables
\(R^2\) ALWAYS increases as number of X variables increases.
Adjusted \(R^2\) ONLY increases if X variable is useful to model.
| No. of Predictors | Predictors | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1 | Age | 0.2786 | 0.2781 |
| 1 | Children | 0.0260 | 0.0253 |
| 1 | BMI | 0.0176 | 0.0169 |
| 2 | Age Children | 0.2979 | 0.2969 |
| 2 | Age BMI | 0.2843 | 0.2832 |
| 3 | Age BMI Children | 0.3035 | 0.3019 |
| 4 | Age BMI Children Age:Children | 0.3075 | 0.3054 |
| 4 | Age BMI Children Age:BMI | 0.3046 | 0.3025 |
| 4 | Age BMI Children BMI:Children | 0.3036 | 0.3015 |
All Possible Models Sorted by Adj. \(R^2\)
\(R^2\) ALWAYS increases as number of X variables increases.
Adjusted \(R^2\) ONLY increases if X variable is useful to model.
| No. of Predictors | Predictors | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 4 | Age BMI Children Age:Children | 0.3075 | 0.3054 |
| 4 | Age BMI Children Age:BMI | 0.3046 | 0.3025 |
| 3 | Age BMI Children | 0.3035 | 0.3019 |
| 4 | Age BMI Children BMI:Children | 0.3036 | 0.3015 |
| 2 | Age Children | 0.2979 | 0.2969 |
| 2 | Age BMI | 0.2843 | 0.2832 |
| 1 | Age | 0.2786 | 0.2781 |
| 1 | Children | 0.0260 | 0.0253 |
| 1 | BMI | 0.0176 | 0.0169 |
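Comparisons like the two tables above can be generated with olsrr. A minimal sketch (the saturated model `insure_full` below is one reasonable starting point, not the only one):

```{r all possible models sketch, echo=T}
# fit a saturated model, enumerate all candidate models, rank by adjusted R^2
insure_full <- lm(ln_Charges ~ Age + BMI + Children + Age*Children +
                    Age*BMI + BMI*Children, data=insure)
insure_all_models <- ols_step_all_possible(insure_full)
insure_all_models$result |>
  dplyr::select(n, predictors, rsquare, adjr) |>   # keep useful columns
  arrange(desc(adjr)) |>                           # best adjusted R^2 first
  head()
```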
Introduction to Model Selection
AKA Variable Selection
Adjusted \(R^2\) is good for comparing a few models.
In this case we knew that only 9 of the 63 possible models were reasonable.
If there are many possible reasonable models, we automate part of the selection process.
In MLR, the goal is to choose the simplest, most accurate model, i.e., the ‘BEST’ set of independent variables
How do we decide which variables should be in our model?
There are many methods:
A popular method, Backward Elimination, can also be done manually in any software:
Start with all potential terms (including potential interaction terms) in the model and remove the least significant term one at a time (a minimal sketch follows below)
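A minimal sketch with olsrr, assuming a saturated model such as `insure_full` from the all-possible-models sketch above:

```{r backward elimination sketch, echo=T}
# p-value-based backward elimination: refits after dropping the least
# significant term, repeating until every remaining term is significant
ols_step_backward_p(insure_full)
```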
Next Topics in Model Selection
Looking ahead, we’ll also cover:
Forward Selection
Stepwise Selection
‘All Possible’ models - compared using additional measures
Common Practice: Try multiple methods to develop preliminary final model and then tweak as needed.
Steps for Backward Elimination
Examine the matrix of scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and the response variable.
Optional at this stage: Also examine correlation matrix to determine if some pairs of variables will be a concern
New term - Multicollinearity: If two predictors (X variables) in the model have a correlation of 0.8 or higher, they cannot both stay in the model because they are multicollinear and make the model unstable (a quick check is sketched after these steps).
Create a ‘saturated’ model with all potential predictor variables and interaction terms
This is subjective.
Be as transparent as possible about how you decide on your full model.
Use ‘Backward Elimination’ to pare model down to a preliminary model
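A minimal sketch of the correlation screen from step 1 (the choice of quantitative predictors here is illustrative):

```{r correlation screen sketch, echo=T}
# pairwise correlations among quantitative predictors;
# any pair at or above 0.8 signals a multicollinearity concern
insure |> dplyr::select(Age, BMI, Children) |> cor() |> round(2)
```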
Steps for Backward Elimination
Examine predictors in preliminary model to confirm they are not too highly correlated with each other.
If two predictor variables have a correlation of 0.8 or greater, drop one of them (see above)
If model was modified in step 4, rerun model through Backward Elimination (not always needed).
Interpret final model.
Plan for Thursday and HW 7
In HW 7, you will examine the correlation matrix and then do simple versions of steps 3 and 6 of the model selection process.
Thursday, we will look at a couple of interesting model selection examples.
Example 1: Animals Data
Question: What factors affect a mammal’s sleep duration?
Animals Data Notes:
Population was limited to animals under 1000 pounds (two elephant species excluded).
Natural log (LN) transformed variables were added to original data.
Observations with missing values were removed (see the import chunk below)
Working dataset has 49 observations (49 different species)
Preview of Lecture 16 Animals Data
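The preview below is produced by the import chunk from the lecture source:

```{r import data and remove missing values, echo=T}
# import data and remove observations missing LifeSpan or Gestation
animals <- read_csv("data/animals.csv", show_col_types=F) |>
  filter(!is.na(LifeSpan) & !is.na(Gestation))
head(animals) |> kable()
```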
| Species | TotalSleep | BodyWt | LNBodyWt | BrainWt | LNBrainWt | LifeSpan | LNLifeSpan | Gestation | Predation | Exposure | Danger |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Africangiantpouchedrat | 8.3 | 1.00 | 0.00 | 6.6 | 1.89 | 4.5 | 1.50 | 42 | 3 | 1 | 3 |
| Americanopossum | 19.4 | 1.70 | 0.53 | 6.3 | 1.84 | 5.0 | 1.61 | 12 | 2 | 1 | 1 |
| ArcticFox | 12.5 | 3.39 | 1.22 | 44.5 | 3.80 | 14.0 | 2.64 | 60 | 1 | 1 | 1 |
| Baboon | 9.8 | 10.55 | 2.36 | 179.5 | 5.19 | 27.0 | 3.30 | 180 | 4 | 4 | 4 |
| Bigbrownbat | 19.7 | 0.02 | -3.77 | 0.3 | -1.20 | 19.0 | 2.94 | 35 | 1 | 1 | 1 |
| Braziliantapir | 6.2 | 160.00 | 5.08 | 169.0 | 5.13 | 30.4 | 3.41 | 392 | 4 | 5 | 4 |
Animals Data Dictionary - Description of Variables
| Variable | Type | Description |
|---|---|---|
| Species | Nominal | Name of Species |
| TotalSleep | Quantitative | Total Sleep |
| BodyWt | Quantitative | Average Body Weight in kilograms |
| LNBodyWt | Quantitative | Natural Log of Body Weight |
| BrainWt | Quantitative | Average Brain Weight in grams |
| LNBrainWt | Quantitative | Natural Log of Brain Weight |
| LifeSpan | Quantitative | Maximum Life Span in years |
| LNLifeSpan | Quantitative | Natural Log of Life Span |
| Gestation | Quantitative | Gestation Time in days |
| Predation | Ordinal | Predation Index (1=least likely to be prey) |
| Exposure | Ordinal | Sleep Exposure Index (1=least exposed) |
| Danger | Ordinal | Overall Danger Index (1=least danger from other animals) |
Key Points from Today
Regression modeling can be overwhelming
Automating part of the variable selection process is helpful.
Today we introduced Backward Elimination
Thursday we will look at a couple other model selection methods.
Try different methods and compare results.
Results from automated processes are preliminary.
HW 6 due on Wed. 3/5 (Grace Period extended until 3/7).
HW 7 will be posted by 3/7 and is due on Wed. 3/19.
Date of Quiz 2 has been changed to Tuesday, 4/1.
To submit an Engagement Question or Comment about material from Lecture 15: Submit it by midnight today (day of lecture).