---
title: "BUA 345 - Lectures 15"
subtitle: "Introduction to Model Selection"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
html:
code-line-numbers: true
code-fold: true
code-tools: true
execute:
echo: fenced
---
## Housekeeping
```{r setup, echo=FALSE, warning=F, message=F, include=F}
#| include: false
# this line specifies options for default options for all R Chunks
knitr::opts_chunk$set(echo=F)
# suppress scientific notation
options(scipen=100)
# install helper package that loads and installs other packages, if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
# install and load required packages
pacman::p_load(pacman,tidyverse, magrittr, olsrr,
shadowtext, mapproj, knitr, kableExtra,
countrycode, usdata, maps, RColorBrewer,
gridExtra, ggthemes, gt, tools,
ggiraphExtra, viridis)
# verify packages
# p_loaded()
```
**HW 6 is due 3/4/2026** - 2 day grace period
::: nonincremental
- Demo videos were posted on Sunday morning
:::
**HW 7 will be posted on Thursday (3/5) is due on Wednesday, 3/18**.
**Quiz 2 will be on 3/26/2025**
### Today's plan
- Review of $R^2$ and Adjusted $R^2$
- Selecting a model based on Adjusted $R^2$
- Explanation of Quantitative Interactions
- Building a full model
- Backward Elimination for Model Selection
##
### Lecture 15 In-class Exercises - Q1
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
Recall the Actors and Athletes data that we examined in Lecture 14.
```{r celeb_prof regression, echo=T}
# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types=F)
# formatted regression output - saved and printed to screen
(celeb_interaction_ols<- ols_regress(Earnings ~ Age + Profession + Age*Profession,
data=celeb_prof, iterm = T))
```
##
### Lecture 15 In-class Exercises - Q1
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
#### Abridged Output
{fig-align="center"}
Review Question **What is the slope for the linear model for Athletes?**
Round answer to two decimal places.
Hint: To answer this question, you combine two terms:
- the baseline slope term for `Age`: 1.824
- the difference in slope `Athlete`: -5.063
- Slope for Athletes = baseline + difference = `____`
##
### Review of Regression Terms $R^2$ and Adjusted $R^2$
- **R** is the correlation coefficient, $R_{XY}$
- $R^2$ is $R_{XY}^2$
- $R^2$ is also called coefficient of determination
- **Meaning of** $R^2$ in SLR: Proportion of variability in y explained by X
- **Adjusted** $R^2$ adjusts $R^2$\> for number of explanatory (X) variables in model.
- Meaning of **Adjusted** $R^2$ in MLR is a little less specific but similar to $R^2$
##
### Example of$R^2$ Interpretation
**Import and Examine Insurance Data**
```{r import and examine insure_L15 data, echo=T}
insure <- read_csv("data/insure_L15.csv", show_col_types=F) # import
insure <- insure |> # create log transformed variable
mutate(ln_Charges = log(Charges))
head(insure) |> kable()
```
##
### Examine Histograms of Charges and ln_Charges
```{r creatingand formatting histograms of Charges and ln_Charges, echo=F, fig.width=15, fig.height=8, fig.align='center'}
# histogram of original Charges data
hist_Charges <- insure |>
ggplot() +
geom_histogram(aes(x=Charges), color="darkblue", fill="lightblue") +
labs(x="Insurance Charges", y="Frequency") +
theme_classic() +
theme(axis.title = element_text(size=18),
axis.text = element_text(size=15),
plot.background = element_rect(colour = "darkgrey", fill=NA, size=2))
# histogram of ln_Charges
hist_ln_Charges <- insure |>
ggplot() +
geom_histogram(aes(x=ln_Charges), color="darkgreen", fill="lightgreen") +
labs(x="Natural Log of Insurance Charges", y="Frequency") +
theme_classic() +
theme(axis.title = element_text(size=18),
axis.text = element_text(size=15),
plot.background = element_rect(colour = "darkgrey", fill=NA, size=2))
# display of these two histograms side by side
grid.arrange(hist_Charges, hist_ln_Charges, ncol=2)
```
## SLR model - Predictor (X) Variable: Age
```{r slr model for insurance data}
# save and print slr model
(insure_slr <- ols_regress(ln_Charges ~ Age, data=insure))
```
##
### More about $R^2$ and How It's Calculated
::: fragment
{fig-align="center"}
:::
- R = 0.528 which indicates a moderate correlation between `Age` and `ln_Charges` (Natural Log of Insurance Charges)
- $R^2$ = 0.279 which means that approximately 28% of the variability in `ln_charges` (Natural Log of Insurance Charges) is explained by `Age`.
##
### More about $R^2$ and How It's Calculated
::: fragment
{fig-align="center"}
:::
- $R^2$ can also be calculated from the Sum of Squares output:
- $SS_{TOT}$ (Total. Sum of Squares): **1130.474 (Total variability in Y)**
- $SS_{REG}$ (Regression. Sum of Squares): **314.960 (Variability in Y explained by model)**
- $SS_{RES}$ (Residual Sum of Squares): **815.514 (Variability in Y NOT explained by model)**
- $R^2$ = $SS_{REG}$ / $SS_{TOT}$ = 314.96/1130.474 = 0.279
##
### MLR with Quantitative Interaction Term
- In Lecture 14, categorical terms and interactions had a simple interpretation:
- Each category has a unique SLR model:
- The intercepts for different categories may or may not be different from baseline category(check P-values)
- The slopes for different categories may or may not be different from baseline category (check P-values)
<br>
- There are other kinds of interaction terms.
- The first one we will discuss is an interaction between two QUANTITATIVE variables.
##
### Example MLR with Quantitative Interaction Term
One POSSIBLE model for these data (there are many):
```{r insurance model with quantitative interaction, echo=T}
# save and print mlr model output
(insure_mlr_quant1 <- ols_regress(ln_Charges ~ Age + BMI + Children + Age*Children, data=insure))
insure_model1 <- insure_mlr_quant1$model # save model parameters to use in calculations
```
##
### Interpreting Quantitative Interactions
- Two CORRECT Interpretation(s) of this interaction:
1. The effect of age on insurance charges differs depending on how many children you have.
2. The effect of number of children on insurance charges differs depending your age.
- Which interpretation the analyst emphasizes depends on the question being addressed.
<br>
- Two Questions about Evaluating Interaction Terms:
- How do we decide if ANY interaction term should stay in the model?
- How do we attain estimates from a model with a qunatitative interaction?
Example: If a person is 48, has a BMI of 26 and has 3 children, what is the estimate of their insurance changes in dollars (NOT the LN of their charges)?
##
### Lecture 15 In-class Exercises - Q2
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
Based on the R MLR output shown, is the interaction between Age and Number of Children useful in explaining differences in Insurance Charges?
<br>
### Abridged Output
{fig-align="center"}
##
### Lecture 15 In-class Exercises - Q3
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
Using this model, what is estimated insurance charge for 45 year old with a BMI of 26 and 2 children? Round to closest whole dollar.
- Calculation can be done in R or by hand.
- `Age = 45`
- `BMI = 26`
- `Children = 2`
- `Age*Children = 45*2 = 90`
- On the next slide I demonstrate how to do this in R using the saved model.
##
### Lecture 15 In-class Exercises - Q3
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
```{r create a dataset with 1 new observation, echo=T}
Age <- 45 # specify values using variable names in model
BMI <- 26
Children <- 2
(new_obs <- tibble(Age, BMI, Children)) # new_obs is 1 row dataset
(new_obs <- new_obs |> # add regression estimate
mutate(est_ln_Charges = lm(insure_model1) |> predict(new_obs)))
#(new_obs <- new_obs |> # back-transform estimate
# mutate(est_Charges = ____(_____))
```
##
### Lecture 15 In-class Exercises - Q4
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
In the previous model, all included terms appear to be useful to the model. Is the interaction between Age and BMI also useful to the model?
Examine the model output to answer this question.
```{r insure model with two quant interactions, echo=T}
# save and print mlr model output
(insure_mlr_quant2 <- ols_regress(ln_Charges ~ Age + BMI + Children +
Age*Children + Age*BMI, data=insure))
```
##
### Goodness of Fit - Adjusted $R^2$
- Previous slides show two possible models for these data. ***There are 63 possible models*** with these X variables and all two way interactions.
- Today we will discuss **Adjusted** $R^2$ as one option to compare different models (We will cover other model comparison measures soon).
- **Adjusted** $R^2$ adjusts $R^2$ DOWNWARD by adding a penalty for additional predictor variables.
- $R^2$ (unadjusted) should NOT be used to compare MLR models.
- Adding predictors will always increase $R^2$, even if predictors are not useful.
- Instead we adjust: We penalize model $R^2$ for each additional variable added.
- Adjusted $R^2$ only increases if model fit improvement exceeds penalty for adding terms.
##
### More about Goodness of Fit - Adjusted $R^2$
- P-values for each term and change in Adjusted $R^2$ often agree (but not always)
- As P, number of predictors increases, the penalty increases.
- Adjusted $R^2 = 1 - \frac{(1-R^2)(n-1)}{n-P-1}$
- **Students are not required to memorize this equation but you should understand what it is doing.**
```{r code to examine all possible insurance models, echo=F}
insure_full <- lm(ln_Charges ~ Age + BMI + Children + Age*Children + # full model specified
Age*BMI + BMI*Children, data=insure)
insure_all_models <- ols_step_all_possible(insure_full) # all possible models
insure_all_subset <- insure_all_models$result |> as_tibble() |> # specify model subset
dplyr::select(mindex, n, predictors, rsquare, adjr) |> # determined based on
filter(mindex %in% c(1,5,6,8,10,22,42,44,47)) # understanding of data
insure_subset_print1 <- insure_all_subset |> # format table for printing and print
dplyr::select(n, predictors, rsquare, adjr) |> # select useful columns
mutate(rsquare = rsquare |> round(4), # round values to 4 decimal places
adjr = adjr |> round(4)) |>
arrange(desc(adjr)) |> # reorder by adjust rsquared
rename(`No. of Predictors`= n, # rename table columns
Predictors = predictors,
`$R^2$` = rsquare,
`Adjusted $R^2$` = adjr) |>
kable()
```
```{r create additional kable table surted by number of predictors, echo=F}
insure_subset_print2 <- insure_all_subset |> # format table for printing and print
select(n, predictors, rsquare, adjr) |> # select useful columns
mutate(rsquare = rsquare |> round(4), # round values to 4 decimal places
adjr = adjr |> round(4)) |>
rename(`No. of Predictors`= n, # rename table columns
Predictors = predictors,
`$R^2$` = rsquare,
`Adjusted $R^2$` = adjr) |>
kable()
```
##
#### All Possible Models Sorted by Number of X variables
$R^2$ ALWAYS increases as number of X variables increases.
Adjusted $R^2$ ONLY increases if X variable is useful to model.
```{r table sorted by num of predictors, echo=F}
insure_subset_print2
```
##
### All Possible Models Sorted by Adj. $R^2$
$R^2$ ALWAYS increases as number of X variables increases.
Adjusted $R^2$ ONLY increases if X variable is useful to model.
```{r table sorted by num of adjr2, echo=F}
insure_subset_print1
```
##
### Introduction to Model Selection
**AKA Variable Selection**
- Adjusted $R^2$ is good for comparing a few models.
- In this case we knew that only 9 of the 63 possible models were reasonable.
- If there are many possible reasonable models, we automate part of the selection process.
- In MLR, the goal is to choose the simplest most accurate model, i.e. the 'BEST' set of independent variables
- How do we decide which variables should be in our model?
- There are many methods:
- A popular method, **Backward Elimination**, can also be done manually in any software:
- Start with all potential terms (including potential interaction terms) in the model and removes the least significant term one at time
##
### Next Topics in Model Selection
- Looking ahead, we'll also cover:
- Foreward Selection
- Stepwise Selection
- 'All Possible' models - compared using additional measures
- **Common Practice:** Try multiple methods to develop preliminary final model and then tweak as needed.
##
### Steps for Backward Elimination
1. Examine Matrix of Scatterplots and histograms and determine if any transformations are needed to linearize relationships between continuous predictors and response variable.
- Optional at this stage: Also examine correlation matrix to determine if some pairs of variables will be a concern
- **New term - Multicollinearity:** If two predictors (X variables) in model have a correlation of 0.8 or higher, they can not both stay in the model because they are multicollinear and cause the model to be unstable.
2. Create a 'saturated' model with all potential predictor variables and interaction terms
- This is subjective.
- Be as transparent as possible in your how you decide on your full model.
3. Use **'Backward Elimination'** to pare model down to a preliminary model
## Steps for Backward Elimination
4. Examine predictors in preliminary model to confirm they are not too highly correlated with each other.
- If two predictor variables have a correlation of 0.8 or greater, drop one of them (see above)
5. If model was modified in step 4, rerun model through Backward Elimination (not always needed).
6. Interpret final model.
##
### Plan for Thursday and HW 7
- In HW 7, you will examine the correlation matrix and then do straightforward versions of steps 3 and 6 of the model selection process.
- Thursday, we will look at a couple of interesting models selection examples.
::: fragment
**Example 1: Animals Data**
:::
- **Question: What factors affect a mammal's sleep duration?**
- Animals Data Notes:
- Population was limited to animals under 1000 pounds (two elephant species excluded).
- Natural log (LN) transformed variables were added to original data.
- Observations with missing values are removed below
- Working dataset has 49 observations (49 different species)
##
### Preview of Lecture 16 Animals Data
```{r import data and remove missing values}
# import and examine data
animals <- read_csv("data/animals.csv", show_col_types=F) |>
filter(!is.na(LifeSpan) & !is.na(Gestation))
head(animals) |> kable()
```
##
### Animals Data Dictionary - Description of Variables
```{r animals data dictionary table, echo=F}
Variable <- names(animals)
Type <- c("Nominal", rep("Quantitative", 8), rep("Ordinal", 3))
Description <- c("Name of Species",
"Total Sleep",
"Average Body Weight in kilograms",
"Natural Log of Body Weight",
"Average Brain Weight in grams",
"Natural Log of Brain Weight",
"Maximum Life Span in years",
"Natural Log of Life Span",
"Gestation Time in days",
"Predation Index (1=least likely to be prey)",
"Sleep Exposure Index (1=least exposed)",
"Overall Danger Index (1=least danger from other animals)")
(animal_data_dictionary = tibble (Variable, Type, Description) |> kable())
```
## {background-image="img/tired_panda_faded.png"}
### Key Points from Today
- Regression modeling can be overwhelming
- Automating part of the variable selection process is helpful.
- Today we introduced Backward Elimination
- Thursday we will look at a couple other model selection methods.
- Try different methods and compare results.
- Results from automated processes are preliminary.
- HW 6 due on Wed. 3/4 (Grace Period ends 3/6 at 10:00 PM).
- HW 7 will be posted by 3/5 and is due on Wed. 3/18.
::: fragment
**To submit an Engagement Question or Comment about material from Lecture 15:** Submit it by midnight today (day of lecture).
:::