---
title: "BUA 345 - Lecture 14"
subtitle: "Categorical Regression - Interaction Model"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
  html:
    code-line-numbers: true
    code-fold: true
    code-tools: true
execute:
  echo: fenced
---
## Housekeeping
```{r setup, echo=FALSE, warning=F, message=F, include=F}
#| include: false
# this line sets the default options for all R chunks
knitr::opts_chunk$set(echo=F)
# suppress scientific notation
options(scipen=100)
# install helper package that loads and installs other packages, if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
# install and load required packages
pacman::p_load(pacman,tidyverse, magrittr, olsrr,
shadowtext, mapproj, knitr, kableExtra,
countrycode, usdata, maps, RColorBrewer,
gridExtra, ggthemes, gt, tools,
ggiraphExtra)
# verify packages
# p_loaded()
```
**HW 5 is due 2/27/2026** - 3 day grace period
**HW 6 is due 3/4/2026** - 2 day grace period
**HW 7 and HW 8 will be due after break** - Can be completed without working during break.
**Quiz 2 will be on 3/26/2026** - Practice Questions will be posted after break.
### Today's plan
- Review Parallel Lines Model
- Introduce Interaction term and Interaction Model
- Work through how to interpret model output
- Introduce HW 6
- Talk about next steps
##
### Review Question - Import data
```{r import and examine house remodel data, echo=T}
house_remodel <- read_csv("data/house_remodel.csv", show_col_types = F)
head(house_remodel, 3) |> kable()
house_remodel |> select(Remodeled) |> table() # number of obs by category
house_remodel |> select(Price, Square_Feet) |> cor() |> round(2) # correlation of price & sq. ft.
```
##
### Lecture 14 In-class Exercises - Q1 - Review
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
Based on the `Parameter Estimates` table for the specified categorical regression model, which category is the baseline category?
::: r-fit-text
```{r house model specified}
(house_rem_cat_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel))
```
:::
##
### A Comment About Formatted Output
- In HW 6 and below, I use R code to format the output so that it is easier to read.
- The values are IDENTICAL to the unformatted output.
- Note: Formatted output will differ in appearance depending on where it is viewed, i.e., slides, HTML file, or .qmd file.
::: fragment
#### Formatted Abridged Output (Similar to HW 6)
```{r house model formal param estimated kable table, echo=F}
# formatted regression output
# model is saved (not printed); selected columns are formatted below
house_rem_cat_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel)
model_out <- tibble(house_rem_cat_ols$mvars, # select columns from saved model
house_rem_cat_ols$betas,
house_rem_cat_ols$std_errors,
house_rem_cat_ols$tvalues,
house_rem_cat_ols$pvalues) |>
mutate(`model` = `house_rem_cat_ols$mvars`, # round and rename columns
`Beta` = round(`house_rem_cat_ols$betas`,2),
`Std.Error` = round(`house_rem_cat_ols$std_errors`,2),
`t` = round(`house_rem_cat_ols$tvalues`,2),
`Sig` = round(`house_rem_cat_ols$pvalues`,4)) |>
select(6:10) # select output columns
model_out |> kable()
```
:::
##
### Quick Review of Categorical Regression
- On Tuesday we covered the `Parallel Lines` Model:
- A Parallel Lines model has two X variables, one quantitative and one categorical variable.
- Model estimates a separate SLR model for each category in the categorical variable.
- Model assumes all categories have the same **SLOPE**.
- Model estimates a separate **INTERCEPT** for each category.
- Model output shows results of a hypothesis test to determine if each non-baseline category's intercept is significantly different from baseline intercept.
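In symbols, the bullets above can be sketched for a two-category variable as follows. The notation is an assumption of this sketch rather than course notation: $x$ is the quantitative variable and $D$ is an indicator equal to 1 for the non-baseline category and 0 for the baseline.

$$
\begin{aligned}
\hat{y} &= b_0 + b_1 x + b_2 D \\
\text{Baseline } (D = 0):\quad \hat{y} &= b_0 + b_1 x \\
\text{Non-baseline } (D = 1):\quad \hat{y} &= (b_0 + b_2) + b_1 x
\end{aligned}
$$

Both lines share the slope $b_1$, and $b_2$ is the intercept difference that the hypothesis test in the output examines.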
## Interactive Plot of House Remodel Data
```{r house remodel mlr model, echo=F}
# mlr model created using lm
rem_cat_lm <- lm(Price ~ Square_Feet + Remodeled, data=house_remodel)
# create and save interactive plot
(int_rem_mlr <- ggPredict(rem_cat_lm, interactive=T))
```
## Calculations from House Model
- By default R chooses baseline categories alphabetically
- `No` is before `Yes` so un-`Remodeled` houses are the baseline
- Un-Remodeled (`No`) SLR Model:
- `Est. Price = 137549.093 + 137.879 * Square_Feet`
- Remodeled (`Yes`) SLR Model:
- `Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216`
- `Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet`
- `Est. Price = 228466.3 + 137.879 * Square_Feet`
- **Interpretation:**
- Prices of remodeled houses are about ***91 thousand dollars more than similar houses without remodeling***, after accounting for square footage.
- This difference is statistically significant (P-value \< 0.001)
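The arithmetic above can be checked with a few lines of R. This is a minimal sketch that copies the three coefficients from this slide. The variable names and the 2000 square foot house are hypothetical choices made only for illustration.

```{r parallel lines arithmetic sketch, echo=T}
# coefficients copied from the slide above (parallel lines model)
b_intercept <- 137549.093   # baseline (No) intercept
b_sqft      <- 137.879      # slope for Square_Feet, shared by both categories
b_remodeled <- 90917.216    # intercept difference for remodeled (Yes) houses

# intercept for remodeled houses = baseline intercept + difference
int_yes <- b_intercept + b_remodeled

# hypothetical house size, chosen only for illustration
sqft <- 2000
price_no  <- b_intercept + b_sqft * sqft
price_yes <- int_yes + b_sqft * sqft

round(c(int_yes = int_yes, price_no = price_no, price_yes = price_yes), 2)
price_yes - price_no   # same as b_remodeled at any square footage
```

Because the two lines are parallel, the estimated price gap between remodeled and un-remodeled houses does not depend on the square footage plugged in.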
## HW 6 - Questions 1 - 6
- This part of HW 6 examines data similar to the House-Remodel data examined in Lecture 13 and the review question.
- The dataset is smaller and the numbers are different, but the questions are essentially the same.
::: fragment
{fig-align="center"}
:::
##
### Categorical Regression with Interactions
- The categorical models covered so far assume that the `SLR` models for all categories have the same slope.
- How do we examine that assumption?
- For example:
- The celebrity data in Lecture 13 showed earnings decreasing as celebrities got older.
- Slope was assumed to be IDENTICAL for both males and females
- That may not be true for all celebrities.
- In the following small dataset, we will look at male celebrities only and examine whether actors' and athletes' salaries follow the same trend.
##
### Import and Examine Celebrity Profession Data
```{r import and examine celeb_prof data, echo=T}
# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types=F)
head(celeb_prof) |> kable()
# use table to summarize data by category
celeb_prof |> select(Profession) |> table()
```
##
### Examine Correlations in Celebrity Professions Data
**Note:** If categories have different slopes, correlations for whole dataset will be misleading.
```{r correlations celeb_prof data, echo=TRUE}
celeb_prof |> select(Earnings, Age) |> cor() |> round(2) # all data
celeb_prof |> filter(Profession=="Actor") |> # actors only
select(Earnings, Age) |> cor() |> round(2)
celeb_prof |> filter(Profession=="Athlete") |> # athletes only
select(Earnings, Age) |> cor() |> round(2)
```
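To see why a whole-dataset correlation can mislead, here is a small sketch with made-up numbers (not the celebrity data). The `toy` data frame and its two groups are hypothetical: both groups are perfectly linear, but their slopes have opposite signs.

```{r pooled correlation toy sketch, echo=T}
# toy data (made up): two groups with the same x values and opposite slopes
toy <- data.frame(
  group = rep(c("A", "B"), each = 10),
  x     = rep(1:10, times = 2)
)
toy$y <- ifelse(toy$group == "A", 2 * toy$x, 22 - 2 * toy$x)

in_a <- toy$group == "A"
cor_all <- cor(toy$x, toy$y)                # both groups pooled
cor_a   <- cor(toy$x[in_a], toy$y[in_a])    # group A only
cor_b   <- cor(toy$x[!in_a], toy$y[!in_a])  # group B only

round(c(all = cor_all, A = cor_a, B = cor_b), 2)
```

In this constructed case the two trends cancel, so the pooled correlation is zero even though each group on its own has a perfect linear relationship.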
## Explore and Plot Data
::::: columns
::: {.column width="40%"}
- Scatter plot shows that a regression model should be created with
- Different intercepts for each profession
- Different slopes for each profession
```{r exploratory scatter plot code for celeb_prof data, echo=F}
celeb_sctrplot <- celeb_prof |> #scatterplot code
ggplot() +
geom_point(aes(x=Age, y=Earnings, color=Profession), size=4) +
theme_classic() +
theme(axis.title = element_text(size=18),
axis.text = element_text(size=15),
legend.text = element_text(size=10),
plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
::: {.column width="60%"}
```{r show celeb scatterplot, fig.dim=c(7,6) , echo=F}
celeb_sctrplot
```
:::
:::::
##
### Interactive Model Plot - Celebrity Professions
```{r celeb profession model plot, echo=FALSE}
# create linear model with categories and an interaction
celeb_prof_lm <-
lm(Earnings ~ Age + Profession +
Age*Profession, data=celeb_prof)
# create interactive plot of model
# copy and paste this into console (below) to see plot in viewer
(clb_int <- ggPredict(celeb_prof_lm,
interactive = T))
```
##
### Regression Model - Celebrity Professions
Now that we understand the data and linear trends, we can examine and interpret the regression model output.
::: r-fit-text
```{r celeb_prof regression}
(celeb_interaction_ols<- ols_regress(Earnings ~ Age + Profession + Age*Profession,
data=celeb_prof, iterm = T))
```
:::
##
### Interpreting Model Coefficients (betas)
- **Baseline category is first alphabetically**
- Actor comes before Athlete in the alphabet so **Actor is the baseline category**.
- **Actor SLR Model:**
- **`Earnings = -50.297 + 1.824*Age`**
- **`ProfessionAthlete` term:** Difference from Actor (baseline) model Intercept
- **`Age:ProfessionAthlete` term:** Difference from Actor model Slope
- Athlete SLR Model requires some calculations:
- `Earnings = -50.297 + 1.824*Age + 227.218 - 5.063*Age`
- `Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age`
- `Earnings = 176.921 - ____*Age`
- ***Athlete SLR Model:*** `Earnings = 176.921 - ____*Age`
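The same bookkeeping can be written in general form. In this sketch the symbols are assumptions, not course notation: $D$ is an indicator equal to 1 for Athlete and 0 for Actor, $b_2$ stands for the `ProfessionAthlete` estimate, and $b_3$ stands for the `Age:ProfessionAthlete` estimate.

$$
\begin{aligned}
\widehat{\text{Earnings}} &= b_0 + b_1\,\text{Age} + b_2\,D + b_3\,(\text{Age} \times D) \\
\text{Actor } (D = 0):\quad \widehat{\text{Earnings}} &= b_0 + b_1\,\text{Age} \\
\text{Athlete } (D = 1):\quad \widehat{\text{Earnings}} &= (b_0 + b_2) + (b_1 + b_3)\,\text{Age}
\end{aligned}
$$

Setting $D = 1$ and collecting terms is exactly the calculation shown in the bullets: the intercept difference is added to the baseline intercept and the slope difference is added to the baseline slope.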
##
### Lecture 14 In-class Exercises - Q2
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
**What is the slope term (estimated `beta` for Age) for the Athlete SLR model?**
Specify answer to two decimal places.
#### Abridged Output
{fig-align="center"}
- **`ProfessionAthlete` term:** Difference from Actor (baseline) model Intercept
- **`Age:ProfessionAthlete` term:** Difference from Actor model Slope
- `Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age`
- ***Athlete SLR Model:*** `Earnings = 176.921 - ____*Age`
## Determining Statistical Significance
- In this case, we don't need to examine the P-values because the model differences between groups are so clear (but we will).
- The two intercepts and two slopes are **VERY** different.
- Reminder of Hypothesis Testing concepts:
- **The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.**
- If this sentence is not clear to you, **you are responsible for reviewing the Review Materials** on Hypothesis Tests and Significance Tests (and other related topics):
- [**MAS 261 Lecture Slides and Notes**](https://peneloopy.github.io/mas_261_perm/){target="_blank"}
- [**Khan Academy - The Idea of Significance tests**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/v/simple-hypothesis-testing){target="_blank"}
- [**Khan Academy - Comparing a P-value to a Significance level**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/comparing-p-value-from-t-statistic-to-significance-level?modal=1){target="_blank"}
##
### Conclusions from Actor and Athlete Model
#### Abridged Output
{fig-align="center"}
- **P-value for difference in intercepts (`ProfessionAthlete`):** \< 0.001
- Actors SLR model and Athletes SLR model intercepts are significantly different.
- **P-value for difference in slopes (`Age:ProfessionAthlete`):** \< 0.001
- Actors SLR model and Athletes SLR model slopes are significantly different.
- **These interpretations are in agreement with what we can easily see in the Interactive Model Plot.**
## Movie Genres and Costs
- Is length of a movie (`Runtime`) a good predictor of the movie budget?
- Does the relationship between movie length and budget differ by movie genre?
::: fragment
#### Import Movie Data
```{r import movies data, echo=T}
movies <- read_csv("data/movies.csv", show_col_types = F) # Import and examine data
head(movies, 4) |> kable()
movies |> select(Genre) |> table() # use table to examine categories in the data
```
:::
## Examine Correlations in Movie Data
**Note:** If categories have different slopes, correlations for whole dataset will be misleading.
```{r correlations movie data, echo=T}
movies |> select(Budget, Runtime) |> cor() |> round(2) # all data
movies |> filter(Genre == "Action") |> # action movies
select(Budget, Runtime) |> cor() |> round(2)
movies |> filter(Genre == "Suspense / Horror") |> # suspense/horror movies
select(Budget, Runtime) |> cor() |> round(2)
```
## Explore and Plot Data
::::: columns
::: {.column width="40%"}
- Scatter plot shows that a regression model should be created with
- Different intercepts for each genre
- Different slopes for each genre
```{r exploratory scatter plot code for movie data, echo=F}
movie_scrtplot <- movies |>
ggplot() +
geom_point(aes(x=Runtime, y=Budget, color=Genre), size=4) +
theme_classic() +
labs(x="Run time (Movie Length)") +
theme(axis.title = element_text(size=18),
axis.text = element_text(size=15),
legend.text = element_text(size=10),
plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::
::: {.column width="60%"}
```{r show movie scatterplot, fig.dim=c(7,6), echo=F}
movie_scrtplot
```
:::
:::::
## Interactive Model Plot - Movie Data
```{r movie model plot, echo=F}
# create linear model with categories and an interaction
movie_cat_lm <- lm(Budget ~ Runtime + Genre +
Runtime*Genre, data=movies)
# create interactive plot of model
# copy and paste this into console (below) to see plot in viewer
(movie_int <- ggPredict(movie_cat_lm,
interactive = T))
```
##
### Movie Genres Regression Model
Again, we can examine and interpret the regression model output.
```{r movie regression, results='hide'}
(movie_interaction_ols<- ols_regress(Budget ~ Runtime + Genre +
Runtime*Genre, data=movies, iterm = T))
```
#### Abridged Output
{fig-align="center"}
##
### Lecture 14 In-class Exercises - Q3
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
**What is the intercept term (estimated `beta`) for the Suspense / Horror SLR model?**
- Specify answer to two decimal places.
::: fragment
#### Abridged Output
{fig-align="center"}
:::
- **`GenreSuspense / Horror` term:** Difference from Action (baseline) model Intercept
- **`Runtime:GenreSuspense / Horror` term:** Difference from Action model Slope
- `Budget = (-286.67 + 218.75) + (3.25 - 2.35)*Runtime`
- ***Suspense / Horror SLR Model:*** `Budget = ____ + 0.9*Runtime`
## Determining Statistical Significance
- In this case, we don't need to examine the P-values because the model differences between groups are so clear (but we will).
- The two intercepts and two slopes are **VERY** different.
- Reminder of Hypothesis Testing concepts:
- **The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.**
- If this sentence is not clear to you, **you are responsible for reviewing the Review Materials** on Hypothesis Tests and Significance Tests (and other related topics):
- [**MAS 261 Lecture Slides and Notes**](https://peneloopy.github.io/mas_261_perm/){target="_blank"}
- [**Khan Academy - The Idea of Significance tests**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/v/simple-hypothesis-testing){target="_blank"}
- [**Khan Academy - Comparing a P-value to a Significance level**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/comparing-p-value-from-t-statistic-to-significance-level?modal=1){target="_blank"}
##
### Conclusions from Movie Genre Model
#### Abridged Output
{fig-align="center"}
- **P-value for diff. in intercepts (`GenreSuspense / Horror`):** \< 0.001
- Intercepts for these two distinct genre SLR models are `_____` (Next Question).
- **P-value for diff. in slopes (`Runtime:GenreSuspense / Horror` term):** \< 0.001
- Slopes for these two distinct genre SLR models are `_____` (Next Question).
- **These interpretations are in agreement with what we can easily see in the Interactive Model Plot.**
##
### Lecture 14 In-class Exercises - Q4
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
#### Abridged Output
{fig-align="center"}
**The difference in intercepts and the difference in slopes between the model for Action movies and the model for Suspense / Horror movies are both `_____`.**
::: nonincremental
A. statistically significant
B. statistically insignificant
:::
##
### Lecture 14 In-class Exercises - Q5
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.
<br>
Fill in the blank:
**The smaller the P-value, the `_____` evidence there is that the Beta coefficient is non-zero and the term is useful to the model.**
##
### HW 6 - Questions 7 - 16
- Dataset has three categories of Diamonds:
- `Colorless`, `Faint yellow`, and `Nearly colorless`
- **Colorless** is first alphabetically so that is the **baseline category** by default.
- Each color category has unique intercept AND a unique slope.
- The interactive model plot and abridged regression output are provided.
- All Blackboard questions can be answered by rendering .qmd file to examine .html output.
- **Helpful TIP:** In addition to other recommended options, change preview option (see next slide).
##
### HW 6 - HTML Options
- For HW 6 you do not have to write any R code.
- Instead you are expected to correctly interpret provided output.
- Quiz 2 will have similar output **WITHOUT the interactive plots**.
- I have provided the HTML file but you are encouraged to edit the HW 6 file with your own notes and render it.
- You can publish your rendered file using RPubs and save the link.
- Demo in class
## What if the Slopes Aren't Different?
- Recall the House Remodel Data from Lecture 13
- These data were clearly modeled by two parallel lines.
- What happens if we add an interaction term to test for different slopes?
- P-value (Sig) for the interaction term is much greater than 0.05, indicating that the interaction term is not needed.
::: fragment
**Abridged Output**

:::
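The sketch below illustrates the same idea on made-up data, not the House Remodel data. Everything in it is hypothetical: the `toy_houses` data frame is constructed so that both groups have the same slope, and it uses base R `lm()` and `summary()` rather than the `ols_regress()` output shown on the slides.

```{r interaction not needed toy sketch, echo=T}
# toy data (made up): both groups follow lines with the SAME slope
sq_ft  <- seq(1000, 3000, by = 250)           # 9 house sizes
wiggle <- rep(c(-5000, 5000, 0), times = 3)   # fixed pattern standing in for noise
toy_houses <- data.frame(
  Square_Feet = rep(sq_ft, times = 2),
  Remodeled   = rep(c("No", "Yes"), each = length(sq_ft))
)
toy_houses$Price <- 137549 + 137.879 * toy_houses$Square_Feet +
  90917 * (toy_houses$Remodeled == "Yes") + rep(wiggle, times = 2)

# Square_Feet * Remodeled expands to both main effects plus the
# Square_Feet:RemodeledYes term (the difference in slopes)
toy_lm    <- lm(Price ~ Square_Feet * Remodeled, data = toy_houses)
toy_coefs <- summary(toy_lm)$coefficients
round(toy_coefs, 4)
```

The row to read is `Square_Feet:RemodeledYes`. When its P-value is well above 0.05, as it is by construction here, the simpler parallel lines model is adequate.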
## Looking Ahead - What's Next?
- This week, ALL of the categorical models could be simplified to multiple SLRs, with same or different slopes.
- ALL of the variables have had P-values less than 0.05 so the terms were all useful (except on the previous slide).
- **There are many, many model options where these two facts are not true.**
- Time permitting, here's a brief look at a dataset with many explanatory variables:
::: fragment
```{r brief look at another dataset}
insurance <- read_csv("data/Insurance.csv", show_col_types=F)
head(insurance, 3) |> kable()
```
:::
##
### Insurance Data Model and Variable Selection
- There are 3 quantitative variables:
- **Age, BMI, and Children**
- There are 3 categorical variables:
- **Sex, Smoker, Region**
- There are literally hundreds of possible models including interaction terms.
- Note that an interaction can also be between two quantitative variables.
- You can also have interaction terms with three variables (but I try to avoid those).
- How do we sort through all of the possible options?
- Software helps us pare down all the possible models to a few choices.
- Analyst then uses critical thinking and examination of data to determine final model.
- Model and variable selection methods are the next set of topics.
##
### Key Points from Today
- **Categorical Interaction Model**
- Separate SLR for each group.
- BOTH slopes and intercepts can differ by category
- We can test if interaction term (slope difference) is significant.
- **Next Topics**
- Comparing model goodness of fit
- Introduction to variable selection
<br>
- **HW 6 is now available and is due on Wed. 3/4.**
- **Quiz 2 is 3/26**
::: fragment
**To submit an Engagement Question or Comment about material from Lecture 14:** Submit it by midnight today (day of lecture).
:::