Based on the Parameter Estimates table for the specified categorical regression model, which category is the baseline category?
Model Summary
--------------------------------------------------------------------------
R 0.924 RMSE 32079.481
R-Squared 0.854 MSE 1029093133.260
Adj. R-Squared 0.848 Coef. Var 8.223
Pred R-Squared 0.836 AIC 1352.620
MAE 28012.693 SBC 1360.792
--------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
----------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
----------------------------------------------------------------------------------
Regression 341892090457.686 2 170946045228.843 157.37 0.0000
Residual 58658308595.823 54 1086264973.997
Total 400550399053.509 56
----------------------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------------
(Intercept) 137549.093 17620.447 7.806 0.000 102222.224 172875.963
Square_Feet 137.879 10.836 0.670 12.725 0.000 116.155 159.602
RemodeledYes 90917.216 8834.268 0.542 10.291 0.000 73205.575 108628.858
-----------------------------------------------------------------------------------------------------
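The Model Summary values can be cross-checked against the ANOVA table. The following is a minimal sketch in base R that uses only the sums of squares and degrees of freedom printed above; the variable names are illustrative and are not part of the lecture code.

```r
# Values copied from the ANOVA table above
ss_regression <- 341892090457.686
ss_residual   <- 58658308595.823
ss_total      <- 400550399053.509
df_regression <- 2
df_residual   <- 54

# R-squared: share of the total variation explained by the model
r_squared <- ss_regression / ss_total

# Adjusted R-squared: compares mean squares instead of sums of squares
adj_r_squared <- 1 - (ss_residual / df_residual) /
  (ss_total / (df_regression + df_residual))

# F statistic: regression mean square divided by residual mean square
f_stat <- (ss_regression / df_regression) / (ss_residual / df_residual)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 3)
round(f_stat, 2)
```

The rounded results should agree with the R-Squared, Adj. R-Squared, and F values reported in the output.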
A Comment About Formatted Output
In HW 6 and below, I use R code to format the output to make it easier to read.
The values are IDENTICAL to the unformatted output.
Note: Formatted output will differ in appearance depending on where it is viewed (slides, HTML file, or .qmd file).
Formatted Abridged Output (Similar to HW 6)
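| model        |      Beta | Std.Error |     t | Sig |
|:-------------|----------:|----------:|------:|----:|
| (Intercept)  | 137549.09 |  17620.45 |  7.81 |   0 |
| Square_Feet  |    137.88 |     10.84 | 12.72 |   0 |
| RemodeledYes |  90917.22 |   8834.27 | 10.29 |   0 |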
Quick Review of Categorical Regression
On Tuesday we covered the Parallel Lines Model:
A Parallel Lines model has two X variables, one quantitative and one categorical variable.
Model estimates a separate SLR model for each category in the categorical variable.
Model assumes all categories have the same SLOPE.
Model estimates a separate INTERCEPT for each category.
Model output shows results of a hypothesis test to determine if each non-baseline category’s intercept is significantly different from baseline intercept.
Interactive Plot of House Remodel Data
Calculations from House Model
By default R chooses baseline categories alphabetically
No is before Yes so un-Remodeled houses are the baseline
Un-Remodeled (No) SLR Model:
Est. Price = 137549.093 + 137.879 * Square_Feet
Remodeled (Yes) SLR Model:
Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216
Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet
Est. Price = 228466.3 + 137.879 * Square_Feet
Interpretation:
Prices of remodeled houses are about 91 thousand dollars more than similar houses without remodeling, after accounting for square footage.
This difference is statistically significant (P-value < 0.001)
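The calculations above can be sketched in a few lines of base R. The coefficients are copied from the Parameter Estimates table; the helper `est_price()` and the 2000 square foot example are hypothetical and are not part of the lecture code.

```r
# Coefficients copied from the Parameter Estimates table
b_intercept <- 137549.093   # baseline (Remodeled = No) intercept
b_sqft      <- 137.879      # shared slope for Square_Feet
b_remodeled <- 90917.216    # RemodeledYes: intercept shift for remodeled houses

# Intercept for each category's SLR model
intercept_no  <- b_intercept
intercept_yes <- b_intercept + b_remodeled

# Hypothetical helper: estimated price for a given size and remodel status
est_price <- function(square_feet, remodeled = c("No", "Yes")) {
  remodeled <- match.arg(remodeled)
  intercept <- if (remodeled == "Yes") intercept_yes else intercept_no
  intercept + b_sqft * square_feet
}

# At any square footage, the two estimates differ by the RemodeledYes coefficient
est_price(2000, "Yes") - est_price(2000, "No")
```

Because the lines are parallel, the difference between the two estimates is the same at every value of `Square_Feet`.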
HW 6 - Questions 1 - 6
This part of HW 6 examines data similar to the House-Remodel data examined in Lecture 13 and the review question.
The dataset is smaller and the numbers are different, but the questions are essentially the same.
Categorical Regression with Interactions
The categorical models covered so far assume that the SLR models for all categories have the same slope.
How do we examine that assumption?
For example:
In the celebrity data in Lecture 13, earnings decreased as celebrities got older.
Slope was assumed to be IDENTICAL for both males and females
That may not be true for all celebrities.
In the following small dataset, we will look at male celebrities only and examine whether actors' and athletes' salaries follow the same trend.
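One way to relax the same-slope assumption is to add an interaction term to the model formula. The sketch below uses hypothetical toy data (not the lecture's `celeb_prof` data) with assumed trends for each group, and shows how the group-specific lines are recovered from the fitted coefficients.

```r
# Hypothetical toy data for illustration only (NOT the lecture's celeb_prof data)
age        <- rep(seq(20, 55, by = 5), times = 2)      # 8 ages per group
profession <- factor(rep(c("Actor", "Athlete"), each = 8))
noise      <- rep(c(1, -1, -1, 1), times = 4)          # small fixed deviations
earnings   <- ifelse(profession == "Actor",
                     -50 + 2 * age,                    # assumed Actor trend
                     180 - 3 * age) + noise            # assumed Athlete trend
toy <- data.frame(earnings, age, profession)

# age * profession expands to age + profession + age:profession
fit <- lm(earnings ~ age * profession, data = toy)
b <- coef(fit)

# Baseline (Actor) line is read directly from the output
actor_line <- c(intercept = unname(b["(Intercept)"]),
                slope     = unname(b["age"]))

# Non-baseline (Athlete) line adds the two difference terms
athlete_line <- c(intercept = unname(b["(Intercept)"] + b["professionAthlete"]),
                  slope     = unname(b["age"] + b["age:professionAthlete"]))

round(rbind(Actor = actor_line, Athlete = athlete_line), 3)
```

The `professionAthlete` coefficient is the difference in intercepts and the `age:professionAthlete` coefficient is the difference in slopes, both measured relative to the baseline category.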
Import and Examine Celebrity Profession Data
```{r import and examine celeb_prof data, echo=T}
# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types=F)
head(celeb_prof) |> kable()

# use table to summarize data by category
celeb_prof |> select(Profession) |> table()
```
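| Celebrity         | Earnings | Age | Profession |
|:------------------|---------:|----:|:-----------|
| Jim Parsons       |       29 |  44 | Actor      |
| Johnny Depp       |       48 |  53 | Actor      |
| Tom Cruise        |       53 |  55 | Actor      |
| Leonardo Dicaprio |       29 |  43 | Actor      |
| Jackie Chan       |       61 |  62 | Actor      |
| Mark Wahlberg     |       32 |  45 | Actor      |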
Profession
Actor Athlete
8 8
Examine Correlations in Celebrity Professions Data
Note: If categories have different slopes, correlations for whole dataset will be misleading.
celeb_prof |> select(Earnings, Age) |> cor() |> round(2) # all data
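To see why a whole-dataset correlation can mislead, consider a deliberately extreme toy example, sketched here in base R. The data are hypothetical and constructed so that the two groups have opposite slopes; they are not a lecture dataset.

```r
# Hypothetical toy data for illustration only (not the lecture's celeb_prof data)
x <- rep(1:10, times = 2)
g <- rep(c("A", "B"), each = 10)
y <- ifelse(g == "A", 2 * x, 22 - 2 * x)   # A rises with x, B falls with x

cor_all <- cor(x, y)                        # whole dataset
cor_a   <- cor(x[g == "A"], y[g == "A"])    # group A only
cor_b   <- cor(x[g == "B"], y[g == "B"])    # group B only

round(c(all = cor_all, A = cor_a, B = cor_b), 2)
```

Each group has a perfect linear relationship, but the opposite directions cancel out when the groups are pooled, so the overall correlation hides both trends.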
In this case, we don’t need to examine the P-values because the model differences between groups are so clear (but we will).
The two intercepts and two slopes are VERY different.
Reminder of Hypothesis Testing concepts:
The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.
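If this sentence is not clear to you, you are responsible for reviewing the Review Materials on Hypothesis Tests and Significance Tests (and other related topics):
MAS 261 Lecture Slides and Notes: https://penelope2040.quarto.pub/mas-261/#most-recent-lecture-material
Khan Academy - The Idea of Significance tests: https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/v/simple-hypothesis-testing
Khan Academy - Comparing a P-value to a Significance level: https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/comparing-p-value-from-t-statistic-to-significance-level?modal=1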
P-value for diff. in intercepts (GenreSuspense / Horror): < 0.001
Intercepts for these two distinct genre SLR models are _____ (Next Question).
P-value for diff. in slopes (Runtime:GenreSuspense / Horror term): < 0.001
Slopes for these two distinct genre SLR models are _____ (Next Question).
These interpretations are in agreement with what we can easily see in the Interactive Model Plot.
Lecture 14 In-class Exercises - Q4
Abridged Output
The difference in intercepts and the difference in slopes between the model for Action movies and the model for Suspense / Horror movies are both _____.
A. statistically significant
B. statistically insignificant
Lecture 14 In-class Exercises - Q5
Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.
Fill in the blank:
The smaller the P-value, the _____ evidence there is that the Beta coefficient is non-zero and the term is useful to the model.
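The 0.05 cutoff can be applied to every term in a fitted model at once. The sketch below uses hypothetical toy data (not a lecture dataset) in which group B's line is assumed to be shifted upward while both groups share the same true slope, so the interaction term is not expected to be useful.

```r
# Hypothetical toy data for illustration only (not a lecture dataset)
x     <- rep(seq(20, 55, by = 5), times = 2)
group <- factor(rep(c("A", "B"), each = 8))
noise <- c(1, -1, -1, 1, 1, -1, -1, 1,     # group A deviations
           1, -1, 1, -1, 1, -1, 1, -1)     # group B deviations

# Assumed truth: B is shifted up by 30, but both groups share slope 2
y <- 10 + 2 * x + ifelse(group == "B", 30, 0) + noise

fit <- lm(y ~ x * group)

# P-values for each model term (column name used by summary.lm)
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]

# Apply the 0.05 cutoff to each term
data.frame(p_value = signif(p_values, 3), significant = p_values < 0.05)
```

In this toy example the intercept difference (`groupB`) falls below the cutoff while the slope difference (`x:groupB`) does not, which suggests a parallel lines model would be adequate for these data.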
HW 6 - Questions 7 - 16
Dataset has three categories of Diamonds:
Colorless, Faint yellow, and Nearly colorless
Colorless is first alphabetically so that is the baseline category by default.
Each color category has a unique intercept AND a unique slope.
The interactive model plot and abridged regression output are provided.
All Blackboard questions can be answered by rendering .qmd file to examine .html output.
Helpful TIP: In addition to other recommended options, change preview option (see next slide).
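The alphabetical default can be checked, and changed if needed, by looking at the levels of a factor. This is a minimal base R sketch; the category labels come from the slide, but the short vector is illustrative and is not the HW 6 dataset.

```r
# Illustrative vector using the three color categories named above
color <- factor(c("Nearly colorless", "Faint yellow", "Colorless", "Faint yellow"))

# Levels are sorted alphabetically; the first level is the baseline category
levels(color)

# relevel() moves a chosen category to the front, making it the new baseline
color_relevel <- relevel(color, ref = "Faint yellow")
levels(color_relevel)
```

HW 6 uses the default baseline, so no releveling is required there; the second step only shows how the baseline could be changed in other analyses.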
HW 6 - Change HTML Preview Option
For HW 6 you do not have to write any R code.
Instead you are expected to correctly interpret provided output.
Quiz 2 will have similar output WITHOUT the interactive plots.
Change the following option in the Basic tab of the R Markdown options:
Show output preview in Viewer Pane
Looking Ahead - What’s Next?
This week, ALL of the categorical models could be simplified to multiple SLRs, with same or different slopes.
ALL of the variables have had P-values less than 0.05 so the terms were all useful.
There are many many model options where these two facts are not true.
Time permitting, here’s a brief look at a dataset with many explanatory variables:
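|   Charges | ln_charges | Age | Sex    |   BMI | Children | Smoker | Region    |
|----------:|-----------:|----:|:-------|------:|---------:|:-------|:----------|
| 16884.924 |   9.734176 |  19 | female | 27.90 |        0 | yes    | southwest |
|  1725.552 |   7.453303 |  18 | male   | 33.77 |        1 | no     | southeast |
|  4449.462 |   8.400538 |  28 | male   | 33.00 |        3 | no     | southeast |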
Insurance Data Model and Variable Selection
There are 3 quantitative variables:
Age, BMI, and Children
There are 3 categorical variables:
Sex, Smoker, Region
There are literally hundreds of possible models including interaction terms.
Note that an interaction can also be between two quantitative variables.
You can also have interaction terms with three variables (but I try to avoid those).
How do we sort through all of the possible options?
Software helps us pare down all the possible models to a few choices.
Analyst then uses critical thinking and examination of data to determine final model.
Model and variable selection methods are the next set of topics.
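To get a sense of how quickly the candidate terms multiply, the sketch below expands a formula containing all six explanatory variables and every two-way interaction. It uses only the column names shown above and base R's `terms()`; no data are needed, and the formula is an illustration rather than a recommended model.

```r
# All main effects plus all two-way interactions among the six variables
full_formula <- Charges ~ (Age + Sex + BMI + Children + Smoker + Region)^2

# Extract the names of the terms implied by the formula
term_labels <- attr(terms(full_formula), "term.labels")

# Separate main effects from interaction terms (interactions contain ":")
main_effects <- term_labels[!grepl(":", term_labels)]
two_way      <- term_labels[grepl(":", term_labels)]

length(main_effects)   # number of main effects
length(two_way)        # number of two-way interactions, equal to choose(6, 2)
two_way
```

The list includes interactions between two quantitative variables, such as `Age:BMI`, alongside the quantitative-by-categorical interactions covered in this lecture.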
Key Points from Today
Categorical Interaction Model
Separate SLR for each group.
BOTH slopes and intercepts can differ by category
We can test if interaction term (slope difference) is significant.
Next Topics
Comparing model goodness of fit
Introduction to variable selection
HW 6 is now available and is due on Wed. 3/5.
Date of Quiz 2 has been changed to Tuesday, 4/1.
To submit an Engagement Question or Comment about material from Lecture 14: Submit it by midnight today (day of lecture).
Source Code
---
title: "BUA 345 - Lecture 14"
subtitle: "Categorical Regression - Interaction Model"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
  html:
    code-line-numbers: true
    code-fold: true
    code-tools: true
execute:
  echo: fenced
---

## Housekeeping

```{r setup, echo=FALSE, warning=F, message=F, include=F}
#| include: false
# this line specifies default options for all R chunks
knitr::opts_chunk$set(echo=F)

# suppress scientific notation
options(scipen=100)

# install helper package that loads and installs other packages, if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")

# install and load required packages
pacman::p_load(pacman,tidyverse, magrittr, olsrr, shadowtext, mapproj, knitr, kableExtra,
               countrycode, usdata, maps, RColorBrewer, gridExtra, ggthemes, gt, tools, ggiraphExtra)

# verify packages
# p_loaded()
```

**HW 5 was due 2/26/2025** - 2 day grace period

**HW 6 was due 3/5/2025** - 2 day grace period

::: nonincremental
- Demo videos will be posted this weekend.
:::

**Quiz 2 will be on 4/1/2025** - Date has changed and syllabus has been updated.

### Today's plan

- Review Parallel Lines Model
- Introduce Interaction term and Interaction Model
- Work through how to interpret model output
- Introduce HW 6
- Talk about next steps

::: fragment
**In-class Polling (Session ID: bua345s25)**
:::

##

### Review Question - Import data

```{r import and examine house remodel data, echo=T}
house_remodel <- read_csv("data/house_remodel.csv", show_col_types = F)
head(house_remodel, 3) |> kable()
house_remodel |> select(Remodeled) |> table() # number of obs by category
house_remodel |> select(Price, Square_Feet) |> cor() |> round(2) # correlation of price & sq. ft.
```

##

### Lecture 14 In-class Exercises - Q1 - Review

Based on the `Parameter Estimates` table for the specified categorical regression model, which category is the baseline category?

::: r-fit-text
```{r house model specified}
(house_rem_cat_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel))
```
:::

##

### A Comment About Formatted Output

- In HW 6 and below, I use R code to format the output to make it easier to read.
- The values are IDENTICAL to the unformatted output.
- Note: Formatted output will differ in appearance depending on where it is viewed (slides, HTML file, or .qmd file).

::: fragment
#### Formatted Abridged Output (Similar to HW 6)

```{r house model formal param estimated kable table, echo=F}
# formatted regression output
# model is saved and printed to screen
house_rem_cat_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel)

model_out <- tibble(house_rem_cat_ols$mvars,           # select columns from saved model
                    house_rem_cat_ols$betas,
                    house_rem_cat_ols$std_errors,
                    house_rem_cat_ols$tvalues,
                    house_rem_cat_ols$pvalues) |>
  mutate(`model` = `house_rem_cat_ols$mvars`,          # round and rename columns
         `Beta` = round(`house_rem_cat_ols$betas`,2),
         `Std.Error` = round(`house_rem_cat_ols$std_errors`,2),
         `t` = round(`house_rem_cat_ols$tvalues`,2),
         `Sig` = round(`house_rem_cat_ols$pvalues`,4)) |>
  select(6:10)                                         # select output columns

model_out |> kable()
```
:::

##

### Quick Review of Categorical Regression

- On Tuesday we covered the `Parallel Lines` Model:
    - A Parallel Lines model has two X variables, one quantitative and one categorical variable.
    - Model estimates a separate SLR model for each category in the categorical variable.
    - Model assumes all categories have the same **SLOPE**.
    - Model estimates a separate **INTERCEPT** for each category.
    - Model output shows results of a hypothesis test to determine if each non-baseline category's intercept is significantly different from baseline intercept.

## Interactive Plot of House Remodel Data

```{r house remodel mlr model, echo=F}
# mlr model created using lm
rem_cat_lm <- lm(Price ~ Square_Feet + Remodeled, data=house_remodel)

# create and save interactive plot
(int_rem_mlr <- ggPredict(rem_cat_lm, interactive=T))
```

## Calculations from House Model

- By default R chooses baseline categories alphabetically
    - `No` is before `Yes` so un-`Remodeled` houses are the baseline
    - Un-Remodeled (`No`) SLR Model:
        - `Est. Price = 137549.093 + 137.879 * Square_Feet`
    - Remodeled (`Yes`) SLR Model:
        - `Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216`
        - `Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet`
        - `Est. Price = 228466.3 + 137.879 * Square_Feet`
- **Interpretation:**
    - Prices of remodeled houses are about ***91 thousand dollars more than similar houses without remodeling***, after accounting for square footage.
    - This difference is statistically significant (P-value \< 0.001)

## HW 6 - Questions 1 - 6

- This part of HW 6 examines data similar to the House-Remodel data examined in Lecture 13 and the review question.
- The dataset is smaller and the numbers are different, but the questions are essentially the same.

::: fragment
{fig-align="center"}
:::

##

### Categorical Regression with Interactions

- The categorical models covered so far assume that the `SLR` models for all categories have the same slope.
- How do we examine that assumption?
- For example:
    - In the celebrity data in Lecture 13, earnings decreased as celebrities got older.
    - Slope was assumed to be IDENTICAL for both males and females
    - That may not be true for all celebrities.
- In the following small dataset, we will look at male celebrities only and examine whether actors' and athletes' salaries follow the same trend.

##

### Import and Examine Celebrity Profession Data

```{r import and examine celeb_prof data, echo=T}
# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types=F)
head(celeb_prof) |> kable()

# use table to summarize data by category
celeb_prof |> select(Profession) |> table()
```

##

### Examine Correlations in Celebrity Professions Data

**Note:** If categories have different slopes, correlations for whole dataset will be misleading.

```{r correlations celeb_prof data, echo=TRUE}
celeb_prof |> select(Earnings, Age) |> cor() |> round(2) # all data

celeb_prof |> filter(Profession=="Actor") |>             # actors only
  select(Earnings, Age) |> cor() |> round(2)

celeb_prof |> filter(Profession=="Athlete") |>           # athletes only
  select(Earnings, Age) |> cor() |> round(2)
```

## Explore and Plot Data

::::: columns
::: {.column width="40%"}
- Scatter plot shows that a regression model should be created with
    - Different intercepts for each profession
    - Different slopes for each profession

```{r exploratory scatter plot code for celeb_prof data, echo=F}
celeb_sctrplot <- celeb_prof |>                          # scatterplot code
  ggplot() +
  geom_point(aes(x=Age, y=Earnings, color=Profession), size=4) +
  theme_classic() +
  theme(axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        legend.text = element_text(size=10),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::

::: {.column width="60%"}
```{r show celeb scatterplot, fig.dim=c(7,6) , echo=F}
celeb_sctrplot
```
:::
:::::

##

### Interactive Model Plot - Celebrity Professions

```{r celeb profession model plot, echo=FALSE}
# create linear model with categories and interaction
celeb_prof_lm <- lm(Earnings ~ Age + Profession + Age*Profession, data=celeb_prof)

# create interactive plot of model
# copy and paste this into console (below) to see plot in viewer
(clb_int <- ggPredict(celeb_prof_lm, interactive = T))
```

##

### Regression Model - Celebrity Professions

Now that we understand the data and linear trends, we can examine and interpret the regression model output.

::: r-fit-text
```{r celeb_prof regression}
(celeb_interaction_ols<- ols_regress(Earnings ~ Age + Profession + Age*Profession, data=celeb_prof, iterm = T))
```
:::

##

### Model Interpretation - Interpreting model coefficients (betas)

- **Baseline category is first alphabetically**
    - Actor comes before Athlete in the alphabet so **Actor is the baseline category**.
    - **Actor SLR Model:**
        - `Earnings = -50.297 + 1.824*Age`
- **`ProfessionAthlete` term:** Difference from Actor (baseline) model Intercept
- **`Age:ProfessionAthlete` term:** Difference from Actor model Slope
    - Athlete SLR Model requires some calculations:
        - `Earnings = -50.297 + 1.824*Age + 227.218 - 5.063*Age`
        - `Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age`
        - `Earnings = 176.921 - ____*Age`
    - ***Athlete SLR Model:*** `Earnings = 176.921 - ____*Age`

##

### Lecture 14 In-class Exercises - Q2

**What is the slope term (estimated `beta` for Age) for the Athlete SLR model?**

Specify answer to two decimal places.

#### Abridged Output

{fig-align="center"}

- **`ProfessionAthlete` term:** Difference from Actor (baseline) model Intercept
- **`Age:ProfessionAthlete` term:** Difference from Actor model Slope
    - `Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age`
    - ***Athlete SLR Model:*** `Earnings = 176.921 - ____*Age`

## Determining Statistical Significance

- In this case, we don't need to examine the P-values because the model differences between groups are so clear (but we will).
- The two intercepts and two slopes are **VERY** different.
- Reminder of Hypothesis Testing concepts:
    - **The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.**
    - If this sentence is not clear to you, **you are responsible for reviewing the Review Materials** on Hypothesis Tests and Significance Tests (and other related topics):
        - [**MAS 261 Lecture Slides and Notes**](https://penelope2040.quarto.pub/mas-261/#most-recent-lecture-material){target="_blank"}
        - [**Khan Academy - The Idea of Significance tests**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/v/simple-hypothesis-testing){target="_blank"}
        - [**Khan Academy - Comparing a P-value to a Significance level**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/comparing-p-value-from-t-statistic-to-significance-level?modal=1){target="_blank"}

##

### Conclusions from Actor and Athlete Model

#### Abridged Output

{fig-align="center"}

- **P-value for difference in intercepts (`ProfessionAthlete`):** \< 0.001
    - Actors SLR model and Athletes SLR model intercepts are significantly different.
- **P-value for difference in slopes (`Age:ProfessionAthlete`):** \< 0.001
    - Actors SLR model and Athletes SLR model slopes are significantly different.
- **These interpretations are in agreement with what we can easily see in the Interactive Model Plot.**

## Movie Genres and Costs

- Is length of a movie (`Runtime`) a good predictor of the movie budget?
- Does the relationship between movie length and budget differ by movie genre?

::: fragment
#### Import Movie Data

```{r import movies data, echo=T}
movies <- read_csv("data/movies.csv", show_col_types = F) # Import and examine data
head(movies, 4) |> kable()
movies |> select(Genre) |> table() # use table to examine categories in the data
```
:::

## Examine Correlations in Movie Data

**Note:** If categories have different slopes, correlations for whole dataset will be misleading.

```{r correlations movie data, echo=T}
movies |> select(Budget, Runtime) |> cor() |> round(2) # all data

movies |> filter(Genre == "Action") |>                 # action movies
  select(Budget, Runtime) |> cor() |> round(2)

movies |> filter(Genre == "Suspense / Horror") |>      # suspense/horror movies
  select(Budget, Runtime) |> cor() |> round(2)
```

## Explore and Plot Data

::::: columns
::: {.column width="40%"}
- Scatter plot shows that a regression model should be created with
    - Different intercepts for each genre
    - Different slopes for each genre

```{r exploratory scatter plot code for movie data, echo=F}
movie_scrtplot <- movies |>
  ggplot() +
  geom_point(aes(x=Runtime, y=Budget, color=Genre), size=4) +
  theme_classic() +
  labs(x="Run time (Movie Length)") +
  theme(axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        legend.text = element_text(size=10),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
```
:::

::: {.column width="60%"}
```{r show movie scatterplot, fig.dim=c(7,6), echo=F}
movie_scrtplot
```
:::
:::::

## Interactive Model Plot - Movie Data

```{r movie model plot, echo=F}
# create linear model with categories and interaction
movie_cat_lm <- lm(Budget ~ Runtime + Genre + Runtime*Genre, data=movies)

# create interactive plot of model
# copy and paste this into console (below) to see plot in viewer
(movie_int <- ggPredict(movie_cat_lm, interactive = T))
```

##

### Movie Genres Regression Model

Again, we can examine and interpret the regression model output.

```{r movie regression, results='hide'}
(movie_interaction_ols<- ols_regress(Budget ~ Runtime + Genre + Runtime*Genre, data=movies, iterm = T))
```

#### Abridged Output

{fig-align="center"}

##

### Lecture 14 In-class Exercises - Q3

**What is the intercept term (estimated `beta`) for the Suspense / Horror SLR model?**

- Specify answer to two decimal places.

::: fragment
#### Abridged Output

{fig-align="center"}
:::

- **`GenreSuspense / Horror` term:** Difference from Action (baseline) model Intercept
- **`Runtime:GenreSuspense / Horror` term:** Difference from Action model Slope
    - `Budget = (-286.67 + 218.75) + (3.25 - 2.35)*Runtime`
    - ***Suspense / Horror SLR Model:*** `Budget = ____ + 0.9*Runtime`

## Determining Statistical Significance

- In this case, we don't need to examine the P-values because the model differences between groups are so clear (but we will).
- The two intercepts and two slopes are **VERY** different.
- Reminder of Hypothesis Testing concepts:
    - **The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.**
    - If this sentence is not clear to you, **you are responsible for reviewing the Review Materials** on Hypothesis Tests and Significance Tests (and other related topics):
        - [**MAS 261 Lecture Slides and Notes**](https://penelope2040.quarto.pub/mas-261/#most-recent-lecture-material){target="_blank"}
        - [**Khan Academy - The Idea of Significance tests**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/v/simple-hypothesis-testing){target="_blank"}
        - [**Khan Academy - Comparing a P-value to a Significance level**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/comparing-p-value-from-t-statistic-to-significance-level?modal=1){target="_blank"}

##

### Conclusions from Movie Genre Model

#### Abridged Output

{fig-align="center"}

- **P-value for diff. in intercepts (`GenreSuspense / Horror`):** \< 0.001
    - Intercepts for these two distinct genre SLR models are `_____` (Next Question).
- **P-value for diff. in slopes (`Runtime:GenreSuspense / Horror` term):** \< 0.001
    - Slopes for these two distinct genre SLR models are `_____` (Next Question).
- **These interpretations are in agreement with what we can easily see in the Interactive Model Plot.**

##

### Lecture 14 In-class Exercises - Q4

#### Abridged Output

{fig-align="center"}

**The difference in intercepts and the difference in slopes between the model for Action movies and the model for Suspense / Horror movies are both `_____`.**

::: nonincremental
A. statistically significant

B. statistically insignificant
:::

##

### Lecture 14 In-class Exercises - Q5

Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.

<br>

Fill in the blank:

**The smaller the P-value, the `_____` evidence there is that the Beta coefficient is non-zero and the term is useful to the model.**

##

### HW 6 - Questions 7 - 16

- Dataset has three categories of Diamonds:
    - `Colorless`, `Faint yellow`, and `Nearly colorless`
- **Colorless** is first alphabetically so that is the **baseline category** by default.
    - Each color category has a unique intercept AND a unique slope.
    - The interactive model plot and abridged regression output are provided.
    - All Blackboard questions can be answered by rendering .qmd file to examine .html output.
- **Helpful TIP:** In addition to other recommended options, change preview option (see next slide).

##

### HW 6 - Change HTML Preview Option

- For HW 6 you do not have to write any R code.
- Instead you are expected to correctly interpret provided output.
- Quiz 2 will have similar output **WITHOUT the interactive plots**.
- Change the following option in the `Basic` tab of the `R Markdown` options:

::: fragment
{fig-align="center"}
:::

## Looking Ahead - What's Next?

- This week, ALL of the categorical models could be simplified to multiple SLRs, with same or different slopes.
- ALL of the variables have had P-values less than 0.05 so the terms were all useful.
- **There are many many model options where these two facts are not true.**
- Time permitting, here's a brief look at a dataset with many explanatory variables:

::: fragment
```{r brief look at another dataset}
insurance <- read_csv("data/Insurance.csv", show_col_types=F)
head(insurance, 3) |> kable()
```
:::

##

### Insurance Data Model and Variable Selection

- There are 3 quantitative variables:
    - **Age, BMI, and Children**
- There are 3 categorical variables:
    - **Sex, Smoker, Region**
- There are literally hundreds of possible models including interaction terms.
- Note that an interaction can also be between two quantitative variables.
- You can also have interaction terms with three variables (but I try to avoid those).
- How do we sort through all of the possible options?
    - Software helps us pare down all the possible models to a few choices.
    - Analyst then uses critical thinking and examination of data to determine final model.
- Model and variable selection methods are the next set of topics.

##

### Key Points from Today

- **Categorical Interaction Model**
    - Separate SLR for each group.
    - BOTH slopes and intercepts can differ by category
    - We can test if interaction term (slope difference) is significant.
- **Next Topics**
    - Comparing model goodness of fit
    - Introduction to variable selection

<br>

- **HW 6 is now available and is due on Wed. 3/5.**
- **Date of Quiz 2 has been changed to Tuesday, 4/1.**

::: fragment
**To submit an Engagement Question or Comment about material from Lecture 14:** Submit it by midnight today (day of lecture).
:::