BUA 345 - Lecture 14

Categorical Regression - Interaction Model

Author

Penelope Pooler Eisenbies

Published

February 27, 2025

Housekeeping

HW 5 was due 2/26/2025 - 2 day grace period

HW 6 is due 3/5/2025 - 2 day grace period

Demo videos will be posted this weekend.

Quiz 2 will be on 4/1/2025 - Date has changed and syllabus has been updated.

Today’s plan

Review Parallel Lines Model
Introduce Interaction term and Interaction Model
Work through how to interpret model output
Introduce HW 6
Talk about next steps

In-class Polling (Session ID: bua345s25)

Review Question - Import data

```{r import and examine house remodel data, echo=T}
house_remodel <- read_csv("data/house_remodel.csv", show_col_types = F)
head(house_remodel, 3) |> kable()

house_remodel |> select(Remodeled) |> table()                     # number of obs by category
house_remodel |> select(Price, Square_Feet) |> cor() |> round(2)  # correlation of price & sq. ft.
```

| Price | Square_Feet | Remodeled |
|---|---|---|
| 554000 | 2702 | No |
| 484000 | 2378 | No |
| 391000 | 1846 | No |

Remodeled
 No Yes 
 29  28 
            Price Square_Feet
Price        1.00        0.75
Square_Feet  0.75        1.00

Lecture 14 In-class Exercises - Q1 - Review

Based on the Parameter Estimates table for the specified categorical regression model, which category is the baseline category?

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.924       RMSE                    32079.481 
R-Squared                   0.854       MSE                1029093133.260 
Adj. R-Squared              0.848       Coef. Var                   8.223 
Pred R-Squared              0.836       AIC                      1352.620 
MAE                     28012.693       SBC                      1360.792 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                      ANOVA                                        
----------------------------------------------------------------------------------
                        Sum of                                                    
                       Squares        DF         Mean Square       F         Sig. 
----------------------------------------------------------------------------------
Regression    341892090457.686         2    170946045228.843     157.37    0.0000 
Residual       58658308595.823        54      1086264973.997                      
Total         400550399053.509        56                                          
----------------------------------------------------------------------------------

                                         Parameter Estimates                                          
-----------------------------------------------------------------------------------------------------
       model          Beta    Std. Error    Std. Beta      t        Sig          lower         upper 
-----------------------------------------------------------------------------------------------------
 (Intercept)    137549.093     17620.447                  7.806    0.000    102222.224    172875.963 
 Square_Feet       137.879        10.836        0.670    12.725    0.000       116.155       159.602 
RemodeledYes     90917.216      8834.268        0.542    10.291    0.000     73205.575    108628.858 
-----------------------------------------------------------------------------------------------------

A Comment About Formatted Output

In HW 6 and in the output below, I use R code to format the output so that it is easier to read.
The values are IDENTICAL to the unformatted output.
Note: Formatted output will differ in appearance depending on where it is viewed, i.e. slides, .html file, or .qmd file.

Formatted Abridged Output (Similar to HW 6)

| model | Beta | Std.Error | t |
|---|---|---|---|
| (Intercept) | 137549.09 | 17620.45 | 7.81 |
| Square_Feet | 137.88 | 10.84 | 12.72 |
| RemodeledYes | 90917.22 | 8834.27 | 10.29 |

Quick Review of Categorical Regression

On Tuesday we covered the Parallel Lines Model:
- A Parallel Lines model has two X variables, one quantitative and one categorical variable.
- Model estimates a separate SLR model for each category in the categorical variable.
- Model assumes all categories have the same SLOPE.
- Model estimates a separate INTERCEPT for each category.
- Model output shows results of a hypothesis test to determine if each non-baseline category’s intercept is significantly different from baseline intercept.
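The review points above can be sketched in base R. This is a minimal illustration on simulated data, not the lecture's `house_remodel` file: the data frame `toy_houses`, its column names, and the numbers used to generate it are all hypothetical.

```r
# Hypothetical data (not the lecture's house_remodel file): two groups that
# share one slope but have different intercepts.
set.seed(345)
n <- 30
sq_ft <- runif(2 * n, min = 1200, max = 3200)
remodeled <- factor(rep(c("No", "Yes"), each = n))
price <- 140000 + 135 * sq_ft + 90000 * (remodeled == "Yes") +
  rnorm(2 * n, sd = 20000)
toy_houses <- data.frame(price, sq_ft, remodeled)

# Parallel Lines model: one quantitative X plus one categorical X, no interaction
parallel_fit <- lm(price ~ sq_ft + remodeled, data = toy_houses)
b <- coef(parallel_fit)

# One shared slope; the "remodeledYes" term shifts only the intercept
line_no  <- c(intercept = unname(b["(Intercept)"]), slope = unname(b["sq_ft"]))
line_yes <- c(intercept = unname(b["(Intercept)"] + b["remodeledYes"]),
              slope = unname(b["sq_ft"]))
rbind(No = line_no, Yes = line_yes)
```

Both rows of the result have the same slope, which is exactly the assumption the Parallel Lines model makes.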

Interactive Plot of House Remodel Data

Calculations from House Model

By default R chooses baseline categories alphabetically
- No is before Yes so un-Remodeled houses are the baseline
- Un-Remodeled (No) SLR Model:
  - Est. Price = 137549.093 + 137.879 * Square_Feet
- Remodeled (Yes) SLR Model:
  - Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216
  - Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet
  - Est. Price = 228466.3 + 137.879 * Square_Feet
Interpretation:
- Remodeled houses are estimated to sell for about 91 thousand dollars more than houses of the same square footage that were not remodeled.
- This difference is statistically significant (P-value < 0.001)
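The arithmetic above can be checked with a short base R sketch. The three coefficients are copied from the Parameter Estimates table; the helper `est_price()` and the 2000 sq. ft. example size are hypothetical and used only for illustration.

```r
# Coefficients copied from the Parameter Estimates table above
b_intercept <- 137549.093
b_sqft      <- 137.879
b_remodeled <- 90917.216   # RemodeledYes: shift in intercept for remodeled houses

# Estimated price under the Parallel Lines model (hypothetical helper)
est_price <- function(square_feet, remodeled = c("No", "Yes")) {
  remodeled <- match.arg(remodeled)
  b_intercept + b_sqft * square_feet + b_remodeled * (remodeled == "Yes")
}

# 2000 sq. ft. is an arbitrary illustrative size
price_no  <- est_price(2000, "No")
price_yes <- est_price(2000, "Yes")
price_yes - price_no   # the gap equals the RemodeledYes coefficient at any size
```

Because the lines are parallel, changing 2000 to any other size leaves the difference between the two estimates unchanged.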

HW 6 - Questions 1 - 6

This part of HW 6 examines data similar to the House-Remodel data examined in Lecture 13 and the review question.
The dataset is smaller and the numbers are different, but the questions are essentially the same.

Categorical Regression with Interactions

The categorical models covered so far assume that the SLR models for all categories have the same slope.
How do we examine that assumption?
For example:
- In the celebrity data in Lecture 13, earnings decreased as celebrities got older.
- The slope was assumed to be IDENTICAL for males and females.
- That assumption may not hold for all groups of celebrities.
In the following small dataset, we look at male celebrities only and examine whether actors’ and athletes’ earnings follow the same trend with age.

Import and Examine Celebrity Profession Data

```{r import and examine celeb_prof data, echo=T}
# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types=F) 
head(celeb_prof) |> kable()

# use table to summarize data by category
celeb_prof |> select(Profession) |> table()
```

| Celebrity | Earnings | Age | Profession |
|---|---|---|---|
| Jim Parsons | 29 | 44 | Actor |
| Johnny Depp | 48 | 53 | Actor |
| Tom Cruise | 53 | 55 | Actor |
| Leonardo Dicaprio | 29 | 43 | Actor |
| Jackie Chan | 61 | 62 | Actor |
| Mark Wahlberg | 32 | 45 | Actor |

Profession
  Actor Athlete 
      8       8

Examine Correlations in Celebrity Professions Data

Note: If categories have different slopes, correlations for whole dataset will be misleading.

```{r correlations celeb_prof data, echo=TRUE}
celeb_prof |> select(Earnings, Age) |> cor() |> round(2) # all data

celeb_prof |> filter(Profession=="Actor") |>             # actors only
  select(Earnings, Age) |> cor() |> round(2)

celeb_prof |> filter(Profession=="Athlete") |>           # athletes only
  select(Earnings, Age) |> cor() |> round(2)
```

         Earnings   Age
Earnings     1.00 -0.46
Age         -0.46  1.00
         Earnings  Age
Earnings     1.00 0.99
Age          0.99 1.00
         Earnings   Age
Earnings     1.00 -0.98
Age         -0.98  1.00
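To see why the pooled correlation can mislead, here is a small base R sketch on simulated data. The data frame `toy`, its two groups `A` and `B`, and the generating equations are hypothetical; they only mimic the pattern of one group trending up and the other trending down.

```r
# Hypothetical data: one group trends up with x, the other trends down
set.seed(14)
x_up   <- seq(40, 62, length.out = 8)
x_down <- seq(22, 40, length.out = 8)
toy <- data.frame(
  x     = c(x_up, x_down),
  y     = c(10 + 2 * x_up + rnorm(8), 200 - 3 * x_down + rnorm(8)),
  group = rep(c("A", "B"), each = 8)
)

cor_all <- cor(toy$x, toy$y)                      # pooled: mixes both trends
cor_by_group <- sapply(split(toy, toy$group),     # one correlation per group
                       function(d) cor(d$x, d$y))
round(c(all = cor_all, cor_by_group), 2)
```

Each group on its own shows a very strong linear relationship, while the pooled value is much weaker because the two opposite trends partly cancel.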

Explore and Plot Data

Scatter plot shows that a regression model should be created with
- Different intercepts for each profession
- Different slopes for each profession

Interactive Model Plot - Celebrity Professions

Regression Model - Celebrity Professions

Now that we understand the data and linear trends, we can examine and interpret the regression model output.

                        Model Summary                          
--------------------------------------------------------------
R                       0.987       RMSE                2.640 
R-Squared               0.974       MSE                 6.968 
Adj. R-Squared          0.967       Coef. Var           6.058 
Pred R-Squared          0.951       AIC                86.467 
MAE                     2.265       SBC                90.330 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4131.949         3       1377.316    148.246    0.0000 
Residual       111.489        12          9.291                      
Total         4243.437        15                                     
---------------------------------------------------------------------

                                         Parameter Estimates                                           
------------------------------------------------------------------------------------------------------
                model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
------------------------------------------------------------------------------------------------------
          (Intercept)    -50.297         9.054                  -5.555    0.000    -70.023    -30.571 
                  Age      1.824         0.179        0.983     10.170    0.000      1.433      2.215 
    ProfessionAthlete    227.218        12.389        0.263     18.340    0.000    200.224    254.212 
Age:ProfessionAthlete     -5.063         0.293       -1.487    -17.294    0.000     -5.701     -4.425 
------------------------------------------------------------------------------------------------------

Model Interpretation - Interpreting model coefficients (betas)

Baseline category is first alphabetically
- Actor comes before Athlete in the alphabet so Actor is the baseline category.
- **Actor SLR Model:**
  - Earnings = -50.297 + 1.824*Age
ProfessionAthlete term: Difference from Actor (baseline) model Intercept
Age:ProfessionAthlete term: Difference from Actor model Slope
- Athlete SLR Model requires some calculations:
  - Earnings = -50.297 + 1.824*Age + 227.218 - 5.063*Age
  - Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age
  - Earnings = 176.921 - ____*Age
  - Athlete SLR Model: Earnings = 176.921 - ____*Age
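The same bookkeeping can be written as a small base R helper. This is a sketch only: `group_lines()` is a hypothetical function, and the betas passed to it are made-up round numbers rather than the lecture output, so the in-class question is left for you to work out.

```r
# Turn interaction-model coefficients into one SLR line per category.
# Arguments are the four betas in the order they appear in the output.
group_lines <- function(b0, b_x, b_group, b_x_group) {
  data.frame(
    category  = c("baseline", "non-baseline"),
    intercept = c(b0, b0 + b_group),      # group term shifts the intercept
    slope     = c(b_x, b_x + b_x_group)   # interaction term shifts the slope
  )
}

# Hypothetical betas chosen for easy arithmetic (not from the lecture output)
toy_lines <- group_lines(b0 = 10, b_x = 2, b_group = 5, b_x_group = -3)
toy_lines
```

The baseline row simply repeats the first two betas; the non-baseline row adds the two difference terms to them.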

Lecture 14 In-class Exercises - Q2

What is the slope term (estimated beta for Age) for the Athlete SLR model?

Specify answer to two decimal places.

Abridged Output

ProfessionAthlete term: Difference from Actor (baseline) model Intercept
Age:ProfessionAthlete term: Difference from Actor model Slope
- Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age
- Athlete SLR Model: Earnings = 176.921 - ____*Age

Determining Statistical Significance

In this case, we don’t need to examine the P-values because the model differences between groups are so clear (but we will).
The two intercepts and two slopes are VERY different.
Reminder of Hypothesis Testing concepts:
- The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.
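- If this sentence is not clear to you, you are responsible for reviewing the Review Materials on Hypothesis Tests and Significance Tests (and other related topics):
  - [MAS 261 Lecture Slides and Notes](https://penelope2040.quarto.pub/mas-261/#most-recent-lecture-material)
  - [Khan Academy - The Idea of Significance tests](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/v/simple-hypothesis-testing)
  - [Khan Academy - Comparing a P-value to a Significance level](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/comparing-p-value-from-t-statistic-to-significance-level?modal=1)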

Conclusions from Actor and Athlete Model

Abridged Output

P-value for difference in intercepts (ProfessionAthlete): < 0.001
- Actors SLR model and Athletes SLR model intercepts are significantly different.
P-value for difference in slopes (Age:ProfessionAthlete): < 0.001
- Actors SLR model and Athletes SLR model slopes are significantly different.
These interpretations are in agreement with what we can easily see in the Interactive Model Plot.
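For reference, the P-values in a table like this can also be pulled out of a fitted model in base R. The sketch below uses simulated data: `toy_celebs`, its columns, and the generating equations are hypothetical, and the lecture itself reports output from `ols_regress()` rather than `summary(lm())`.

```r
# Hypothetical data with clearly different intercepts and slopes by group
set.seed(2025)
age  <- rep(seq(25, 60, by = 5), times = 2)
prof <- factor(rep(c("Actor", "Athlete"), each = 8))
earn <- ifelse(prof == "Actor", 20 + 1.0 * age, 120 - 1.5 * age) +
  rnorm(16, sd = 2)
toy_celebs <- data.frame(earn, age, prof)

fit <- lm(earn ~ age * prof, data = toy_celebs)

# The "Pr(>|t|)" column of the coefficient table holds each term's P-value
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]
data.frame(term = names(p_values),
           p_value = signif(unname(p_values), 3),
           significant = unname(p_values) < 0.05)
```

The `profAthlete` row tests the difference in intercepts and the `age:profAthlete` row tests the difference in slopes, mirroring the two conclusions above.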

Movie Genres and Costs

Is length of a movie (Runtime) a good predictor of the movie budget?
Does the relationship between movie length and budget differ by movie genre?

Import Movie Data

```{r import movies data, echo=T}
movies <- read_csv("data/movies.csv", show_col_types = F)  # Import and examine data
head(movies, 4) |> kable()

movies |> select(Genre) |> table()  # use table to examine categories in the data
```

| Movie | Genre | Budget | Runtime |
|---|---|---|---|
| Paranormal Activity 3 | Suspense / Horror | 5 | 83 |
| The Others | Suspense / Horror | 17 | 104 |
| The Lincoln Lawyer | Suspense / Horror | 40 | 118 |
| Fright Night | Suspense / Horror | 30 | 106 |

Genre
           Action Suspense / Horror 
               10                10

Examine Correlations in Movie Data

Note: If categories have different slopes, correlations for whole dataset will be misleading.

```{r correlations movie data, echo=T}
movies |> select(Budget, Runtime) |> cor() |> round(2) # all data

movies |> filter(Genre == "Action") |>                 # action movies
  select(Budget, Runtime) |> cor() |> round(2)

movies |> filter(Genre == "Suspense / Horror") |>      # suspense/horror movies
  select(Budget, Runtime) |> cor() |> round(2)
```

        Budget Runtime
Budget    1.00    0.76
Runtime   0.76    1.00
        Budget Runtime
Budget    1.00    0.95
Runtime   0.95    1.00
        Budget Runtime
Budget    1.00    0.99
Runtime   0.99    1.00

Explore and Plot Data

Scatter plot shows that a regression model should be created with
- Different intercepts for each genre
- Different slopes for each genre

Interactive Model Plot - Movie Data

Movie Genres Regression Model

Again, we can examine and interpret the regression model output.

Abridged Output

Lecture 14 In-class Exercises - Q3

What is the intercept term (estimated beta) for the Suspense / Horror SLR model?

Specify answer to two decimal places.

Abridged Output

GenreSuspense / Horror term: Difference from Action (baseline) model Intercept
Runtime:GenreSuspense / Horror term: Difference from Action model Slope
- Budget = (-286.67 + 218.75) + (3.25 - 2.35)*Runtime
- Suspense / Horror SLR Model: Budget = ____ + 0.90*Runtime
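Term names such as `Runtime:GenreSuspense / Horror` come straight from the model formula. The base R sketch below, on simulated data, shows that three ways of writing the formula give the same four terms and that the non-baseline line matches a separate SLR fit to that genre alone. The data frame `toy_movies` and its generating numbers are hypothetical.

```r
# Hypothetical data: two genres with different intercepts and slopes
set.seed(6)
runtime <- rep(seq(85, 130, by = 5), times = 2)
genre   <- factor(rep(c("Action", "Suspense / Horror"), each = 10))
budget  <- ifelse(genre == "Action", -150 + 2.5 * runtime, -20 + 0.5 * runtime) +
  rnorm(20, sd = 3)
toy_movies <- data.frame(budget, runtime, genre)

fit_star  <- lm(budget ~ runtime * genre, data = toy_movies)
fit_long  <- lm(budget ~ runtime + genre + runtime * genre, data = toy_movies)
fit_colon <- lm(budget ~ runtime + genre + runtime:genre, data = toy_movies)

# All three formulas produce the same four terms
names(coef(fit_star))

# Coefficient names with spaces are ordinary strings, so index them by name
b <- coef(fit_star)
sh_intercept <- unname(b["(Intercept)"] + b["genreSuspense / Horror"])
sh_slope     <- unname(b["runtime"] + b["runtime:genreSuspense / Horror"])

# Same line as an SLR fit to Suspense / Horror movies only
sh_only <- lm(budget ~ runtime,
              data = subset(toy_movies, genre == "Suspense / Horror"))
rbind(from_interaction = c(sh_intercept, sh_slope), separate_slr = coef(sh_only))
```

The two rows agree, which is why the interaction model can be read as one SLR per category.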

Determining Statistical Significance

In this case, we don’t need to examine the P-values because the model differences between groups are so clear (but we will).
The two intercepts and two slopes are VERY different.
Reminder of Hypothesis Testing concepts:
- The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.
- If this sentence is not clear to you, you are responsible for reviewing the Review Materials on Hypothesis Tests and Significance Tests (and other related topics).

Conclusions from Movie Genre Model

Abridged Output

P-value for diff. in intercepts (GenreSuspense / Horror): < 0.001
- Intercepts for these two distinct genre SLR models are _____ (Next Question).
P-value for diff. in slopes (Runtime:GenreSuspense / Horror term): < 0.001
- Slopes for these two distinct genre SLR models are _____ (Next Question).
These interpretations are in agreement with what we can easily see in the Interactive Model Plot.

Lecture 14 In-class Exercises - Q4

Abridged Output

The difference in intercepts and the difference in slopes between the model for Action movies and the model for Suspense / Horror movies are both _____.

A. statistically significant

B. statistically insignificant

Lecture 14 In-class Exercises - Q5

Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.

Fill in the blank:

The smaller the P-value, the _____ evidence there is that the Beta coefficient is non-zero and the term is useful to the model.

HW 6 - Questions 7 - 16

Dataset has three categories of Diamonds:
- Colorless, Faint yellow, and Nearly colorless
Colorless is first alphabetically so that is the baseline category by default.
- Each color category has unique intercept AND a unique slope.
- The interactive model plot and abridged regression output are provided.
- All Blackboard questions can be answered by rendering .qmd file to examine .html output.
Helpful TIP: In addition to other recommended options, change preview option (see next slide).
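The default baseline can be checked directly in base R. In this sketch the category names come from the HW 6 description above, but the vector `color` itself is a hypothetical stand-in for the diamond data.

```r
# Category names from the HW 6 description; the data values are hypothetical
color <- factor(c("Nearly colorless", "Colorless", "Faint yellow", "Colorless"))

levels(color)        # sorted alphabetically; the first level is the baseline
default_baseline <- levels(color)[1]

# relevel() picks a different baseline without changing the data
color_fy <- relevel(color, ref = "Faint yellow")
new_baseline <- levels(color_fy)[1]
```

HW 6 uses the default baseline, so `relevel()` is shown only to make clear that the baseline is a choice, not a property of the data.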

HW 6 - Change HTML Preview Option

For HW 6 you do not have to write any R code.
Instead you are expected to correctly interpret provided output.
Quiz 2 will have similar output WITHOUT the interactive plots.
Change the following option in the Basic tab of the R Markdown options:

Show output preview in **Viewer Pane**

Looking Ahead - What’s Next?

This week, ALL of the categorical models could be simplified to multiple SLRs, with same or different slopes.
ALL of the variables have had P-values less than 0.05 so the terms were all useful.
There are many models for which these two facts do not hold.
Time permitting, here’s a brief look at a dataset with many explanatory variables:

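| Charges | ln_charges | Age | Sex | BMI | Children | Smoker | Region |
|---|---|---|---|---|---|---|---|
| 16884.924 | 9.734176 | 19 | female | 27.90 | 0 | yes | southwest |
| 1725.552 | 7.453303 | 18 | male | 33.77 | 1 | no | southeast |
| 4449.462 | 8.400538 | 28 | male | 33.00 | 3 | no | southeast |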

Insurance Data Model and Variable Selection

There are 3 quantitative variables:
- Age, BMI, and Children
There are 3 categorical variables:
- Sex, Smoker, Region
There are literally hundreds of possible models including interaction terms.
Note that an interaction can also be between two quantitative variables.
You can also have interaction terms with three variables (but I try to avoid those).
How do we sort through all of the possible options?
- Software helps us pare down all the possible models to a few choices.
- Analyst then uses critical thinking and examination of data to determine final model.
Model and variable selection methods are the next set of topics.
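A rough count shows how quickly the options grow. The base R sketch below only counts main-effects-only models and candidate two-way interactions for the six predictors listed above; it is an illustration of scale, not a model selection method.

```r
predictors <- c("Age", "BMI", "Children", "Sex", "Smoker", "Region")
k <- length(predictors)

# Each predictor is either in or out, so there are 2^k main-effects-only models
n_main_effect_models <- 2^k

# Every pair of predictors is a candidate two-way interaction
two_way <- combn(predictors, 2, FUN = paste, collapse = ":")
n_two_way <- length(two_way)

c(main_effect_models = n_main_effect_models, two_way_interactions = n_two_way)
```

Allowing any of those interactions to be added on top of the main-effects models pushes the total well past what an analyst can check by hand, which is why software is used to narrow the choices.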

Key Points from Today

Categorical Interaction Model
- Separate SLR for each group.
- BOTH slopes and intercepts can differ by category
- We can test if interaction term (slope difference) is significant.
Next Topics
- Comparing model goodness of fit
- Introduction to variable selection
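One way to test the interaction term, sketched here in base R on simulated data, is to compare the Parallel Lines model with the Interaction model directly. The data frame `toy` and its generating equations are hypothetical, and `anova()` is used here as one possible approach; the lecture itself reads the P-value from the Parameter Estimates table.

```r
# Hypothetical data: groups A and B with different slopes
set.seed(14)
x <- rep(1:10, times = 2)
g <- factor(rep(c("A", "B"), each = 10))
y <- ifelse(g == "A", 5 + 2 * x, 20 + 0.5 * x) + rnorm(20)
toy <- data.frame(x, y, g)

fit_parallel    <- lm(y ~ x + g, data = toy)   # same slope for both groups
fit_interaction <- lm(y ~ x * g, data = toy)   # slope may differ by group

# F-test for whether the extra interaction term improves the fit
cmp <- anova(fit_parallel, fit_interaction)
p_from_anova <- cmp[["Pr(>F)"]][2]

# With one interaction term this matches the t-test P-value in the output
p_from_t <- summary(fit_interaction)$coefficients["x:gB", "Pr(>|t|)"]
c(anova = p_from_anova, t_test = p_from_t)
```

With only two categories the two P-values agree, so either view leads to the same conclusion about whether the slopes differ.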

HW 6 is now available and is due on Wed. 3/5.
Date of Quiz 2 has been changed to Tuesday, 4/1.

To submit an Engagement Question or Comment about material from Lecture 14: Submit it by midnight today (day of lecture).

--- title: "BUA 345 - Lecture 14" subtitle: "Categorical Regression - Interaction Model" author: "Penelope Pooler Eisenbies" date: last-modified lightbox: true toc: true toc-depth: 3 toc-location: left toc-title: "Table of Contents" toc-expand: 1 format: html: code-line-numbers: true code-fold: true code-tools: true execute: echo: fenced --- ## Housekeeping ```{r setup, echo=FALSE, warning=F, message=F, include=F} #| include: false # this line specifies options for default options for all R Chunks knitr::opts_chunk$set(echo=F) # suppress scientific notation options(scipen=100) # install helper package that loads and installs other packages, if needed if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/") # install and load required packages pacman::p_load(pacman,tidyverse, magrittr, olsrr, shadowtext, mapproj, knitr, kableExtra, countrycode, usdata, maps, RColorBrewer, gridExtra, ggthemes, gt, tools, ggiraphExtra) # verify packages # p_loaded() ``` **HW 5 was due 2/26/2025** - 2 day grace period **HW 6 was due 3/5/2025** - 2 day grace period ::: nonincremental - Demo videos will be posted this weekend. ::: **Quiz 2 will be on 4/1/2025** - Date has changed and syllabus has been updated. ### Today's plan - Review Parallel Lines Model - Introduce Interaction term and Interaction Model - Work through how to interpret model output - Introduce HW 6 - Talk about next steps ::: fragment **In-class Polling (Session ID: bua345s25)** ::: ## ### Review Question - Import data ```{r import and examine house remodel data, echo=T} house_remodel <- read_csv("data/house_remodel.csv", show_col_types = F) head(house_remodel, 3) |> kable() house_remodel |> select(Remodeled) |> table() # number of obs by category house_remodel |> select(Price, Square_Feet) |> cor() |> round(2) # correlation of price & sq. ft. 
``` ## ### Lecture 14 In-class Exercises - Q1 - Review Based on the `Parameter Estimates` table for the specified categorical regression model, which category is the baseline category? ::: r-fit-text ```{r house model specified} (house_rem_cat_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel)) ``` ::: ## ### A Comment About Formatted Output - In HW 6 and below I use R coding to format the output to make it easier to read. - The values are IDENTICAL to the unformatted output. - Note: Formatted Output will differ in appearance depending on where it is viewed, i.e. slides, html file, or .Qmd file. ::: fragment #### Formatted Abridged Output (Similar to HW 6) ```{r house model formal param estimated kable table, echo=F} # formatted regression output # model is saved and printed to screen house_rem_cat_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel) model_out <- tibble(house_rem_cat_ols$mvars, # select columns from saved model house_rem_cat_ols$betas, house_rem_cat_ols$std_errors, house_rem_cat_ols$tvalues, house_rem_cat_ols$pvalues) |> mutate(`model` = `house_rem_cat_ols$mvars`, # round and rename columns `Beta` = round(`house_rem_cat_ols$betas`,2), `Std.Error` = round(`house_rem_cat_ols$std_errors`,2), `t` = round(`house_rem_cat_ols$tvalues`,2), `Sig` = round(`house_rem_cat_ols$pvalues`,4)) |> select(6:10) # select output columns model_out |> kable() ``` ::: ## ### Quick Review of Categorical Regression - On Tuesday we covered the `Parallel Lines` Model: - A Parallel Lines model has two X variables, one quantitative and one categorical variable. - Model estimates a separate SLR model for each category in the categorical variable. - Model assumes all categories have the same **SLOPE**. - Model estimates a separate **INTERCEPT** for each category. - Model output shows results of a hypothesis test to determine if each non-baseline category's intercept is significantly different from baseline intercept. 
## Interactive Plot of House Remodel Data ```{r house remodel mlr model, echo=F} # mlr model created using lm rem_cat_lm <- lm(Price ~ Square_Feet + Remodeled, data=house_remodel) # create and save interactive plot (int_rem_mlr <- ggPredict(rem_cat_lm, interactive=T)) ``` ## Calculations from House Model - By default R chooses baseline categories alphabetically - `No` is before `Yes` so un-`Remodeled` houses are the baseline - Un-Remodeled (`No`) SLR Model: - `Est. Price = 137549.093 + 137.879 * Square_Feet` - Remodeled (`Yes`) SLR Model: - `Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216` - `Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet` - `Est. Price = 228466.3 + 137.879 * Square_Feet` - **Interpretation:** - Prices of remodeled houses are about ***91 thousand dollars more than similar houses without remodeling***, after accounting for square footage. - This difference is statistically significant (P-value \< 0.001) ## HW 6 - Questions 1 - 6 - This part of HW 6 examines data similar to the House-Remodel data examined in Lecture 13 and the review question. - The dataset is smaller and the numbers are different, but the questions are essentially the same. ::: fragment ![](img/hw6_parameter_est1_.png){fig-align="center"} ::: ## ### Categorical Regression with Interactions - The categorical models covered so far assume that the `SLR` models for all categories have the same slope. - How do we examine that assumption? - For example: - In the celebrity data in Lecture 13, the data showed a decrease in earnings as they got older. - Slope was assumed to be IDENTICAL for both males and females - That may not be true for all celebrities. - In the following small dataset, we will look at male celebrities only and examine if actors and athletes salaries follow the same trend. 
## ### Import and Examine Celebrity Profession Data ```{r import and examine celeb_prof data, echo=T} # import and examine celeb profession dataset celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types=F) head(celeb_prof) |> kable() # use table to summarize data by category celeb_prof |> select(Profession) |> table() ``` ## ### Examine Correlations in Celebrity Professions Data **Note:** If categories have different slopes, correlations for whole dataset will be misleading. ```{r correlations celeb_prof data, echo=TRUE} celeb_prof |> select(Earnings, Age) |> cor() |> round(2) # all data celeb_prof |> filter(Profession=="Actor") |> # actors only select(Earnings, Age) |> cor() |> round(2) celeb_prof |> filter(Profession=="Athlete") |> # athletes only select(Earnings, Age) |> cor() |> round(2) ``` ## Explore and Plot Data ::::: columns ::: {.column width="40%"} - Scatter plot shows that a regression model should be created with - Different intercepts for each profession - Different slopes for each profession ```{r exploratory scatter plot code for celeb_prof data, echo=F} celeb_sctrplot <- celeb_prof |> #scatterplot code ggplot() + geom_point(aes(x=Age, y=Earnings, color=Profession), size=4) + theme_classic() + theme(axis.title = element_text(size=18), axis.text = element_text(size=15), legend.text = element_text(size=10), plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)) ``` ::: ::: {.column width="60%"} ```{r show celeb scatterplot, fig.dim=c(7,6) , echo=F} celeb_sctrplot ``` ::: ::::: ## ### Interactive Model Plot - Celebrity Professions ```{r celeb profession model plot, echo=FALSE} # create linear model with categories an interaction celeb_prof_lm <- lm(Earnings ~ Age + Profession + Age*Profession, data=celeb_prof) # create interactive plot of model # copy and paste this into console (below) to see plot in viewer (clb_int <- ggPredict(celeb_prof_lm, interactive = T)) ``` ## ### Regression Model - Celebrity Professions Now that we 
understand the data and linear trends, we can examine and interpret the regression model output. ::: r-fit-text ```{r celeb_prof regression} (celeb_interaction_ols<- ols_regress(Earnings ~ Age + Profession + Age*Profession, data=celeb_prof, iterm = T)) ``` ::: ## ### Model Interpretation - Interpreting model coefficients (betas) - **Baseline category is first alphabetically** - Actor comes before Athlete in the alphabet so **Actor is the baseline category**. - \*\*Actor SLR Model: - `Earnings = -50.297 + 1.824*Age`\*\* - **`ProfessionAthlete` term:** Difference from Actor (baseline) model Intercept - **`Age:ProfessionAthlete` term:** Difference from Actor model Slope - Athlete SLR Model requires some calculations: - `Earnings = -50.297 + 1.824*Age + 227.218 - 5.063*Age` - `Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age` - `Earnings = 176.921 - ____*Age` - ***Athlete SLR Model:*** `Earnings = 176.921 - ____*Age` ## ### Lecture 14 In-class Exercises - Q2 **What is the slope term (estimated `beta` for Age) for the Athlete SLR model?** Specify answer to two decimal places. #### Abridged Output ![](img/clb_prof_int_mlr.png){fig-align="center"} - **`ProfessionAthlete` term:** Difference from Actor (baseline) model Intercept - **`Age:ProfessionAthlete` term:** Difference from Actor model Slope - `Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age` - ***Athlete SLR Model:*** `Earnings = 176.921 - ____*Age` ## Determining Statistical Significance - In this case, we don't need to examine the P-values because the model differences between groups are so clear (but we will). - The two intercepts and two slopes are **VERY** different. 
- Reminder of Hypothesis Testing concepts: - **The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.** - If this sentence is not clear to you, **you are responsible for reviewing the Review Materials** on Hypothesis Tests and Significance Tests (and other related topics): - [**MAS 261 Lecture Slides and Notes**](https://penelope2040.quarto.pub/mas-261/#most-recent-lecture-material){target="_blank"} - [**Khan Academy - The Idea of Significance tests**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/v/simple-hypothesis-testing){target="_blank"} - [**Khan Academy - Comparing a P-value to a Significance level**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/comparing-p-value-from-t-statistic-to-significance-level?modal=1){target="_blank"} ## ### Conclusions from Actor and Athlete Model #### Abridged Output ![](img/clb_prof_int_mlr.png){fig-align="center"} - **P-value for difference in intercepts (`ProfessionAthlete`):** \< 0.001 - Actors SLR model and Athletes SLR model intercepts are significantly different. - **P-value for difference in slopes (`Age:ProfessionAthlete`):** \< 0.001 - Actors SLR model and Athletes SLR model slopes are significantly different. - **These interpretations are in agreement with what we can easily see in the Interactive Model Plot.** ## Movie Genres and Costs - Is length of a movie (`Runtime`) a good predictor of the movie budget? - Does the relationship between movie length and budget differ by movie genre? 
::: fragment #### Import Movie Data ```{r import movies data, echo=T} movies <- read_csv("data/movies.csv", show_col_types = F) # Import and examine data head(movies, 4) |> kable() movies |> select(Genre) |> table() # use table to examine categories in the data ``` ::: ## Examine Correlations in Movie Data **Note:** If categories have different slopes, correlations for whole dataset will be misleading. ```{r correlations movie data, echo=T} movies |> select(Budget, Runtime) |> cor() |> round(2) # all data movies |> filter(Genre == "Action") |> # action movies select(Budget, Runtime) |> cor() |> round(2) movies |> filter(Genre == "Suspense / Horror") |> # suspense/horror movies select(Budget, Runtime) |> cor() |> round(2) ``` ## Explore and Plot Data ::::: columns ::: {.column width="40%"} - Scatter plot shows that a regression model should be created with - Different intercepts for each genre - Different slopes for each genre ```{r exploratory scatter plot code for movie data, echo=F} movie_scrtplot <- movies |> ggplot() + geom_point(aes(x=Runtime, y=Budget, color=Genre), size=4) + theme_classic() + labs(x="Run time (Movie Length)") + theme(axis.title = element_text(size=18), axis.text = element_text(size=15), legend.text = element_text(size=10), plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)) ``` ::: ::: {.column width="60%"} ```{r show movie scatterplot, fig.dim=c(7,6), echo=F} movie_scrtplot ``` ::: ::::: ## Interactive Model Plot - Movie Data ```{r movie model plot, echo=F} # create linear model with categories an interaction movie_cat_lm <- lm(Budget ~ Runtime + Genre + Runtime*Genre, data=movies) # create interactive plot of model # copy and paste this into console (below) to see plot in viewer (movie_int <- ggPredict(movie_cat_lm, interactive = T)) ``` ## ### Movie Genres Regression Model Again, we can examine and interpret the regression model output. 
```{r movie regression, results='hide'} (movie_interaction_ols<- ols_regress(Budget ~ Runtime + Genre + Runtime*Genre, data=movies, iterm = T)) ``` #### Abridged Output ![](img/movie_int_mlr.png){fig-align="center"} ## ### Lecture 14 In-class Exercises - Q3 **What is the intercept term (estimated `beta`) for the Suspense / Horror SLR model?** - Specify answer to two decimal places. ::: fragment #### Abridged Output ![](img/movie_int_mlr.png){fig-align="center"} ::: - **`GenreSuspense / Horror` term:** Difference from Action (baseline) model Intercept - **`Runtime:GenreSuspense / Horror term`:** Difference from Action model Slope - `Budget = (-286.67 + 218.75) + (3.25 - 2.35)*Runtime` - ***Suspense / Horror SLR Model:*** `Budget = ____ - 0.9*Runtime` ## Determining Statistical Significance - In this case, we don't need to examine the P-values because the model differences between groups are so clear (but we will). - The two intercepts and two slopes are **VERY** different. - Reminder of Hypothesis Testing concepts: - **The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.** - If this sentence is not clear to you, **you are responsible for reviewing the Review Materials** on Hypothesis Tests and Significance Tests (and other related topics): - [**MAS 261 Lecture Slides and Notes**](https://penelope2040.quarto.pub/mas-261/#most-recent-lecture-material){target="_blank"} - [**Khan Academy - The Idea of Significance tests**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/v/simple-hypothesis-testing){target="_blank"} - [**Khan Academy - Comparing a P-value to a Significance level**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/comparing-p-value-from-t-statistic-to-significance-level?modal=1){target="_blank"} ## ### Conclusions from Movie Genre Model #### Abridged Output 
![](img/movie_int_mlr.png){fig-align="center"}

- **P-value for diff. in intercepts (`GenreSuspense / Horror`):** \< 0.001

  - Intercepts for these two distinct genre SLR models are `_____` (Next Question).

- **P-value for diff. in slopes (`Runtime:GenreSuspense / Horror` term):** \< 0.001

  - Slopes for these two distinct genre SLR models are `_____` (Next Question).

- **These interpretations are in agreement with what we can easily see in the Interactive Model Plot.**

## 

### Lecture 14 In-class Exercises - Q4

#### Abridged Output

![](img/movie_int_mlr.png){fig-align="center"}

**The difference in intercepts and the difference in slopes between the model for Action movies and the model for Suspense / Horror movies are both `_____`.**

::: nonincremental
A.  statistically significant

B.  statistically insignificant
:::

## 

### Lecture 14 In-class Exercises - Q5

Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.

<br>

Fill in the blank:

**The smaller the P-value, the `_____` evidence there is that the Beta coefficient is non-zero and the term is useful to the model.**

## 

### HW 6 - Questions 7 - 16

- Dataset has three categories of Diamonds:

  - `Colorless`, `Faint yellow`, and `Nearly colorless`

  - **Colorless** is first alphabetically, so it is the **baseline category** by default.

- Each color category has a unique intercept AND a unique slope.

- The interactive model plot and abridged regression output are provided.

- All Blackboard questions can be answered by rendering the .qmd file and examining the .html output.

- **Helpful TIP:** In addition to other recommended options, change preview option (see next slide).

## 

### HW 6 - Change HTML Preview Option

- For HW 6 you do not have to write any R code.

- Instead you are expected to correctly interpret provided output.

- Quiz 2 will have similar output **WITHOUT the interactive plots**.
- Change the following option in the `Basic` tab of the `R Markdown` options:

::: fragment
![Show output preview in **Viewer Pane**](img/show_output%20_viewer_pane.png){fig-align="center"}
:::

## Looking Ahead - What's Next?

- This week, ALL of the categorical models could be simplified to multiple SLRs, with the same or different slopes.

- ALL of the variables have had P-values less than 0.05, so the terms were all useful.

- **There are many, many model options where these two facts are not true.**

- Time permitting, here's a brief look at a dataset with many explanatory variables:

::: fragment
```{r brief look at another dataset}
insurance <- read_csv("data/Insurance.csv", show_col_types=F)
head(insurance, 3) |> kable()
```
:::

## 

### Insurance Data Model and Variable Selection

- There are 3 quantitative variables:

  - **Age, BMI, and Children**

- There are 3 categorical variables:

  - **Sex, Smoker, Region**

- There are literally hundreds of possible models including interaction terms.

  - Note that an interaction can also be between two quantitative variables.

  - You can also have interaction terms with three variables (but I try to avoid those).

- How do we sort through all of the possible options?

  - Software helps us pare down all the possible models to a few choices.

  - Analyst then uses critical thinking and examination of data to determine final model.

- Model and variable selection methods are the next set of topics.

## 

### Key Points from Today

- **Categorical Interaction Model**

  - Separate SLR for each group.
  - BOTH slopes and intercepts can differ by category.
  - We can test if the interaction term (slope difference) is significant.

- **Next Topics**

  - Comparing model goodness of fit
  - Introduction to variable selection

<br>

- **HW 6 is now available and is due on Wed. 3/5.**

- **Date of Quiz 2 has been changed to Tuesday, 4/1.**

::: fragment
**To submit an Engagement Question or Comment about material from Lecture 14:** Submit it by midnight today (day of lecture).
:::
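
**Optional reference: how one interaction model becomes separate SLRs.** The sketch below writes out the algebra behind "separate SLR for each group." The notation is an assumption introduced here for illustration and does not appear in the regression output: $x$ is the quantitative predictor, $D$ is a 0/1 indicator for the non-baseline category, and the $\hat\beta$ symbols are just labels for the four estimated terms.

$$
\begin{aligned}
\text{Full model:}\quad \hat{y} &= \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 D + \hat\beta_3 (x \cdot D) \\
\text{Baseline category } (D = 0):\quad \hat{y} &= \hat\beta_0 + \hat\beta_1 x \\
\text{Other category } (D = 1):\quad \hat{y} &= (\hat\beta_0 + \hat\beta_2) + (\hat\beta_1 + \hat\beta_3)\,x
\end{aligned}
$$

In the movie model, $x$ is `Runtime`, the baseline is Action, $\hat\beta_2$ corresponds to the `GenreSuspense / Horror` term (difference in intercepts), and $\hat\beta_3$ corresponds to the `Runtime:GenreSuspense / Horror` term (difference in slopes). With three categories, as in the HW 6 diamonds data, each non-baseline category gets its own indicator and its own pair of difference terms.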