BUA 345 - Lecture 14

Categorical Regression - Interaction Model

Penelope Pooler Eisenbies

2025-02-27

Housekeeping

HW 5 was due 2/26/2025 - 2 day grace period

HW 6 is due 3/5/2025 - 2 day grace period

Demo videos will be posted this weekend.

Quiz 2 will be on 4/1/2025 - Date has changed and syllabus has been updated.

Today’s plan 📋

Review Parallel Lines Model
Introduce Interaction term and Interaction Model
Work through how to interpret model output
Introduce HW 6
Talk about next steps

In-class Polling (Session ID: bua345s25)

Review Question - Import data

library(tidyverse)  # read_csv(), select(), filter()
library(knitr)      # kable()
house_remodel <- read_csv("data/house_remodel.csv", show_col_types = FALSE)
head(house_remodel, 3) |> kable()

Price	Square_Feet	Remodeled
554000	2702	No
484000	2378	No
391000	1846	No

house_remodel |> select(Remodeled) |> table()                     # number of obs by category

Remodeled
 No Yes 
 29  28

house_remodel |> select(Price, Square_Feet) |> cor() |> round(2)  # correlation of price & sq. ft.

            Price Square_Feet
Price        1.00        0.75
Square_Feet  0.75        1.00

💥 Lecture 14 In-class Exercises - Q1 - Review 💥

Based on the Parameter Estimates table for the specified categorical regression model, which category is the baseline category?

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.924       RMSE                    32079.481 
R-Squared                   0.854       MSE                1029093133.260 
Adj. R-Squared              0.848       Coef. Var                   8.223 
Pred R-Squared              0.836       AIC                      1352.620 
MAE                     28012.693       SBC                      1360.792 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                      ANOVA                                        
----------------------------------------------------------------------------------
                        Sum of                                                    
                       Squares        DF         Mean Square       F         Sig. 
----------------------------------------------------------------------------------
Regression    341892090457.686         2    170946045228.843     157.37    0.0000 
Residual       58658308595.823        54      1086264973.997                      
Total         400550399053.509        56                                          
----------------------------------------------------------------------------------

                                         Parameter Estimates                                          
-----------------------------------------------------------------------------------------------------
       model          Beta    Std. Error    Std. Beta      t        Sig          lower         upper 
-----------------------------------------------------------------------------------------------------
 (Intercept)    137549.093     17620.447                  7.806    0.000    102222.224    172875.963 
 Square_Feet       137.879        10.836        0.670    12.725    0.000       116.155       159.602 
RemodeledYes     90917.216      8834.268        0.542    10.291    0.000     73205.575    108628.858 
-----------------------------------------------------------------------------------------------------

A Comment About Formatted Output

In HW 6 and below I use R code to format the output to make it easier to read.
The values are IDENTICAL to the unformatted output.
Note: Formatted output will differ in appearance depending on where it is viewed, i.e. slides, html file, or .qmd file.

Formatted Abridged Output (Similar to HW 6)

model	Beta	Std.Error	t
(Intercept)	137549.09	17620.45	7.81
Square_Feet	137.88	10.84	12.72
RemodeledYes	90917.22	8834.27	10.29

Quick Review of Categorical Regression

On Tuesday we covered the Parallel Lines Model:
- A Parallel Lines model has two X variables, one quantitative and one categorical variable.
- Model estimates a separate SLR model for each category in the categorical variable.
- Model assumes all categories have the same SLOPE.
- Model estimates a separate INTERCEPT for each category.
- Model output shows results of a hypothesis test to determine if each non-baseline category’s intercept is significantly different from baseline intercept.
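A minimal sketch of how a parallel lines model might be fit in base R. The data frame `toy` is made up for illustration (it is NOT the lecture dataset), and the formula is an assumption inferred from the term names in the Parameter Estimates table; the lecture's own fitting code is not shown here.

```r
# Made-up data for illustration only (NOT the house_remodel dataset)
toy <- data.frame(
  Square_Feet = c(1500, 1800, 2100, 2400, 1600, 1900, 2200, 2500),
  Remodeled   = rep(c("No", "Yes"), each = 4)
)
noise <- c(-3000, 2000, 1000, -1500, 2500, -2000, -1000, 3000)
toy$Price <- 140000 + 140 * toy$Square_Feet +
  90000 * (toy$Remodeled == "Yes") + noise

# One quantitative X plus one categorical X, no interaction term:
# a common slope for both groups and a separate intercept for each
fit <- lm(Price ~ Square_Feet + Remodeled, data = toy)
coef(fit)
```

The `RemodeledYes` coefficient is the estimated shift in intercept for the non-baseline category; the `Square_Feet` coefficient is the slope shared by both lines.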

Interactive Plot of House Remodel Data

Calculations from House Model

By default R chooses baseline categories alphabetically
- No is before Yes so un-Remodeled houses are the baseline
- Un-Remodeled (No) SLR Model:
  - Est. Price = 137549.093 + 137.879 * Square_Feet
- Remodeled (Yes) SLR Model:
  - Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216
  - Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet
  - Est. Price = 228466.3 + 137.879 * Square_Feet
Interpretation:
- Prices of remodeled houses are about 91 thousand dollars more than similar houses without remodeling, after accounting for square footage.
- This difference is statistically significant (P-value < 0.001)
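The calculations above can be checked with a few lines of R. The coefficient values are copied from the Parameter Estimates table; the helper `est_price()` and the 2000 sq. ft. house are hypothetical, used only to illustrate the arithmetic.

```r
# Coefficients copied from the Parameter Estimates table in this lecture
b0     <- 137549.093   # (Intercept): intercept of the baseline (No) line
b_sqft <- 137.879      # Square_Feet: slope shared by both lines
b_yes  <- 90917.216    # RemodeledYes: shift in intercept for Yes

# Hypothetical helper: estimated price for a given size and remodel status
est_price <- function(square_feet, remodeled) {
  b0 + b_sqft * square_feet + b_yes * (remodeled == "Yes")
}

b0 + b_yes                                      # intercept of the Remodeled (Yes) line
est_price(2000, c("No", "Yes"))                 # estimates for a 2000 sq. ft. house
est_price(2000, "Yes") - est_price(2000, "No")  # equals b_yes at any square footage
```

Because the lines are parallel, the gap between the two estimates is the same at every square footage.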

HW 6 - Questions 1 - 6

This part of HW 6 examines data similar to the House-Remodel data examined in Lecture 13 and the review question.
The dataset is smaller and the numbers are different, but the questions are essentially the same.

Categorical Regression with Interactions

The categorical models covered so far assume that the SLR models for all categories have the same slope.
How do we examine that assumption?
For example:
- In the celebrity data in Lecture 13, earnings decreased as celebrities got older.
- Slope was assumed to be IDENTICAL for both males and females
- That may not be true for all celebrities.
In the following small dataset, we will look at male celebrities only and examine whether actors' and athletes' earnings follow the same trend.

Import and Examine Celebrity Profession Data

# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types = FALSE)
head(celeb_prof) |> kable()

Celebrity	Earnings	Age	Profession
Jim Parsons	29	44	Actor
Johnny Depp	48	53	Actor
Tom Cruise	53	55	Actor
Leonardo Dicaprio	29	43	Actor
Jackie Chan	61	62	Actor
Mark Wahlberg	32	45	Actor

# use table to summarize data by category
celeb_prof |> select(Profession) |> table()

Profession
  Actor Athlete 
      8       8

Examine Correlations in Celebrity Professions Data

Note: If categories have different slopes, correlations for whole dataset will be misleading.

celeb_prof |> select(Earnings, Age) |> cor() |> round(2) # all data

         Earnings   Age
Earnings     1.00 -0.46
Age         -0.46  1.00

celeb_prof |> filter(Profession=="Actor") |>             # actors only
  select(Earnings, Age) |> cor() |> round(2)

         Earnings  Age
Earnings     1.00 0.99
Age          0.99 1.00

celeb_prof |> filter(Profession=="Athlete") |>           # athletes only
  select(Earnings, Age) |> cor() |> round(2)

         Earnings   Age
Earnings     1.00 -0.98
Age         -0.98  1.00
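To see why a pooled correlation can mislead, here is a small base R sketch with made-up numbers (two hypothetical groups, A and B, with exactly opposite trends). It is an extreme case chosen for clarity, not the celebrity data.

```r
# Made-up data: two hypothetical groups with opposite linear trends
toy <- data.frame(
  Age      = c(30, 40, 50, 60, 30, 40, 50, 60),
  Earnings = c(10, 20, 30, 40, 40, 30, 20, 10),
  Group    = rep(c("A", "B"), each = 4)
)

# Correlation ignoring the groups
pooled <- cor(toy$Age, toy$Earnings)

# Correlation computed separately within each group
by_group <- sapply(split(toy, toy$Group),
                   function(d) cor(d$Age, d$Earnings))

pooled
by_group
```

Within each group the relationship is perfectly linear, but the opposite directions cancel when the groups are combined.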

Explore and Plot Data

Scatter plot shows that a regression model should be created with
- Different intercepts for each profession
- Different slopes for each profession
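A minimal sketch of how a model with different intercepts AND different slopes might be specified in base R. The data frame `toy` is made up (it is NOT the celeb_prof dataset), and the formula is an assumption inferred from the term names in the regression output for this example.

```r
# Made-up data for illustration only (NOT the celeb_prof dataset)
toy <- data.frame(
  Age        = c(40, 45, 50, 55, 60, 25, 28, 31, 34, 37),
  Profession = rep(c("Actor", "Athlete"), each = 5)
)
noise <- c(1, -1, 0.5, -0.5, 0, -1, 1, 0, 0.5, -0.5)
toy$Earnings <- ifelse(toy$Profession == "Actor",
                       -50 + 1.8 * toy$Age,
                       175 - 3.2 * toy$Age) + noise

# Age * Profession expands to Age + Profession + Age:Profession,
# so each profession gets its own intercept and its own slope
fit <- lm(Earnings ~ Age * Profession, data = toy)
names(coef(fit))
```

The `Age:Profession...` term is the interaction term: it estimates how much the slope for the non-baseline category differs from the baseline slope.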

Interactive Model Plot - Celebrity Professions

Regression Model - Celebrity Professions

Now that we understand the data and linear trends, we can examine and interpret the regression model output.

                        Model Summary                          
--------------------------------------------------------------
R                       0.987       RMSE                2.640 
R-Squared               0.974       MSE                 6.968 
Adj. R-Squared          0.967       Coef. Var           6.058 
Pred R-Squared          0.951       AIC                86.467 
MAE                     2.265       SBC                90.330 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4131.949         3       1377.316    148.246    0.0000 
Residual       111.489        12          9.291                      
Total         4243.437        15                                     
---------------------------------------------------------------------

                                         Parameter Estimates                                           
------------------------------------------------------------------------------------------------------
                model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
------------------------------------------------------------------------------------------------------
          (Intercept)    -50.297         9.054                  -5.555    0.000    -70.023    -30.571 
                  Age      1.824         0.179        0.983     10.170    0.000      1.433      2.215 
    ProfessionAthlete    227.218        12.389        0.263     18.340    0.000    200.224    254.212 
Age:ProfessionAthlete     -5.063         0.293       -1.487    -17.294    0.000     -5.701     -4.425 
------------------------------------------------------------------------------------------------------

Model Interpretation - Interpreting model coefficients (betas)

Baseline category is first alphabetically
- Actor comes before Athlete in the alphabet so Actor is the baseline category.
- Actor SLR Model:
  - Earnings = -50.297 + 1.824*Age
ProfessionAthlete term: Difference from Actor (baseline) model Intercept
Age:ProfessionAthlete term: Difference from Actor model Slope
- Athlete SLR Model requires some calculations:
  - Earnings = -50.297 + 1.824*Age + 227.218 - 5.063*Age
  - Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age
  - Earnings = 176.921 - ____*Age
  - Athlete SLR Model: Earnings = 176.921 - ____*Age
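The same calculations can be done in R. The coefficient values are copied from the Parameter Estimates table; the object names (`b`, `actor`, `athlete`) are hypothetical. Run it to check your answer to the next question.

```r
# Coefficients copied from the Parameter Estimates table in this lecture
b <- c("(Intercept)"           = -50.297,
       "Age"                   = 1.824,
       "ProfessionAthlete"     = 227.218,
       "Age:ProfessionAthlete" = -5.063)

# Baseline (Actor) line uses the intercept and Age terms as they are
actor <- c(intercept = b[["(Intercept)"]],
           slope     = b[["Age"]])

# Athlete line adds the two difference terms to the baseline values
athlete <- c(intercept = b[["(Intercept)"]] + b[["ProfessionAthlete"]],
             slope     = b[["Age"]] + b[["Age:ProfessionAthlete"]])

rbind(actor, athlete)
```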

💥 Lecture 14 In-class Exercises - Q2 💥

What is the slope term (estimated beta for Age) for the Athlete SLR model?

Specify answer to two decimal places.

Abridged Output

ProfessionAthlete term: Difference from Actor (baseline) model Intercept
Age:ProfessionAthlete term: Difference from Actor model Slope
- Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age
- Athlete SLR Model: Earnings = 176.921 - ____*Age

Determining Statistical Significance

In this case, we don’t need to examine the P-values because the model differences between groups are so clear (but we will).
The two intercepts and two slopes are VERY different.
Reminder of Hypothesis Testing concepts:
- The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.
- If this sentence is not clear to you, you are responsible for reviewing the Review Materials on Hypothesis Tests and Significance Tests (and other related topics).
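As a reminder of where the t, P-value, and confidence limit columns come from, here is a sketch that recomputes them for the Age:ProfessionAthlete term. The Beta, Std. Error, and residual DF (12) are copied from the output above; because those inputs are rounded, the results only approximately match the table.

```r
# Values copied from the Parameter Estimates and ANOVA tables (rounded)
beta     <- -5.063   # Age:ProfessionAthlete estimate
se       <- 0.293    # its standard error
df_resid <- 12       # residual degrees of freedom

# t statistic: estimate divided by its standard error
t_stat <- beta / se

# Two-sided P-value for H0: true Beta = 0
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for the true Beta
ci <- beta + c(-1, 1) * qt(0.975, df = df_resid) * se

t_stat
p_value
ci
```

The t statistic is far from zero, the P-value is far below 0.05, and the interval does not include zero: three views of the same conclusion.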

Conclusions from Actor and Athlete Model

Abridged Output

P-value for difference in intercepts (ProfessionAthlete): < 0.001
- Actors SLR model and Athletes SLR model intercepts are significantly different.
P-value for difference in slopes (Age:ProfessionAthlete): < 0.001
- Actors SLR model and Athletes SLR model slopes are significantly different.
These interpretations are in agreement with what we can easily see in the Interactive Model Plot.

Movie Genres and Costs

Is length of a movie (Runtime) a good predictor of the movie budget?
Does the relationship between movie length and budget differ by movie genre?

Import Movie Data

movies <- read_csv("data/movies.csv", show_col_types = FALSE)  # Import and examine data
head(movies, 4) |> kable()

Movie	Genre	Budget	Runtime
Paranormal Activity 3	Suspense / Horror	5	83
The Others	Suspense / Horror	17	104
The Lincoln Lawyer	Suspense / Horror	40	118
Fright Night	Suspense / Horror	30	106

movies |> select(Genre) |> table()  # use table to examine categories in the data

Genre
           Action Suspense / Horror 
               10                10

Examine Correlations in Movie Data

Note: If categories have different slopes, correlations for whole dataset will be misleading.

movies |> select(Budget, Runtime) |> cor() |> round(2) # all data

        Budget Runtime
Budget    1.00    0.76
Runtime   0.76    1.00

movies |> filter(Genre == "Action") |>                 # action movies
  select(Budget, Runtime) |> cor() |> round(2)

        Budget Runtime
Budget    1.00    0.95
Runtime   0.95    1.00

movies |> filter(Genre == "Suspense / Horror") |>      # suspense/horror movies
  select(Budget, Runtime) |> cor() |> round(2)

        Budget Runtime
Budget    1.00    0.99
Runtime   0.99    1.00

Explore and Plot Data

Scatter plot shows that a regression model should be created with
- Different intercepts for each genre
- Different slopes for each genre

Interactive Model Plot - Movie Data

Movie Genres Regression Model

Again, we can examine and interpret the regression model output.

Abridged Output

💥 Lecture 14 In-class Exercises - Q3 💥

What is the intercept term (estimated beta) for the Suspense / Horror SLR model?

Specify answer to two decimal places.

Abridged Output

GenreSuspense / Horror term: Difference from Action (baseline) model Intercept
Runtime:GenreSuspense / Horror term: Difference from Action model Slope
- Budget = (-286.67 + 218.75) + (3.25 - 2.35)*Runtime
- Suspense / Horror SLR Model: Budget = ____ + 0.9*Runtime
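Once both SLR models are known, they can be used for estimation. This sketch assumes the four values in the calculation above are the (Intercept), Runtime, GenreSuspense / Horror, and Runtime:GenreSuspense / Horror estimates, in that order. The helper `est_budget()` and the 120-minute runtime are hypothetical.

```r
# Values taken from the calculation above (assumed mapping to model terms)
b0       <- -286.67   # (Intercept): Action (baseline) intercept
b_run    <- 3.25      # Runtime: Action (baseline) slope
b_sh     <- 218.75    # GenreSuspense / Horror: difference in intercept
b_run_sh <- -2.35     # Runtime:GenreSuspense / Horror: difference in slope

# Hypothetical helper: estimated budget for a given runtime and genre
est_budget <- function(runtime, genre) {
  is_sh <- genre == "Suspense / Horror"
  (b0 + b_sh * is_sh) + (b_run + b_run_sh * is_sh) * runtime
}

est_budget(120, "Action")
est_budget(120, "Suspense / Horror")
```

Unlike the parallel lines model, the gap between the two estimates changes with Runtime because the slopes differ.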

Determining Statistical Significance

In this case, we don’t need to examine the P-values because the model differences between groups are so clear (but we will).
The two intercepts and two slopes are VERY different.
Reminder of Hypothesis Testing concepts:
- The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.
- If this sentence is not clear to you, you are responsible for reviewing the Review Materials on Hypothesis Tests and Significance Tests (and other related topics).

Conclusions from Movie Genre Model

Abridged Output

P-value for diff. in intercepts (GenreSuspense / Horror): < 0.001
- Intercepts for these two distinct genre SLR models are _____ (Next Question).
P-value for diff. in slopes (Runtime:GenreSuspense / Horror term): < 0.001
- Slopes for these two distinct genre SLR models are _____ (Next Question).
These interpretations are in agreement with what we can easily see in the Interactive Model Plot.

💥 Lecture 14 In-class Exercises - Q4 💥

Abridged Output

The difference in intercepts and the difference in slopes between the model for Action movies and the model for Suspense / Horror movies are both _____.

A. statistically significant

B. statistically insignificant

💥 Lecture 14 In-class Exercises - Q5 💥

Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.

Fill in the blank:

The smaller the P-value, the _____ evidence there is that the Beta coefficient is non-zero and the term is useful to the model.

HW 6 - Questions 7 - 16

Dataset has three categories of Diamonds:
- Colorless, Faint yellow, and Nearly colorless
Colorless is first alphabetically so that is the baseline category by default.
- Each color category has unique intercept AND a unique slope.
- The interactive model plot and abridged regression output are provided.
- All Blackboard questions can be answered by rendering .qmd file to examine .html output.
Helpful TIP: In addition to other recommended options, change preview option (see next slide).
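HW 6 does not require writing code, but the default-baseline rule is easy to verify in R. This sketch uses a made-up vector of the three color labels (the name `color` is hypothetical, not the HW 6 variable name) and also shows how a different baseline could be chosen with `relevel()`.

```r
# Made-up vector using the three category labels from HW 6
color <- factor(c("Faint yellow", "Colorless", "Nearly colorless", "Colorless"))

# Levels are sorted alphabetically; the FIRST level is the default baseline
levels(color)

# Choosing a different baseline (illustration only, not needed for HW 6)
color2 <- relevel(color, ref = "Nearly colorless")
levels(color2)
```

With three categories, the model output has two intercept-difference terms and two slope-difference terms, each comparing a non-baseline color to the baseline.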

HW 6 - Change HTML Preview Option

For HW 6 you do not have to write any R code.
Instead you are expected to correctly interpret provided output.
Quiz 2 will have similar output WITHOUT the interactive plots.
Change the following option in the Basic tab of the R Markdown options:

Show output preview in: Viewer Pane

Looking Ahead - What’s Next?

This week, ALL of the categorical models could be simplified to multiple SLRs, with same or different slopes.
ALL of the variables have had P-values less than 0.05 so the terms were all useful.
There are many, many model options where these two facts are not true.
Time permitting, here’s a brief look at a dataset with many explanatory variables:

Charges	ln_charges	Age	Sex	BMI	Children	Smoker	Region
16884.924	9.734176	19	female	27.90	0	yes	southwest
1725.552	7.453303	18	male	33.77	1	no	southeast
4449.462	8.400538	28	male	33.00	3	no	southeast

Insurance Data Model and Variable Selection

There are 3 quantitative variables:
- Age, BMI, and Children
There are 3 categorical variables:
- Sex, Smoker, Region
There are literally hundreds of possible models including interaction terms.
Note that an interaction can also be between two quantitative variables.
You can also have interaction terms with three variables (but I try to avoid those).
How do we sort through all of the possible options?
- Software helps us pare down all the possible models to a few choices.
- Analyst then uses critical thinking and examination of data to determine final model.
Model and variable selection methods are the next set of topics.

Key Points from Today

Categorical Interaction Model
- Separate SLR for each group.
- BOTH slopes and intercepts can differ by category
- We can test if interaction term (slope difference) is significant.
Next Topics
- Comparing model goodness of fit
- Introduction to variable selection

HW 6 is now available and is due on Wed. 3/5.
Date of Quiz 2 has been changed to Tuesday, 4/1.

To submit an Engagement Question or Comment about material from Lecture 14: Submit it by midnight today (day of lecture).