BUA 345 - Lecture 14

Categorical Regression - Interaction Model

Author

Penelope Pooler Eisenbies

Published

February 27, 2025

Housekeeping

HW 5 was due 2/26/2025 - 2 day grace period

HW 6 is due 3/5/2025 - 2 day grace period

  • Demo videos will be posted this weekend.

Quiz 2 will be on 4/1/2025 - Date has changed and syllabus has been updated.

Today’s plan

  • Review Parallel Lines Model

  • Introduce Interaction term and Interaction Model

  • Work through how to interpret model output

  • Introduce HW 6

  • Talk about next steps

In-class Polling (Session ID: bua345s25)

Review Question - Import data

```{r import and examine house remodel data, echo=T}
house_remodel <- read_csv("data/house_remodel.csv", show_col_types = FALSE)
head(house_remodel, 3) |> kable()

house_remodel |> select(Remodeled) |> table()                     # number of obs by category
house_remodel |> select(Price, Square_Feet) |> cor() |> round(2)  # correlation of price & sq. ft.
```
| Price  | Square_Feet | Remodeled |
|-------:|------------:|:----------|
| 554000 | 2702        | No        |
| 484000 | 2378        | No        |
| 391000 | 1846        | No        |

Remodeled
 No Yes 
 29  28 
            Price Square_Feet
Price        1.00        0.75
Square_Feet  0.75        1.00

Lecture 14 In-class Exercises - Q1 - Review

Based on the Parameter Estimates table for the specified categorical regression model, which category is the baseline category?

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.924       RMSE                    32079.481 
R-Squared                   0.854       MSE                1029093133.260 
Adj. R-Squared              0.848       Coef. Var                   8.223 
Pred R-Squared              0.836       AIC                      1352.620 
MAE                     28012.693       SBC                      1360.792 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                      ANOVA                                        
----------------------------------------------------------------------------------
                        Sum of                                                    
                       Squares        DF         Mean Square       F         Sig. 
----------------------------------------------------------------------------------
Regression    341892090457.686         2    170946045228.843     157.37    0.0000 
Residual       58658308595.823        54      1086264973.997                      
Total         400550399053.509        56                                          
----------------------------------------------------------------------------------

                                         Parameter Estimates                                          
-----------------------------------------------------------------------------------------------------
       model          Beta    Std. Error    Std. Beta      t        Sig          lower         upper 
-----------------------------------------------------------------------------------------------------
 (Intercept)    137549.093     17620.447                  7.806    0.000    102222.224    172875.963 
 Square_Feet       137.879        10.836        0.670    12.725    0.000       116.155       159.602 
RemodeledYes     90917.216      8834.268        0.542    10.291    0.000     73205.575    108628.858 
-----------------------------------------------------------------------------------------------------

A Comment About Formatted Output

  • In HW 6 and below I use R coding to format the output to make it easier to read.

  • The values are IDENTICAL to the unformatted output.

  • Note: Formatted output will differ in appearance depending on where it is viewed, e.g., slides, HTML file, or .qmd file.

Formatted Abridged Output (Similar to HW 6)

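| model        | Beta      | Std.Error | t     | Sig |
|:-------------|----------:|----------:|------:|----:|
| (Intercept)  | 137549.09 | 17620.45  | 7.81  | 0   |
| Square_Feet  | 137.88    | 10.84     | 12.72 | 0   |
| RemodeledYes | 90917.22  | 8834.27   | 10.29 | 0   |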

Quick Review of Categorical Regression

  • On Tuesday we covered the Parallel Lines Model:

    • A Parallel Lines model has two X variables, one quantitative and one categorical variable.

    • Model estimates a separate SLR model for each category in the categorical variable.

    • Model assumes all categories have the same SLOPE.

    • Model estimates a separate INTERCEPT for each category.

    • Model output shows results of a hypothesis test to determine if each non-baseline category’s intercept is significantly different from baseline intercept.

Interactive Plot of House Remodel Data

Calculations from House Model

  • By default R chooses baseline categories alphabetically

    • No is before Yes so un-Remodeled houses are the baseline

    • Un-Remodeled (No) SLR Model:

      • Est. Price = 137549.093 + 137.879 * Square_Feet
    • Remodeled (Yes) SLR Model:

      • Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216

      • Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet

      • Est. Price = 228466.3 + 137.879 * Square_Feet

  • Interpretation:

    • Remodeled houses are priced about 91 thousand dollars higher, on average, than un-remodeled houses of the same square footage.

    • This difference is statistically significant (P-value < 0.001)
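The arithmetic above can be checked in a few lines of R. This is a minimal sketch that copies the three estimates from the Parameter Estimates table; the names `b` and `est_price` and the 2,000 and 3,000 square-foot example houses are made up for illustration.

```r
# Estimates copied from the Parameter Estimates table (baseline = No)
b <- c(intercept = 137549.093, sq_ft = 137.879, remodeled_yes = 90917.216)

# Estimated price for a house of a given size; remodeled is TRUE or FALSE
est_price <- function(square_feet, remodeled) {
  unname(b["intercept"] + b["sq_ft"] * square_feet + b["remodeled_yes"] * remodeled)
}

# Hypothetical 2,000 sq. ft. house, without and with remodeling
price_no  <- est_price(2000, FALSE)
price_yes <- est_price(2000, TRUE)

# Same slope for both groups, so the gap equals the RemodeledYes estimate
# at every house size
price_yes - price_no
est_price(3000, TRUE) - est_price(3000, FALSE)
```

Because the lines are parallel, the last two results are identical: the remodeling difference does not depend on square footage in this model.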

HW 6 - Questions 1 - 6

  • This part of HW 6 examines data similar to the House-Remodel data examined in Lecture 13 and the review question.

  • The dataset is smaller and the numbers are different, but the questions are essentially the same.

Categorical Regression with Interactions

  • The categorical models covered so far assume that the SLR models for all categories have the same slope.

  • How do we examine that assumption?

  • For example:

    • In the celebrity data in Lecture 13, earnings decreased as celebrities got older.

    • Slope was assumed to be IDENTICAL for both males and females

    • That may not be true for all celebrities.

  • In the following small dataset, we will look at male celebrities only and examine whether actors’ and athletes’ earnings follow the same trend.

Import and Examine Celebrity Profession Data

```{r import and examine celeb_prof data, echo=T}
# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types = FALSE)
head(celeb_prof) |> kable()

# use table to summarize data by category
celeb_prof |> select(Profession) |> table()
```
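| Celebrity         | Earnings | Age | Profession |
|:------------------|---------:|----:|:-----------|
| Jim Parsons       | 29       | 44  | Actor      |
| Johnny Depp       | 48       | 53  | Actor      |
| Tom Cruise        | 53       | 55  | Actor      |
| Leonardo Dicaprio | 29       | 43  | Actor      |
| Jackie Chan       | 61       | 62  | Actor      |
| Mark Wahlberg     | 32       | 45  | Actor      |
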
Profession
  Actor Athlete 
      8       8 

Examine Correlations in Celebrity Professions Data

Note: If categories have different slopes, correlations for whole dataset will be misleading.

Code
celeb_prof |> select(Earnings, Age) |> cor() |> round(2) # all data
         Earnings   Age
Earnings     1.00 -0.46
Age         -0.46  1.00
Code
celeb_prof |> filter(Profession=="Actor") |>             # actors only
  select(Earnings, Age) |> cor() |> round(2)
         Earnings  Age
Earnings     1.00 0.99
Age          0.99 1.00
Code
celeb_prof |> filter(Profession=="Athlete") |>           # athletes only
  select(Earnings, Age) |> cor() |> round(2)
         Earnings   Age
Earnings     1.00 -0.98
Age         -0.98  1.00
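
The same pattern can be reproduced with a tiny example. The sketch below uses made-up toy data (not the celebrity data), with placeholder names `toy`, `x`, `y`, and `group`, to show how two strong but opposite trends can produce a weak correlation when the groups are pooled.

```r
# Made-up toy data (not the celebrity data): two groups with opposite trends
toy <- data.frame(
  x     = rep(1:5, times = 2),
  y     = c(2, 4, 6, 8, 10,      # group A: y rises with x
            17, 14, 11, 8, 5),   # group B: y falls with x
  group = rep(c("A", "B"), each = 5)
)

cor_all <- cor(toy$x, toy$y)                            # all data pooled
cor_a   <- with(subset(toy, group == "A"), cor(x, y))   # group A only
cor_b   <- with(subset(toy, group == "B"), cor(x, y))   # group B only

round(c(all = cor_all, A = cor_a, B = cor_b), 2)
```

Each group follows a perfect straight line, yet the pooled correlation is weak because the two trends run in opposite directions.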

Explore and Plot Data

  • Scatter plot shows that a regression model should be created with

    • Different intercepts for each profession

    • Different slopes for each profession

Interactive Model Plot - Celebrity Professions

Regression Model - Celebrity Professions

Now that we understand the data and linear trends, we can examine and interpret the regression model output.

                        Model Summary                          
--------------------------------------------------------------
R                       0.987       RMSE                2.640 
R-Squared               0.974       MSE                 6.968 
Adj. R-Squared          0.967       Coef. Var           6.058 
Pred R-Squared          0.951       AIC                86.467 
MAE                     2.265       SBC                90.330 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4131.949         3       1377.316    148.246    0.0000 
Residual       111.489        12          9.291                      
Total         4243.437        15                                     
---------------------------------------------------------------------

                                         Parameter Estimates                                           
------------------------------------------------------------------------------------------------------
                model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
------------------------------------------------------------------------------------------------------
          (Intercept)    -50.297         9.054                  -5.555    0.000    -70.023    -30.571 
                  Age      1.824         0.179        0.983     10.170    0.000      1.433      2.215 
    ProfessionAthlete    227.218        12.389        0.263     18.340    0.000    200.224    254.212 
Age:ProfessionAthlete     -5.063         0.293       -1.487    -17.294    0.000     -5.701     -4.425 
------------------------------------------------------------------------------------------------------
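
A table with these four rows comes from a model formula that includes an interaction term. The lecture does not show the fitting code, so the sketch below is only an assumption about one way to do it: it uses base R `lm()` on made-up toy data, and the names `toy`, `x`, `y`, and `group` are placeholders.

```r
# Made-up toy data: group A follows y = 2x, group B follows y = 20 - 3x
toy <- data.frame(
  x     = rep(1:5, times = 2),
  y     = c(2 * (1:5), 20 - 3 * (1:5)),
  group = rep(c("A", "B"), each = 5)
)

# In a formula, x * group expands to x + group + x:group
fit <- lm(y ~ x * group, data = toy)
round(coef(fit), 3)
```

The four coefficients play the same roles as the four rows of the output: baseline intercept, baseline slope, intercept difference (`groupB`), and slope difference (`x:groupB`).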

Model Interpretation - Interpreting model coefficients (betas)

  • Baseline category is first alphabetically

    • Actor comes before Athlete in the alphabet so Actor is the baseline category.

    • **Actor SLR Model:**

      • **Earnings = -50.297 + 1.824*Age**
  • ProfessionAthlete term: Difference from Actor (baseline) model Intercept

  • Age:ProfessionAthlete term: Difference from Actor model Slope

    • Athlete SLR Model requires some calculations:

      • Earnings = -50.297 + 1.824*Age + 227.218 - 5.063*Age

      • Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age

      • Earnings = 176.921 - ____*Age

      • Athlete SLR Model: Earnings = 176.921 - ____*Age
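
The same calculation can be done in R. This is a small sketch that copies the four estimates from the Parameter Estimates table; the object names `b`, `actor`, and `athlete` are chosen here for illustration.

```r
# Estimates copied from the Parameter Estimates table (baseline = Actor)
b <- c(
  "(Intercept)"           = -50.297,
  "Age"                   =   1.824,
  "ProfessionAthlete"     = 227.218,
  "Age:ProfessionAthlete" =  -5.063
)

# Baseline (Actor) line: use the first two estimates as they are
actor <- c(intercept = b[["(Intercept)"]], slope = b[["Age"]])

# Athlete line: add each difference term to the matching baseline term
athlete <- c(
  intercept = b[["(Intercept)"]] + b[["ProfessionAthlete"]],
  slope     = b[["Age"]]         + b[["Age:ProfessionAthlete"]]
)

round(rbind(actor, athlete), 3)
```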

Lecture 14 In-class Exercises - Q2

What is the slope term (estimated beta for Age) for the Athlete SLR model?

Specify answer to two decimal places.

Abridged Output

  • ProfessionAthlete term: Difference from Actor (baseline) model Intercept

  • Age:ProfessionAthlete term: Difference from Actor model Slope

    • Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age

    • Athlete SLR Model: Earnings = 176.921 - ____*Age

Determining Statistical Significance

Conclusions from Actor and Athlete Model

Abridged Output

  • P-value for difference in intercepts (ProfessionAthlete): < 0.001

    • Actors SLR model and Athletes SLR model intercepts are significantly different.
  • P-value for difference in slopes (Age:ProfessionAthlete): < 0.001

    • Actors SLR model and Athletes SLR model slopes are significantly different.
  • These interpretations are in agreement with what we can easily see in the Interactive Model Plot.
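
The `Sig` column in the full output shows 0.000, which is a rounded value. As a rough check, the sketch below recomputes the two-sided P-values from the reported t statistics and the residual degrees of freedom (12, from the ANOVA table), and rebuilds the 95% interval for the slope difference from its estimate and standard error. The object names are made up for illustration.

```r
# t statistics copied from the Parameter Estimates table
t_stats  <- c(ProfessionAthlete = 18.340, "Age:ProfessionAthlete" = -17.294)
df_resid <- 12   # residual DF from the ANOVA table

# Two-sided P-value: probability of a t statistic at least this extreme
p_values <- 2 * pt(-abs(t_stats), df = df_resid)
p_values < 0.001

# 95% interval for the slope difference: estimate +/- t critical value * SE
ci_slope_diff <- -5.063 + c(-1, 1) * qt(0.975, df_resid) * 0.293
round(ci_slope_diff, 3)
```

The interval for the slope difference does not contain zero, which agrees with the small P-value.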

Movie Genres and Costs

  • Is length of a movie (Runtime) a good predictor of the movie budget?

  • Does the relationship between movie length and budget differ by movie genre?

Import Movie Data

```{r import movies data, echo=T}
movies <- read_csv("data/movies.csv", show_col_types = FALSE)  # Import and examine data
head(movies, 4) |> kable()

movies |> select(Genre) |> table()  # use table to examine categories in the data
```
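| Movie                 | Genre             | Budget | Runtime |
|:----------------------|:------------------|-------:|--------:|
| Paranormal Activity 3 | Suspense / Horror | 5      | 83      |
| The Others            | Suspense / Horror | 17     | 104     |
| The Lincoln Lawyer    | Suspense / Horror | 40     | 118     |
| Fright Night          | Suspense / Horror | 30     | 106     |
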
Genre
           Action Suspense / Horror 
               10                10 

Examine Correlations in Movie Data

Note: If categories have different slopes, correlations for whole dataset will be misleading.

```{r correlations movie data, echo=T}
movies |> select(Budget, Runtime) |> cor() |> round(2) # all data

movies |> filter(Genre == "Action") |>                 # action movies
  select(Budget, Runtime) |> cor() |> round(2)

movies |> filter(Genre == "Suspense / Horror") |>      # suspense/horror movies
  select(Budget, Runtime) |> cor() |> round(2)
```
        Budget Runtime
Budget    1.00    0.76
Runtime   0.76    1.00
        Budget Runtime
Budget    1.00    0.95
Runtime   0.95    1.00
        Budget Runtime
Budget    1.00    0.99
Runtime   0.99    1.00

Explore and Plot Data

  • Scatter plot shows that a regression model should be created with

    • Different intercepts for each genre

    • Different slopes for each genre

Interactive Model Plot - Movie Data

Movie Genres Regression Model

Again, we can examine and interpret the regression model output.

Abridged Output

Lecture 14 In-class Exercises - Q3

What is the intercept term (estimated beta) for the Suspense / Horror SLR model?

  • Specify answer to two decimal places.

Abridged Output

  • GenreSuspense / Horror term: Difference from Action (baseline) model Intercept

  • Runtime:GenreSuspense / Horror term: Difference from Action model Slope

    • Budget = (-286.67 + 218.75) + (3.25 - 2.35)*Runtime

    • Suspense / Horror SLR Model: Budget = ____ + 0.9*Runtime
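
Another way to see the two lines is to write the interaction model as one equation with a 0/1 indicator for the non-baseline genre. The sketch below copies the four estimates from the worked equation above; the function name `est_budget`, the argument `is_sh`, and the 100-unit runtime are made up for illustration.

```r
# Estimates copied from the worked equation (baseline = Action)
b0 <- -286.67   # Action intercept
b1 <-    3.25   # Action slope for Runtime
b2 <-  218.75   # intercept difference for Suspense / Horror
b3 <-   -2.35   # slope difference for Suspense / Horror

# is_sh is 1 for Suspense / Horror and 0 for Action
est_budget <- function(runtime, is_sh) {
  b0 + b1 * runtime + b2 * is_sh + b3 * runtime * is_sh
}

# Hypothetical movie with Runtime = 100 in each genre
est_budget(100, is_sh = 0)
est_budget(100, is_sh = 1)

# Slope for each genre = change in estimated budget per one-unit increase in Runtime
est_budget(101, 0) - est_budget(100, 0)
est_budget(101, 1) - est_budget(100, 1)
```

When `is_sh` is 0 the two difference terms drop out and only the baseline line remains; when it is 1 they are added to the baseline intercept and slope.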

Determining Statistical Significance

Conclusions from Movie Genre Model

Abridged Output

  • P-value for diff. in intercepts (GenreSuspense / Horror): < 0.001

    • Intercepts for these two distinct genre SLR models are _____ (Next Question).
  • P-value for diff. in slopes (Runtime:GenreSuspense / Horror term): < 0.001

    • Slopes for these two distinct genre SLR models are _____ (Next Question).
  • These interpretations are in agreement with what we can easily see in the Interactive Model Plot.

Lecture 14 In-class Exercises - Q4

Abridged Output

The difference in intercepts and the difference in slopes between the model for Action movies and the model for Suspense / Horror movies are both _____.

A. statistically significant

B. statistically insignificant

Lecture 14 In-class Exercises - Q5

Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.


Fill in the blank:

The smaller the P-value, the _____ evidence there is that the Beta coefficient is non-zero and the term is useful to the model.

HW 6 - Questions 7 - 16

  • Dataset has three categories of Diamonds:

    • Colorless, Faint yellow, and Nearly colorless
  • Colorless is first alphabetically so that is the baseline category by default.

    • Each color category has unique intercept AND a unique slope.

    • The interactive model plot and abridged regression output are provided.

    • All Blackboard questions can be answered by rendering .qmd file to examine .html output.

  • Helpful TIP: In addition to other recommended options, change preview option (see next slide).
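
The baseline rule for the three diamond categories can be checked directly in R. This is a small sketch, not part of the HW 6 file: the quantitative predictor `x` and its values are hypothetical, and only the category labels come from the slide.

```r
# The three category labels from HW 6, deliberately out of order
color <- factor(c("Nearly colorless", "Faint yellow", "Colorless"))

# Levels are sorted alphabetically; the first level is the baseline
levels(color)

# Terms R would create for an interaction model with a quantitative
# predictor x (hypothetical name and values)
x <- c(0.5, 0.7, 0.9)
colnames(model.matrix(~ x * color))

# relevel() picks a different baseline if one is wanted
levels(relevel(color, ref = "Faint yellow"))
```

With three categories there is one baseline line plus two intercept-difference terms and two slope-difference terms, one pair for each non-baseline category.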

HW 6 - Change HTML Preview Option

  • For HW 6 you do not have to write any R code.

  • Instead you are expected to correctly interpret provided output.

  • Quiz 2 will have similar output WITHOUT the interactive plots.

  • Change the following option in the Basic tab of the R Markdown options:

Show output preview in Viewer Pane


Looking Ahead - What’s Next?

  • This week, ALL of the categorical models could be simplified to multiple SLRs, with same or different slopes.

  • ALL of the variables have had P-values less than 0.05 so the terms were all useful.

  • There are many models for which these two facts are not true.

  • Time permitting, here’s a brief look at a dataset with many explanatory variables:

Charges ln_charges Age Sex BMI Children Smoker Region
16884.924 9.734176 19 female 27.90 0 yes southwest
1725.552 7.453303 18 male 33.77 1 no southeast
4449.462 8.400538 28 male 33.00 3 no southeast

Insurance Data Model and Variable Selection

  • There are 3 quantitative variables:

    • Age, BMI, and Children
  • There are 3 categorical variables:

    • Sex, Smoker, Region
  • There are literally hundreds of possible models including interaction terms.

  • Note that an interaction can also be between two quantitative variables.

  • You can also have interaction terms with three variables (but I try to avoid those).

  • How do we sort through all of the possible options?

    • Software helps us pare down all the possible models to a few choices.
    • Analyst then uses critical thinking and examination of data to determine final model.
  • Model and variable selection methods are the next set of topics.
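
To get a rough sense of how quickly the options multiply, the sketch below counts candidate terms using R's formula syntax. It needs no data, because `terms()` only expands the formula. The counts are of model terms, not coefficients (a categorical variable such as Region contributes more than one coefficient), and the object names are made up for illustration.

```r
# The six explanatory variables in the insurance data
main_effects <- c("Age", "BMI", "Children", "Sex", "Smoker", "Region")

# Number of different subsets of main effects alone (including the empty model)
2^length(main_effects)

# Number of possible two-variable interaction terms
choose(length(main_effects), 2)

# Formula shorthand: ( ... )^2 means all main effects plus all two-way interactions
f_two_way <- ~ (Age + BMI + Children + Sex + Smoker + Region)^2
length(attr(terms(f_two_way), "term.labels"))

# An interaction between two quantitative variables, and a three-variable interaction
attr(terms(~ Age * BMI), "term.labels")
attr(terms(~ Age * BMI * Smoker), "term.labels")
```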

Key Points from Today

  • Categorical Interaction Model

    • Separate SLR for each group.
    • BOTH slopes and intercepts can differ by category
    • We can test if interaction term (slope difference) is significant.
  • Next Topics

    • Comparing model goodness of fit
    • Introduction to variable selection
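
As a small illustration of testing the slope difference, the sketch below simulates toy data (not any of the lecture datasets), fits a parallel lines model and an interaction model with base R `lm()`, and compares them. All names, the seed, and the simulated values are made up for illustration.

```r
set.seed(345)  # arbitrary seed so the simulated noise is reproducible

# Simulated toy data: group A rises with x, group B falls with x
toy <- data.frame(
  x     = rep(1:10, times = 2),
  group = rep(c("A", "B"), each = 10)
)
toy$y <- ifelse(toy$group == "A", 5 + 2 * toy$x, 30 - 3 * toy$x) +
  rnorm(nrow(toy), sd = 1)

fit_parallel    <- lm(y ~ x + group, data = toy)   # same slope, different intercepts
fit_interaction <- lm(y ~ x * group, data = toy)   # different slopes and intercepts

# t test for the slope difference (row "x:groupB")
slope_row <- coef(summary(fit_interaction))["x:groupB", ]
slope_row

# F test comparing the two models; with one added term, F equals t squared
comparison <- anova(fit_parallel, fit_interaction)
comparison
```

Both tests ask the same question here: whether allowing different slopes improves on the parallel lines model.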


  • HW 6 is now available and is due on Wed. 3/5.

  • Date of Quiz 2 has been changed to Tuesday, 4/1.

To submit an Engagement Question or Comment about material from Lecture 14: Submit it by midnight today (day of lecture).