Lecture 14 - Categorical Regression - Interaction Model

Penelope Pooler Eisenbies
BUA 345

2024-02-28

Housekeeping

Today’s plan 📋
- Review Parallel Lines Model
- Introduce Interaction term and Interaction Model
- Work through how to interpret model output
- Introduce HW 6
- Talk about next steps

Review Question - Import data

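library(tidyverse)  # read_csv(), glimpse(), select(), filter()
library(olsrr)      # ols_regress()
house_remodel <- read_csv("data/house_remodel.csv", show_col_types = FALSE) |>  # import data
  glimpse(width = 75)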

Rows: 57
Columns: 3
$ Price       <dbl> 554000, 484000, 391000, 354000, 410000, 349000, 40900…
$ Square_Feet <dbl> 2702, 2378, 1846, 1820, 1794, 1768, 1752, 1719, 1676,…
$ Remodeled   <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…

house_remodel |> select(Remodeled) |> table()   # number of obs by category

Remodeled
 No Yes 
 29  28

house_remodel |> select(Price, Square_Feet) |> cor() |> round(2)  # correlation of price & sq. ft.

            Price Square_Feet
Price        1.00        0.75
Square_Feet  0.75        1.00

💥 Lecture 14 In-class Exercises - Q1 - Review 💥

Based on the Parameter Estimates table for the specified categorical regression model, which category is the baseline category?

(house_rem_cat_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel))

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.924       RMSE                    32079.481 
R-Squared                   0.854       MSE                1086264973.997 
Adj. R-Squared              0.848       Coef. Var                   8.223 
Pred R-Squared              0.836       AIC                      1352.620 
MAE                     28012.693       SBC                      1360.792 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                      ANOVA                                        
----------------------------------------------------------------------------------
                        Sum of                                                    
                       Squares        DF         Mean Square       F         Sig. 
----------------------------------------------------------------------------------
Regression    341892090457.686         2    170946045228.843     157.37    0.0000 
Residual       58658308595.823        54      1086264973.997                      
Total         400550399053.509        56                                          
----------------------------------------------------------------------------------

                                         Parameter Estimates                                          
-----------------------------------------------------------------------------------------------------
       model          Beta    Std. Error    Std. Beta      t        Sig          lower         upper 
-----------------------------------------------------------------------------------------------------
 (Intercept)    137549.093     17620.447                  7.806    0.000    102222.224    172875.963 
 Square_Feet       137.879        10.836        0.670    12.725    0.000       116.155       159.602 
RemodeledYes     90917.216      8834.268        0.542    10.291    0.000     73205.575    108628.858 
-----------------------------------------------------------------------------------------------------

A Quick Comment About Formatted Output

In HW 6 and below I use R coding to format the output to make it easier to read.
The values are IDENTICAL to the unformatted output.
Note: Formatted Output will differ in appearance depending on where it is viewed, i.e. slides, html file, or .Qmd file.

Formatted Abridged Output (Similar to HW 6)

model	Beta	Std.Error	t
(Intercept)	137549.09	17620.45	7.81
Square_Feet	137.88	10.84	12.72
RemodeledYes	90917.22	8834.27	10.29

Quick Review of Categorical Regression

On Tuesday we covered the Parallel Lines Model:
- A Parallel Lines model has two X variables, one quantitative and one categorical variable.
- Model estimates a separate SLR model for each category in the categorical variable.
- Model assumes all categories have the same SLOPE.
- Model estimates a separate INTERCEPT for each category.
- Model output shows results of a hypothesis test to determine if each non-baseline category’s intercept is significantly different from baseline intercept.
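To see how R turns a categorical variable into a model term, here is a minimal sketch using a small made-up data frame (`toy` is hypothetical and is NOT the house_remodel data). It only builds the design matrix, so no model is fit.

```r
# Hypothetical toy data with the same column names as the house data
toy <- data.frame(
  Square_Feet = c(1500, 1800, 2100, 2400),
  Remodeled   = c("No", "Yes", "No", "Yes")
)

# model.matrix() shows the columns a regression would actually use
X <- model.matrix(~ Square_Feet + Remodeled, data = toy)

colnames(X)          # the baseline category (No) gets no column of its own
X[, "RemodeledYes"]  # indicator: 1 for remodeled houses, 0 for the baseline
```

The indicator column is what shifts the intercept for the non-baseline category while the slope stays the same.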

Interactive Plot of House Remodel Data

Calculations from House Model

By default R chooses baseline categories alphabetically
- No is before Yes so un-Remodeled houses are the baseline
- Un-Remodeled (No) SLR Model:
  - Est. Price = 137549.093 + 137.879 * Square_Feet
- Remodeled (Yes) SLR Model:
  - Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216
  - Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet
  - Est. Price = 228466.3 + 137.879 * Square_Feet
Interpretation:
- Prices of remodeled houses are about 91 thousand dollars more than similar houses without remodeling, after accounting for square footage.
- This difference is statistically significant (P-value < 0.001)

HW 6 - Questions 1 - 6

Analyses and output are identical to what was done in Lecture 13.
The output shown below is a screenshot from the provided .html file for HW 6.

Categorical Regression with Interactions

The categorical models covered so far assume that the SLR models for all categories have the same slope.
How do we examine that assumption?
For example:
- In the celebrity data in Lecture 13, the data showed a decrease in earnings as celebrities got older.
- Slope was assumed to be IDENTICAL for both males and females
- That may not be true for all celebrities.
In the following small dataset, we will look at male celebrities only and examine whether actors’ and athletes’ salaries follow the same trend.

Import and Examine Celebrity Profession Data

# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types = FALSE) |>
  glimpse(width = 75)

Rows: 16
Columns: 4
$ Celebrity  <chr> "Jim Parsons", "Johnny Depp", "Tom Cruise", "Leonardo …
$ Earnings   <dbl> 29, 48, 53, 29, 61, 32, 36, 41, 81, 44, 70, 77, 68, 38…
$ Age        <dbl> 44, 53, 55, 43, 62, 45, 49, 50, 29, 40, 32, 32, 35, 42…
$ Profession <chr> "Actor", "Actor", "Actor", "Actor", "Actor", "Actor", …

# use table to summarize data by category
celeb_prof |> select(Profession) |> table()

Profession
  Actor Athlete 
      8       8

Examine Correlations in Celebrity Professions Data

Note: If categories have different slopes, correlations for whole dataset will be misleading.

celeb_prof |> select(Earnings, Age) |> cor() |> round(2) # all data

         Earnings   Age
Earnings     1.00 -0.46
Age         -0.46  1.00

celeb_prof |> filter(Profession=="Actor") |>             # actors only
  select(Earnings, Age) |> cor() |> round(2)

         Earnings  Age
Earnings     1.00 0.99
Age          0.99 1.00

celeb_prof |> filter(Profession=="Athlete") |>           # athletes only
  select(Earnings, Age) |> cor() |> round(2)

         Earnings   Age
Earnings     1.00 -0.98
Age         -0.98  1.00
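The same pattern can be reproduced with a tiny made-up example. This sketch uses hypothetical data (NOT the celebrity data) in which each group follows a perfect line, but the two lines go in opposite directions.

```r
# Hypothetical data: group A trends up, group B trends down
toy <- data.frame(
  x     = c(40:47, 28:35),
  group = rep(c("A", "B"), each = 8)
)
toy$y <- ifelse(toy$group == "A", 2 * toy$x - 50, 180 - 3 * toy$x)

cor_all <- cor(toy$x, toy$y)                             # all data pooled
cor_a   <- with(subset(toy, group == "A"), cor(x, y))    # group A only
cor_b   <- with(subset(toy, group == "B"), cor(x, y))    # group B only

round(c(all = cor_all, A = cor_a, B = cor_b), 2)
```

The pooled correlation is a single number, so it cannot show that one group has a positive trend and the other has a negative trend.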

Explore and Plot Data

Scatter plot shows that a regression model should be created with
- Different intercepts for each profession
- Different slopes for each profession

Interactive Model Plot - Celebrity Profession Data

Regression Model

Now that we understand the data and linear trends, we can examine and interpret the regression model output.

(celeb_interaction_ols <- ols_regress(Earnings ~ Age + Profession + Age*Profession,
                                      data = celeb_prof, iterm = TRUE))

                        Model Summary                          
--------------------------------------------------------------
R                       0.987       RMSE                2.640 
R-Squared               0.974       MSE                 9.291 
Adj. R-Squared          0.967       Coef. Var           6.058 
Pred R-Squared          0.951       AIC                86.467 
MAE                     2.265       SBC                90.330 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4131.949         3       1377.316    148.246    0.0000 
Residual       111.489        12          9.291                      
Total         4243.437        15                                     
---------------------------------------------------------------------

                                         Parameter Estimates                                           
------------------------------------------------------------------------------------------------------
                model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
------------------------------------------------------------------------------------------------------
          (Intercept)    -50.297         9.054                  -5.555    0.000    -70.023    -30.571 
                  Age      1.824         0.179        0.983     10.170    0.000      1.433      2.215 
    ProfessionAthlete    227.218        12.389        0.263     18.340    0.000    200.224    254.212 
Age:ProfessionAthlete     -5.063         0.293       -1.487    -17.294    0.000     -5.701     -4.425 
------------------------------------------------------------------------------------------------------

Model Interpretation - Interpreting model coefficients (betas)

Baseline category is first alphabetically
- Actor comes before Athlete in the alphabet so Actor is the baseline category.
- Actor SLR Model: Earnings = -50.297 + 1.824*Age
ProfessionAthlete term: Difference from Actor (baseline) model Intercept
Age:ProfessionAthlete term: Difference from Actor model Slope
- Athlete SLR Model requires some calculations:
  - Earnings = -50.297 + 1.824*Age + 227.218 - 5.063*Age
  - Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age
  - Earnings = 176.921 - ____*Age
  - Athlete SLR Model: Earnings = 176.921 - ____*Age
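The same arithmetic can be done in R. This is a sketch that copies the four betas from the Parameter Estimates table into a named vector by hand (the object names are made up). Run it to check your calculation.

```r
# Betas copied from the Parameter Estimates table
b <- c(
  "(Intercept)"           = -50.297,
  "Age"                   = 1.824,
  "ProfessionAthlete"     = 227.218,
  "Age:ProfessionAthlete" = -5.063
)

# Baseline (Actor) line uses the intercept and Age terms as they are
actor_line <- c(intercept = b[["(Intercept)"]], slope = b[["Age"]])

# Athlete line adds the two "difference from baseline" terms
athlete_line <- c(
  intercept = b[["(Intercept)"]] + b[["ProfessionAthlete"]],
  slope     = b[["Age"]] + b[["Age:ProfessionAthlete"]]
)

round(actor_line, 3)
round(athlete_line, 3)
```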

💥 Lecture 14 In-class Exercises - Q2 💥

What is the slope term (estimated beta for Age) for the Athlete SLR model?

Specify answer to two decimal places.

Abridged Output

ProfessionAthlete term: Difference from Actor (baseline) model Intercept
Age:ProfessionAthlete term: Difference from Actor model Slope
- Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age
- Athlete SLR Model: Earnings = 176.921 - ____*Age

Determining Statistical Significance

In this case, we don’t need to examine the P-values because the model differences between groups are so clear (but we will).
The two intercepts and two slopes are VERY different.
Reminder of Hypothesis Testing concepts:
- The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.
- If this sentence is not clear to you, you are responsible for reviewing the Review Materials on Hypothesis Tests and Significance Tests (and other related topics).
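As a reminder of where the t, P-value, and confidence limits in the output come from, here is a sketch for the Age:ProfessionAthlete row. The beta and standard error are copied from the Parameter Estimates table and the residual DF (12) from the ANOVA table. Because those inputs are rounded, the results should be close to, but not exactly, what the software reports.

```r
# Values copied from the regression output (rounded)
beta      <- -5.063
std_error <- 0.293
df_resid  <- 12

t_stat  <- beta / std_error                      # test statistic
p_value <- 2 * pt(-abs(t_stat), df = df_resid)   # two-sided P-value

# 95% confidence interval for the true Beta
t_crit <- qt(0.975, df = df_resid)
ci     <- beta + c(lower = -1, upper = 1) * t_crit * std_error

t_stat
p_value < 0.05   # TRUE means the term is significant at the 0.05 cutoff
round(ci, 3)
```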

Conclusions from Actor and Athlete Model

Abridged Output

P-value for difference in intercepts (ProfessionAthlete): < 0.001
- Actors SLR model and Athletes SLR model intercepts are significantly different.
P-value for difference in slopes (Age:ProfessionAthlete): < 0.001
- Actors SLR model and Athletes SLR model slopes are significantly different.
These interpretations are in agreement with what we can easily see in the Interactive Model Plot.
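One way to convince yourself that the interaction model really is two separate SLR models in one fit: the sketch below uses base R `lm()` on a small made-up dataset (`toy` is hypothetical and is NOT the celeb_prof data) and compares the athlete line rebuilt from the interaction model with an SLR fit to the athlete rows only.

```r
# Hypothetical toy data with the same column names as the celebrity data
toy <- data.frame(
  Age        = c(40, 45, 50, 55, 60, 25, 28, 31, 34, 37),
  Earnings   = c(25, 30, 41, 50, 58, 90, 77, 60, 48, 30),
  Profession = rep(c("Actor", "Athlete"), each = 5)
)

# One interaction model fit to all rows
fit_all <- lm(Earnings ~ Age * Profession, data = toy)
b <- coef(fit_all)

# Athlete line rebuilt from the interaction model coefficients
athlete_from_interaction <- c(
  intercept = b[["(Intercept)"]] + b[["ProfessionAthlete"]],
  slope     = b[["Age"]] + b[["Age:ProfessionAthlete"]]
)

# Separate SLR fit to the athlete rows only
fit_athlete <- lm(Earnings ~ Age, data = subset(toy, Profession == "Athlete"))
athlete_separate <- c(
  intercept = coef(fit_athlete)[["(Intercept)"]],
  slope     = coef(fit_athlete)[["Age"]]
)

# Compare the two versions of the athlete line (numeric tolerance)
all.equal(athlete_from_interaction, athlete_separate)
```

The estimated lines match, but the standard errors and P-values generally will not, because the interaction model pools the residual variation from both groups.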

Movie Genres and Costs

Is length of a movie (Runtime) a good predictor of the movie budget?
Does the relationship between movie length and budget differ by movie genre?

Import Movie Data

movies <- read_csv("data/movies.csv", show_col_types = FALSE) |>  # Import and examine data
  glimpse(width = 75)

Rows: 20
Columns: 4
$ Movie   <chr> "Paranormal Activity 3", "The Others", "The Lincoln Lawye…
$ Genre   <chr> "Suspense / Horror", "Suspense / Horror", "Suspense / Hor…
$ Budget  <dbl> 5, 17, 40, 30, 37, 27, 25, 16, 85, 12, 195, 125, 140, 90,…
$ Runtime <dbl> 83, 104, 118, 106, 114, 105, 99, 91, 168, 84, 148, 131, 1…

movies |> select(Genre) |> table()  # use table to examine categories in the data

Genre
           Action Suspense / Horror 
               10                10

Examine Correlations in Movie Data

Note: If categories have different slopes, correlations for whole dataset will be misleading.

movies |> select(Budget, Runtime) |> cor() |> round(2) # all data

        Budget Runtime
Budget    1.00    0.76
Runtime   0.76    1.00

movies |> filter(Genre == "Action") |>                 # action movies
  select(Budget, Runtime) |> cor() |> round(2)

        Budget Runtime
Budget    1.00    0.95
Runtime   0.95    1.00

movies |> filter(Genre == "Suspense / Horror") |>      # suspense/horror movies
  select(Budget, Runtime) |> cor() |> round(2)

        Budget Runtime
Budget    1.00    0.99
Runtime   0.99    1.00

Explore and Plot Data

Scatter plot shows that a regression model should be created with
- Different intercepts for each genre
- Different slopes for each genre

Interactive Model Plot - Movie Data

Regression Model

Again, we can examine and interpret the regression model output.

(movie_interaction_ols <- ols_regress(Budget ~ Runtime + Genre +
                                        Runtime*Genre, data = movies, iterm = TRUE))
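A side note on the formula syntax: in an R formula, `Runtime*Genre` already expands to both main effects plus the interaction, so the longer formula used here asks for the same three terms. The sketch below checks this with base R `terms()`. No data or model fit is needed, and the object names are made up.

```r
# The formula as written in the lecture code
f_long  <- Budget ~ Runtime + Genre + Runtime*Genre

# A shorter formula that relies on * expanding to main effects + interaction
f_short <- Budget ~ Runtime*Genre

labels_long  <- attr(terms(f_long), "term.labels")
labels_short <- attr(terms(f_short), "term.labels")

labels_long
identical(labels_long, labels_short)
```

Writing out all three terms, as in the lecture code, is still a reasonable choice because it makes the model terms explicit.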

Abridged Output

💥 Lecture 14 In-class Exercises - Q3 💥

What is the intercept term (estimated beta) for the Suspense / Horror SLR model?

Specify answer to two decimal places.

Abridged Output

GenreSuspense / Horror term: Difference from Action (baseline) model Intercept
Runtime:GenreSuspense / Horror term: Difference from Action model Slope
- Budget = (-286.67 + 218.75) + (3.25 - 2.35)*Runtime
- Suspense / Horror SLR Model: Budget = ____ + 0.9*Runtime

Determining Statistical Significance

In this case, we don’t need to examine the P-values because the model differences between groups are so clear (but we will).
The two intercepts and two slopes are VERY different.
Reminder of Hypothesis Testing concepts:
- The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.
- If this sentence is not clear to you, you are responsible for reviewing the Review Materials on Hypothesis Tests and Significance Tests (and other related topics).

Conclusions from Movie Genre Model

Abridged Output

P-value for difference in intercepts (GenreSuspense / Horror): < 0.001
- Intercepts for these two distinct genre SLR models are _____ (Next Question).
P-value for difference in slopes (Runtime:GenreSuspense / Horror term): < 0.001
- Slopes for these two distinct genre SLR models are _____ (Next Question).
These interpretations are in agreement with what we can easily see in the Interactive Model Plot.

💥 Lecture 14 In-class Exercises - Q4 💥

Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.

Fill in the blank:

The smaller the P-value, the _____ evidence there is that the Beta coefficient is non-zero and the term is useful to the model.

HW 6 - Questions 7 - 16

Dataset has three categories of Diamonds:
- Colorless
- Faint yellow
- Nearly colorless

Colorless is first alphabetically so that is the baseline category by default.
- Each color category has unique intercept.
- Each color category has a unique slope.
- The interactive model plot and abridged regression output are provided.
- All Blackboard questions can be answered using the provided .html output and verified using the provided interactive plot.
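With three categories, the model has two "difference from baseline" intercept terms and two "difference from baseline" slope terms. The sketch below shows the term names R would create. The data frame is made up, and the column names `x` and `Color` are placeholders, NOT the variable names in the HW 6 dataset.

```r
# Hypothetical toy data: x stands in for the quantitative predictor
toy <- data.frame(
  x     = c(0.5, 0.7, 0.9, 0.6, 0.8, 1.0),
  Color = rep(c("Colorless", "Faint yellow", "Nearly colorless"), times = 2)
)

# Column names of the design matrix for an interaction model
term_names <- colnames(model.matrix(~ x * Color, data = toy))
term_names
```

Colorless has no terms of its own because it is the baseline. Its line uses the intercept and x terms directly.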

Looking Ahead - What’s Next?

This week, ALL of the categorical models we have covered could be simplified to multiple SLRs, with same or different slopes.
ALL of the variables have had P-values less than 0.05 so the terms were all useful.
There are many, many model options where these two facts are not true.
Time permitting, here’s a brief look at a dataset with many explanatory variables:

insurance <- read_csv("data/Insurance.csv", show_col_types = FALSE) |>
  glimpse(width = 75)

Rows: 1,338
Columns: 8
$ Charges    <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 37…
$ ln_charges <dbl> 9.734176, 7.453302, 8.400538, 9.998092, 8.260197, 8.23…
$ Age        <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56…
$ Sex        <chr> "female", "male", "male", "male", "male", "female", "f…
$ BMI        <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440…
$ Children   <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, …
$ Smoker     <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no",…
$ Region     <chr> "southwest", "southeast", "southeast", "northwest", "n…

Insurance Data Variable and Model Selection

There are 3 quantitative explanatory variables:
- Age, BMI, and Children
There are 3 categorical variables:
- Sex, Smoker, Region
There are literally hundreds of possible models including interaction terms.
Note that an interaction can also be between two quantitative variables.
You can also have interaction terms with three variables (but I try to avoid those).
How do we sort through all of the possible options?
- Software helps us pare down all the possible models to a few choices.
- Analyst then uses critical thinking and examination of data to determine final model.
Variable and Model Selection methods are the next set of topics.
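To get a feel for how quickly the options pile up, here is a rough count. It is a sketch that assumes only the six explanatory variables listed above and ignores transformations and three-variable interactions.

```r
# The six explanatory variables in the insurance data
predictors <- c("Age", "BMI", "Children", "Sex", "Smoker", "Region")

# Models with main effects only: every non-empty subset of the predictors
n_main_effect_models <- 2^length(predictors) - 1

# Every possible two-variable interaction term
two_way <- combn(predictors, 2, FUN = paste, collapse = ":")

n_main_effect_models
length(two_way)
head(two_way, 3)
```

Each of those two-variable interactions could be added to or left out of a model, which is why software is used to narrow down the candidates.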

Key Points from Today

Categorical Interaction Model
- Separate SLR for each group.
- BOTH slopes and intercepts can differ by category
- We can test if interaction term (slope difference) is significant.
Next Topics
- Comparing model goodness of fit
- Introduction to variable selection
HW 6 is now available and is due on Wed. 3/6.

To submit an Engagement Question or Comment about material from Lecture 14: Submit by midnight today (day of lecture). Click on Link next to the ❓ under Lecture 14