Lecture 14 - Categorical Regression - Interaction Model

Penelope Pooler Eisenbies
BUA 345

2024-02-28

Housekeeping

  • Today’s plan 📋

    • Review Parallel Lines Model

    • Introduce Interaction term and Interaction Model

    • Work through how to interpret model output

    • Introduce HW 6

    • Talk about next steps

Review Question - Import data

house_remodel <- read_csv("data/house_remodel.csv", show_col_types = F) |>  # import data
  glimpse(width=75) 
Rows: 57
Columns: 3
$ Price       <dbl> 554000, 484000, 391000, 354000, 410000, 349000, 40900…
$ Square_Feet <dbl> 2702, 2378, 1846, 1820, 1794, 1768, 1752, 1719, 1676,…
$ Remodeled   <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…
house_remodel |> select(Remodeled) |> table()   # number of obs by category
Remodeled
 No Yes 
 29  28 
house_remodel |> select(Price, Square_Feet) |> cor() |> round(2)  # correlation of price & sq. ft.
            Price Square_Feet
Price        1.00        0.75
Square_Feet  0.75        1.00

💥 Lecture 14 In-class Exercises - Q1 - Review 💥

Based on the Parameter Estimates table for the specified categorical regression model, which category is the baseline category?

(house_rem_cat_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel))
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.924       RMSE                    32079.481 
R-Squared                   0.854       MSE                1086264973.997 
Adj. R-Squared              0.848       Coef. Var                   8.223 
Pred R-Squared              0.836       AIC                      1352.620 
MAE                     28012.693       SBC                      1360.792 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                      ANOVA                                        
----------------------------------------------------------------------------------
                        Sum of                                                    
                       Squares        DF         Mean Square       F         Sig. 
----------------------------------------------------------------------------------
Regression    341892090457.686         2    170946045228.843     157.37    0.0000 
Residual       58658308595.823        54      1086264973.997                      
Total         400550399053.509        56                                          
----------------------------------------------------------------------------------

                                         Parameter Estimates                                          
-----------------------------------------------------------------------------------------------------
       model          Beta    Std. Error    Std. Beta      t        Sig          lower         upper 
-----------------------------------------------------------------------------------------------------
 (Intercept)    137549.093     17620.447                  7.806    0.000    102222.224    172875.963 
 Square_Feet       137.879        10.836        0.670    12.725    0.000       116.155       159.602 
RemodeledYes     90917.216      8834.268        0.542    10.291    0.000     73205.575    108628.858 
-----------------------------------------------------------------------------------------------------

A Quick Comment About Formatted Output

  • In HW 6 and in the slides below, I use R code to format the output so that it is easier to read.

  • The values are IDENTICAL to the unformatted output.

  • Note: Formatted Output will differ in appearance depending on where it is viewed, i.e. slides, html file, or .Qmd file.

Formatted Abridged Output (Similar to HW 6)

model              Beta    Std.Error        t    Sig
(Intercept)   137549.09     17620.45     7.81      0
Square_Feet      137.88        10.84    12.72      0
RemodeledYes   90917.22      8834.27    10.29      0

Quick Review of Categorical Regression

  • On Tuesday we covered the Parallel Lines Model:

    • A Parallel Lines model has two X variables, one quantitative and one categorical variable.

    • Model estimates a separate SLR model for each category in the categorical variable.

    • Model assumes all categories have the same SLOPE.

    • Model estimates a separate INTERCEPT for each category.

    • Model output shows results of a hypothesis test to determine if each non-baseline category’s intercept is significantly different from baseline intercept.

Interactive Plot of House Remodel Data

Calculations from House Model

  • By default R chooses baseline categories alphabetically

    • No is before Yes so un-Remodeled houses are the baseline

    • Un-Remodeled (No) SLR Model:

      • Est. Price = 137549.093 + 137.879 * Square_Feet
    • Remodeled (Yes) SLR Model:

      • Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216

      • Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet

      • Est. Price = 228466.3 + 137.879 * Square_Feet

  • Interpretation:

    • Prices of remodeled houses are about 91 thousand dollars more than similar houses without remodeling, after accounting for square footage.

    • This difference is statistically significant (P-value < 0.001)
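
The arithmetic above can be checked with a few lines of R. This is a minimal sketch that hard-codes the three estimates from the Parameter Estimates table instead of refitting the model. The object names (`b0`, `b_sqft`, `b_remodeled`, `est_price`) and the 2,000 sq. ft. example house are illustrative assumptions, not part of the lecture code.

```r
# Estimates copied from the Parameter Estimates table (parallel lines model)
b0          <- 137549.093   # (Intercept): intercept for the baseline (Remodeled = No)
b_sqft      <- 137.879      # Square_Feet: slope shared by both categories
b_remodeled <- 90917.216    # RemodeledYes: intercept shift for remodeled houses

# Intercept of the Remodeled (Yes) SLR model
int_yes <- b0 + b_remodeled

# Hypothetical helper: estimated price for a given size and remodel status
est_price <- function(square_feet, remodeled = c("No", "Yes")) {
  remodeled <- match.arg(remodeled)
  shift <- if (remodeled == "Yes") b_remodeled else 0
  b0 + shift + b_sqft * square_feet
}

# Two houses of the same size (2,000 sq. ft. chosen for illustration only)
price_no  <- est_price(2000, "No")
price_yes <- est_price(2000, "Yes")

round(int_yes, 1)       # intercept of the Remodeled line
price_yes - price_no    # vertical gap between the two parallel lines
```

Because the slope is shared, the gap between the two estimated prices is the same at every square footage and equals the `RemodeledYes` estimate.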

HW 6 - Questions 1 - 6

  • Analyses and output are identical to what was done in Lecture 13.

  • Output shown below is a screenshot from the provided .html file for HW 6.

Categorical Regression with Interactions

  • The categorical models covered so far assume that the SLR models for all categories have the same slope.

  • How do we examine that assumption?

  • For example:

    • In the celebrity data from Lecture 13, earnings decreased as celebrities got older.

    • Slope was assumed to be IDENTICAL for both males and females

    • That may not be true for all celebrities.

  • In the following small dataset, we will look at male celebrities only and examine whether actors' and athletes' earnings follow the same trend.

Import and Examine Celebrity Profession Data


# import and examine celeb profession dataset
celeb_prof <- read_csv("data/celeb_prof.csv", show_col_types=F) |>
  glimpse(width=75)
Rows: 16
Columns: 4
$ Celebrity  <chr> "Jim Parsons", "Johnny Depp", "Tom Cruise", "Leonardo …
$ Earnings   <dbl> 29, 48, 53, 29, 61, 32, 36, 41, 81, 44, 70, 77, 68, 38…
$ Age        <dbl> 44, 53, 55, 43, 62, 45, 49, 50, 29, 40, 32, 32, 35, 42…
$ Profession <chr> "Actor", "Actor", "Actor", "Actor", "Actor", "Actor", …
# use table to summarize data by category
celeb_prof |> select(Profession) |> table()
Profession
  Actor Athlete 
      8       8 

Examine Correlations in Celebrity Professions Data

Note: If categories have different slopes, correlations for the whole dataset can be misleading.

celeb_prof |> select(Earnings, Age) |> cor() |> round(2) # all data
         Earnings   Age
Earnings     1.00 -0.46
Age         -0.46  1.00
celeb_prof |> filter(Profession=="Actor") |>             # actors only
  select(Earnings, Age) |> cor() |> round(2)
         Earnings  Age
Earnings     1.00 0.99
Age          0.99 1.00
celeb_prof |> filter(Profession=="Athlete") |>           # athletes only
  select(Earnings, Age) |> cor() |> round(2)
         Earnings   Age
Earnings     1.00 -0.98
Age         -0.98  1.00
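
The same pattern can be reproduced with a tiny constructed example. The sketch below uses made-up data (the `toy` data frame, its columns, and the `wiggle` values are all assumptions for illustration, not lecture data): one group rises with `x`, the other falls, and the whole-dataset correlation hides both trends.

```r
# Constructed (not real) data: two groups with opposite trends
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  x     = rep(1:5, times = 2)
)

# Group A rises with x, group B falls with x.
# Small fixed wiggles keep the points from lying exactly on a line.
wiggle <- c(0.2, -0.1, 0.0, 0.1, -0.2)
toy$y <- ifelse(toy$group == "A", 10 + 2 * toy$x, 30 - 2 * toy$x) +
  rep(wiggle, times = 2)

cor_all <- cor(toy$x, toy$y)                            # whole dataset
cor_a   <- with(subset(toy, group == "A"), cor(x, y))   # group A only
cor_b   <- with(subset(toy, group == "B"), cor(x, y))   # group B only

round(c(all = cor_all, A = cor_a, B = cor_b), 2)
```

In this toy case the within-group correlations are close to +1 and -1, while the overall correlation is close to zero, which is why it helps to check correlations by category before modeling.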

Explore and Plot Data

  • Scatter plot shows that a regression model should be created with

    • Different intercepts for each profession

    • Different slopes for each profession

Interactive Model Plot - Celebrity Profession Data

Regression Model

Now that we understand the data and linear trends, we can examine and interpret the regression model output.

(celeb_interaction_ols<- ols_regress(Earnings ~ Age + Profession + Age*Profession,
                                     data=celeb_prof, iterm = T))
                        Model Summary                          
--------------------------------------------------------------
R                       0.987       RMSE                2.640 
R-Squared               0.974       MSE                 9.291 
Adj. R-Squared          0.967       Coef. Var           6.058 
Pred R-Squared          0.951       AIC                86.467 
MAE                     2.265       SBC                90.330 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4131.949         3       1377.316    148.246    0.0000 
Residual       111.489        12          9.291                      
Total         4243.437        15                                     
---------------------------------------------------------------------

                                         Parameter Estimates                                           
------------------------------------------------------------------------------------------------------
                model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
------------------------------------------------------------------------------------------------------
          (Intercept)    -50.297         9.054                  -5.555    0.000    -70.023    -30.571 
                  Age      1.824         0.179        0.983     10.170    0.000      1.433      2.215 
    ProfessionAthlete    227.218        12.389        0.263     18.340    0.000    200.224    254.212 
Age:ProfessionAthlete     -5.063         0.293       -1.487    -17.294    0.000     -5.701     -4.425 
------------------------------------------------------------------------------------------------------

Model Interpretation - Interpreting model coefficients (betas)

  • Baseline category is first alphabetically

    • Actor comes before Athlete in the alphabet so Actor is the baseline category.

    • Actor SLR Model: Earnings = -50.297 + 1.824*Age

  • ProfessionAthlete term: Difference from Actor (baseline) model Intercept

  • Age:ProfessionAthlete term: Difference from Actor model Slope

    • Athlete SLR Model requires some calculations:

      • Earnings = -50.297 + 1.824*Age + 227.218 - 5.063*Age

      • Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age

      • Earnings = 176.921 - ____*Age

      • Athlete SLR Model: Earnings = 176.921 - ____*Age
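
The coefficient bookkeeping above can be sketched with base R's `lm()` on constructed data where the two true lines are known in advance. Everything here (the `toy` data frame, the groups "A" and "B", and the chosen lines) is a made-up assumption for illustration; it is not the celebrity data.

```r
# Constructed (not real) data that follow two exact lines:
#   group "A" (baseline): y = 10 + 2 * x
#   group "B":            y = 40 - 1 * x
toy <- data.frame(
  grp = rep(c("A", "B"), each = 6),
  x   = rep(c(1, 2, 4, 5, 7, 8), times = 2)
)
toy$y <- ifelse(toy$grp == "A", 10 + 2 * toy$x, 40 - 1 * toy$x)

# In an R formula, x * grp expands to x + grp + x:grp
fit <- lm(y ~ x * grp, data = toy)
b   <- coef(fit)

# Baseline (A) line comes straight from the first two coefficients
line_a <- c(intercept = unname(b["(Intercept)"]),
            slope     = unname(b["x"]))

# Non-baseline (B) line: add the difference terms to the baseline values
line_b <- c(intercept = unname(b["(Intercept)"] + b["grpB"]),
            slope     = unname(b["x"] + b["x:grpB"]))

round(rbind(A = line_a, B = line_b), 3)
```

The four coefficients here play the same roles as `(Intercept)`, `Age`, `ProfessionAthlete`, and `Age:ProfessionAthlete` in the output above. Since `*` already includes the main effects, writing `Age + Profession + Age*Profession` specifies the same model as `Age*Profession`.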

💥 Lecture 14 In-class Exercises - Q2 💥

What is the slope term (estimated beta for Age) for the Athlete SLR model?

Specify answer to two decimal places.

Abridged Output

  • ProfessionAthlete term: Difference from Actor (baseline) model Intercept

  • Age:ProfessionAthlete term: Difference from Actor model Slope

    • Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age

    • Athlete SLR Model: Earnings = 176.921 - ____*Age

Determining Statistical Significance

Conclusions from Actor and Athlete Model

Abridged Output

  • P-value for difference in intercepts (ProfessionAthlete): < 0.001

    • Actors SLR model and Athletes SLR model intercepts are significantly different.
  • P-value for difference in slopes (Age:ProfessionAthlete): < 0.001

    • Actors SLR model and Athletes SLR model slopes are significantly different.
  • These interpretations are in agreement with what we can easily see in the Interactive Model Plot.
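
For readers who want to pull these P-values out programmatically, here is a minimal sketch using base R's `lm()` and `summary()` instead of `ols_regress()`. The data are constructed for illustration (the `toy` data frame, group labels, and noise values are assumptions), so the numbers will not match the celebrity output.

```r
# Constructed (not real) data: two groups with clearly different lines
toy <- data.frame(
  grp = rep(c("A", "B"), each = 8),
  x   = rep(1:8, times = 2)
)
noise_a <- c(0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3)  # fixed, small
noise_b <- rev(noise_a)
toy$y <- ifelse(toy$grp == "A", 10 + 2 * toy$x, 40 - 1 * toy$x) +
  c(noise_a, noise_b)

fit <- lm(y ~ x * grp, data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients
p_values   <- coef_table[, "Pr(>|t|)"]

p_intercept_diff <- p_values["grpB"]     # test of difference in intercepts
p_slope_diff     <- p_values["x:grpB"]   # test of difference in slopes

# Flag terms that meet the usual 0.05 cutoff
data.frame(p_value = signif(p_values, 3), significant = p_values < 0.05)
```

The row for the category term tests the intercept difference from baseline, and the row for the interaction term tests the slope difference from baseline.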

Movie Genres and Costs

  • Is length of a movie (Runtime) a good predictor of the movie budget?

  • Does the relationship between movie length and budget differ by movie genre?

Import Movie Data

movies <- read_csv("data/movies.csv", show_col_types = F) |>  # Import and examine data
  glimpse(width=75)
Rows: 20
Columns: 4
$ Movie   <chr> "Paranormal Activity 3", "The Others", "The Lincoln Lawye…
$ Genre   <chr> "Suspense / Horror", "Suspense / Horror", "Suspense / Hor…
$ Budget  <dbl> 5, 17, 40, 30, 37, 27, 25, 16, 85, 12, 195, 125, 140, 90,…
$ Runtime <dbl> 83, 104, 118, 106, 114, 105, 99, 91, 168, 84, 148, 131, 1…
movies |> select(Genre) |> table()  # use table to examine categories in the data
Genre
           Action Suspense / Horror 
               10                10 

Examine Correlations in Movie Data

Note: If categories have different slopes, correlations for the whole dataset can be misleading.

movies |> select(Budget, Runtime) |> cor() |> round(2) # all data
        Budget Runtime
Budget    1.00    0.76
Runtime   0.76    1.00
movies |> filter(Genre == "Action") |>                 # action movies
  select(Budget, Runtime) |> cor() |> round(2)
        Budget Runtime
Budget    1.00    0.95
Runtime   0.95    1.00
movies |> filter(Genre == "Suspense / Horror") |>      # suspense/horror movies
  select(Budget, Runtime) |> cor() |> round(2)
        Budget Runtime
Budget    1.00    0.99
Runtime   0.99    1.00

Explore and Plot Data

  • Scatter plot shows that a regression model should be created with

    • Different intercepts for each genre

    • Different slopes for each genre

Interactive Model Plot - Movie Data

Regression Model

Again, we can examine and interpret the regression model output.


(movie_interaction_ols<- ols_regress(Budget ~ Runtime + Genre + 
                                       Runtime*Genre, data=movies, iterm = T))

Abridged Output

💥 Lecture 14 In-class Exercises - Q3 💥

What is the intercept term (estimated beta) for the Suspense / Horror SLR model?

  • Specify answer to two decimal places.

Abridged Output

  • GenreSuspense / Horror term: Difference from Action (baseline) model Intercept

  • Runtime:GenreSuspense / Horror term: Difference from Action model Slope

    • Budget = (-286.67 + 218.75) + (3.25 - 2.35)*Runtime

    • Suspense / Horror SLR Model: Budget = ____ + 0.9*Runtime

Determining Statistical Significance

Conclusions from Movie Genre Model

Abridged Output

  • P-value for difference in intercepts (GenreSuspense / Horror): < 0.001

    • Intercepts for these two distinct genre SLR models are _____ (Next Question).
  • P-value for difference in slopes (Runtime:GenreSuspense / Horror term): < 0.001

    • Slopes for these two distinct genre SLR models are _____ (Next Question).

  • These interpretations are in agreement with what we can easily see in the Interactive Model Plot.

💥 Lecture 14 In-class Exercises - Q4 💥

Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.


Fill in the blank:

The smaller the P-value, the _____ evidence there is that the Beta coefficient is non-zero and the term is useful to the model.

HW 6 - Questions 7 - 16

  • Dataset has three categories of Diamonds:

    • Colorless

    • Faint yellow

    • Nearly colorless


  • Colorless is first alphabetically so that is the baseline category by default.

    • Each color category has unique intercept.

    • Each color category has a unique slope.

    • The interactive model plot and abridged regression output are provided.

    • All Blackboard questions can be answered using the provided .html output and verified using the provided interactive plot.
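
With three categories, the interaction model has six coefficients: a baseline intercept and slope, plus one intercept difference and one slope difference for each non-baseline category. The sketch below illustrates this with made-up data. The column names (`carat`, `price`, `color`) and all values are hypothetical stand-ins, not the actual HW 6 dataset or its variable names.

```r
# Hypothetical stand-in data (names and values made up for illustration)
toy <- data.frame(
  color = rep(c("Colorless", "Faint yellow", "Nearly colorless"), each = 4),
  carat = rep(c(0.5, 1.0, 1.5, 2.0), times = 3)
)

# Give each color its own exact line so the coefficients are easy to check
int_by_color   <- c("Colorless" = 1, "Faint yellow" = 3, "Nearly colorless" = 2)
slope_by_color <- c("Colorless" = 8, "Faint yellow" = 4, "Nearly colorless" = 6)
toy$price <- unname(int_by_color[toy$color] +
                      slope_by_color[toy$color] * toy$carat)

fit <- lm(price ~ carat * color, data = toy)
b   <- coef(fit)

baseline <- levels(factor(toy$color))[1]   # first alphabetically
names(b)                                   # the six model terms

# Rebuild one intercept and one slope per color from the difference terms
intercepts <- unname(b["(Intercept)"] +
  c(0, b["colorFaint yellow"], b["colorNearly colorless"]))
slopes <- unname(b["carat"] +
  c(0, b["carat:colorFaint yellow"], b["carat:colorNearly colorless"]))

data.frame(color = names(int_by_color),
           intercept = round(intercepts, 3),
           slope = round(slopes, 3))
```

Each non-baseline line is always built from the baseline values plus that category's own difference terms, never from another non-baseline category's terms.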

Looking Ahead - What’s Next?

  • This week, ALL of the categorical models we have covered could be simplified to multiple SLRs, with same or different slopes.

  • ALL of the model terms have had P-values less than 0.05, so the terms were all useful.

  • There are many, many model options where these two facts are not true.

  • Time permitting, here’s a brief look at a dataset with many explanatory variables:

insurance <- read_csv("data/Insurance.csv", show_col_types=F) |>
  glimpse(width=75)
Rows: 1,338
Columns: 8
$ Charges    <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 37…
$ ln_charges <dbl> 9.734176, 7.453302, 8.400538, 9.998092, 8.260197, 8.23…
$ Age        <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56…
$ Sex        <chr> "female", "male", "male", "male", "male", "female", "f…
$ BMI        <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440…
$ Children   <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, …
$ Smoker     <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no",…
$ Region     <chr> "southwest", "southeast", "southeast", "northwest", "n…

Insurance Data Variable and Model Selection

  • There are 3 quantitative explanatory variables:

    • Age, BMI, and Children
  • There are 3 categorical explanatory variables:

    • Sex, Smoker, Region
  • There are literally hundreds of possible models including interaction terms.

  • Note that an interaction can also be between two quantitative variables.

  • You can also have interaction terms with three variables (but I try to avoid those).

  • How do we sort through all of the possible options?

    • Software helps us pare down all the possible models to a few choices.
    • Analyst then uses critical thinking and examination of data to determine final model.
  • Variable and Model Selection methods are the next set of topics.

Key Points from Today

  • Categorical Interaction Model

    • Separate SLR for each group.
    • BOTH slopes and intercepts can differ by category
    • We can test if interaction term (slope difference) is significant.
  • Next Topics

    • Comparing model goodness of fit
    • Introduction to variable selection
  • HW 6 is now available and is due on Wed. 3/6.

To submit an Engagement Question or Comment about material from Lecture 14, submit by midnight today (day of lecture) using the link under Lecture 14.