Categorical Regression - Interaction Model
2026-02-26
HW 5 is due 2/27/2026 - 3 day grace period
HW 6 is due 3/4/2026 - 2 day grace period
HW 7 and HW 8 will be due after break - Can be completed without working during break.
Quiz 2 will be on 3/26/2026 - Practice Questions will be posted after break.
Review Parallel Lines Model
Introduce Interaction term and Interaction Model
Work through how to interpret model output
Introduce HW 6
Talk about next steps
| Price | Square_Feet | Remodeled |
|---|---|---|
| 554000 | 2702 | No |
| 484000 | 2378 | No |
| 391000 | 1846 | No |
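
Number of houses by Remodeled status:

| Remodeled | No | Yes |
|---|---|---|
| Count | 29 | 28 |

Correlation between Price and Square_Feet:

| | Price | Square_Feet |
|---|---|---|
| Price | 1.00 | 0.75 |
| Square_Feet | 0.75 | 1.00 |
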
Poll Everywhere - My User Name: penelopepoolereisenbies685
Based on the Parameter Estimates table for the specified categorical regression model, which category is the baseline category?
Model Summary
--------------------------------------------------------------------------
R 0.924 RMSE 32079.481
R-Squared 0.854 MSE 1029093133.260
Adj. R-Squared 0.848 Coef. Var 8.223
Pred R-Squared 0.836 AIC 1352.620
MAE 28012.693 SBC 1360.792
--------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
----------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
----------------------------------------------------------------------------------
Regression 341892090457.686 2 170946045228.843 157.37 0.0000
Residual 58658308595.823 54 1086264973.997
Total 400550399053.509 56
----------------------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------------
(Intercept) 137549.093 17620.447 7.806 0.000 102222.224 172875.963
Square_Feet 137.879 10.836 0.670 12.725 0.000 116.155 159.602
RemodeledYes 90917.216 8834.268 0.542 10.291 0.000 73205.575 108628.858
-----------------------------------------------------------------------------------------------------
In HW 6 and below I use R coding to format the output to make it easier to read.
The values are IDENTICAL to the unformatted output.
Note: Formatted Output will differ in appearance depending on where it is viewed, i.e. slides, html file, or .Qmd file.
| model | Beta | Std.Error | t | Sig |
|---|---|---|---|---|
| (Intercept) | 137549.09 | 17620.45 | 7.81 | 0 |
| Square_Feet | 137.88 | 10.84 | 12.72 | 0 |
| RemodeledYes | 90917.22 | 8834.27 | 10.29 | 0 |
On Tuesday we covered the Parallel Lines Model:
A Parallel Lines model has two X variables, one quantitative and one categorical variable.
Model estimates a separate SLR model for each category in the categorical variable.
Model assumes all categories have the same SLOPE.
Model estimates a separate INTERCEPT for each category.
Model output shows results of a hypothesis test to determine if each non-baseline category’s intercept is significantly different from baseline intercept.
By default R chooses baseline categories alphabetically
No is before Yes so un-Remodeled houses are the baseline
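As a rough illustration of the baseline rule above, the following plain-Python sketch mimics the default behavior described for R: sort the category labels, treat the first one as the baseline, and build a 0/1 indicator for every other category. The helper name `dummy_code` and the short list of labels are hypothetical and are not part of R or HW 6. R handles this coding internally.

```python
def dummy_code(values):
    """Return (baseline, indicators) using the alphabetical-baseline rule."""
    levels = sorted(set(values))      # category labels in alphabetical order
    baseline = levels[0]              # first label becomes the baseline
    indicators = {
        level: [1 if v == level else 0 for v in values]
        for level in levels[1:]       # one 0/1 column per non-baseline category
    }
    return baseline, indicators


# Made-up labels for illustration only (not the house data).
remodeled = ["No", "No", "Yes", "No", "Yes"]
baseline, indicators = dummy_code(remodeled)
print(baseline)            # No
print(indicators["Yes"])   # [0, 0, 1, 0, 1]
```

The indicator column plays the role of the `RemodeledYes` term in the output: it is 1 for remodeled houses and 0 for the baseline group. Python sorts by character code, so labels with mixed capitalization may sort differently than they do in R.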
Un-Remodeled (No) SLR Model:
Est. Price = 137549.093 + 137.879 * Square_Feet
Remodeled (Yes) SLR Model:
Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216
Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet
Est. Price = 228466.3 + 137.879 * Square_Feet
Interpretation:
Prices of remodeled houses are about 91 thousand dollars more than similar houses without remodeling, after accounting for square footage.
This difference is statistically significant (P-value < 0.001)
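The arithmetic behind the two parallel lines can be checked with a short Python sketch. The coefficients are copied from the Parameter Estimates table. The function and variable names are made up for this illustration, and the two house sizes are arbitrary.

```python
# Coefficients copied from the Parameter Estimates table.
INTERCEPT = 137549.093      # baseline (un-remodeled) intercept
SLOPE_SQFT = 137.879        # shared slope for Square_Feet
REMODELED_YES = 90917.216   # shift in intercept for remodeled houses


def estimated_price(square_feet, remodeled):
    """Parallel lines model: same slope, intercept shifted when remodeled."""
    shift = REMODELED_YES if remodeled else 0.0
    return INTERCEPT + shift + SLOPE_SQFT * square_feet


remodeled_intercept = INTERCEPT + REMODELED_YES
print(round(remodeled_intercept, 1))   # 228466.3

# Because the slopes are equal, the gap between the lines never changes.
for sqft in (1500, 2500):              # arbitrary sizes for illustration
    gap = estimated_price(sqft, True) - estimated_price(sqft, False)
    print(sqft, round(gap, 3))         # gap is 90917.216 at both sizes
```

The constant gap is what "parallel" means here: the remodeling effect is the same at every square footage.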
This part of HW 6 examines data similar to the House-Remodel data examined in Lecture 13 and the review question.
The dataset is smaller and the numbers are different, but the questions are essentially the same.
The categorical models covered so far assume that the SLR models for all categories have the same slope.
How do we examine that assumption?
For example:
In the celebrity data in Lecture 13, earnings decreased as celebrities got older.
Slope was assumed to be IDENTICAL for both males and females.
That may not be true for all celebrities.
In the following small dataset, we will look at male celebrities only and examine whether actors' and athletes' earnings follow the same trend.
| Celebrity | Earnings | Age | Profession |
|---|---|---|---|
| Jim Parsons | 29 | 44 | Actor |
| Johnny Depp | 48 | 53 | Actor |
| Tom Cruise | 53 | 55 | Actor |
| Leonardo Dicaprio | 29 | 43 | Actor |
| Jackie Chan | 61 | 62 | Actor |
| Mark Wahlberg | 32 | 45 | Actor |

Number of celebrities by Profession:

| Profession | Actor | Athlete |
|---|---|---|
| Count | 8 | 8 |

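Note: If categories have different slopes, correlations for the whole dataset will be misleading.

All celebrities (both professions):

| | Earnings | Age |
|---|---|---|
| Earnings | 1.00 | -0.46 |
| Age | -0.46 | 1.00 |

Actors only (positive trend):

| | Earnings | Age |
|---|---|---|
| Earnings | 1.00 | 0.99 |
| Age | 0.99 | 1.00 |

Athletes only (negative trend):

| | Earnings | Age |
|---|---|---|
| Earnings | 1.00 | -0.98 |
| Age | -0.98 | 1.00 |
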
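To see why a pooled correlation can mislead, here is a minimal Python sketch on made-up toy numbers. These are NOT the celebrity data, and the helper `pearson_r` is written out by hand for illustration. One group trends perfectly upward, the other perfectly downward, and the pooled correlation hides both trends.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation computed from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up toy data: two groups measured at the same x values.
x = [1, 2, 3, 4]
y_up = [2, 4, 6, 8]      # group with a positive slope
y_down = [8, 6, 4, 2]    # group with a negative slope

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_pooled = pearson_r(x + x, y_up + y_down)   # both groups combined
print(r_up, r_down, r_pooled)                # 1.0 -1.0 0.0
```

The same pattern appears in the matrices above in a milder form: strong correlations within each profession, but a much weaker one when the groups are combined.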
Now that we understand the data and linear trends, we can examine and interpret the regression model output.
Model Summary
--------------------------------------------------------------
R 0.987 RMSE 2.640
R-Squared 0.974 MSE 6.968
Adj. R-Squared 0.967 Coef. Var 6.058
Pred R-Squared 0.951 AIC 86.467
MAE 2.265 SBC 90.330
--------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 4131.949 3 1377.316 148.246 0.0000
Residual 111.489 12 9.291
Total 4243.437 15
---------------------------------------------------------------------
Parameter Estimates
------------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
------------------------------------------------------------------------------------------------------
(Intercept) -50.297 9.054 -5.555 0.000 -70.023 -30.571
Age 1.824 0.179 0.983 10.170 0.000 1.433 2.215
ProfessionAthlete 227.218 12.389 0.263 18.340 0.000 200.224 254.212
Age:ProfessionAthlete -5.063 0.293 -1.487 -17.294 0.000 -5.701 -4.425
------------------------------------------------------------------------------------------------------
Baseline category is first alphabetically
Actor comes before Athlete in the alphabet so Actor is the baseline category.
Actor SLR Model:
Earnings = -50.297 + 1.824*Age
ProfessionAthlete term: Difference from Actor (baseline) model Intercept
Age:ProfessionAthlete term: Difference from Actor model Slope
Athlete SLR Model requires some calculations:
Earnings = -50.297 + 1.824*Age + 227.218 - 5.063*Age
Earnings = (-50.297 + 227.218) + (1.824 - 5.063)*Age
Earnings = 176.921 - ____*Age
Athlete SLR Model: Earnings = 176.921 - ____*Age
What is the slope term (estimated beta for Age) for the Athlete SLR model?
Specify answer to two decimal places.
In this case, we don’t need to examine the P-values because the model differences between groups are so clear (but we will).
The two intercepts and two slopes are VERY different.
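The interaction-model algebra can be sketched in Python as follows. The four coefficients are copied from the Parameter Estimates table. The function names are hypothetical and chosen only for this illustration. Running it is one way to check a hand calculation of the Athlete line.

```python
# Coefficients copied from the Parameter Estimates table.
b_intercept = -50.297     # (Intercept): Actor (baseline) intercept
b_age = 1.824             # Age: Actor (baseline) slope
b_athlete = 227.218       # ProfessionAthlete: difference in intercepts
b_age_athlete = -5.063    # Age:ProfessionAthlete: difference in slopes


def group_line(is_athlete):
    """Return (intercept, slope) of the SLR line for one profession."""
    d = 1 if is_athlete else 0            # 0 = Actor (baseline), 1 = Athlete
    return b_intercept + d * b_athlete, b_age + d * b_age_athlete


def estimated_earnings(age, is_athlete):
    """Estimated Earnings from the interaction model at a given age."""
    intercept, slope = group_line(is_athlete)
    return intercept + slope * age


actor_intercept, actor_slope = group_line(False)
athlete_intercept, athlete_slope = group_line(True)
print("Actor:  ", round(actor_intercept, 3), round(actor_slope, 3))
print("Athlete:", round(athlete_intercept, 3), round(athlete_slope, 3))
```

Unlike the parallel lines model, the interaction term changes the slope as well as the intercept, so each profession gets its own line.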
Reminder of Hypothesis Testing concepts:
The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.
If this sentence is not clear to you, you are responsible for reviewing the Review Materials on Hypothesis Tests and Significance Tests (and other related topics):
P-value for difference in intercepts (ProfessionAthlete): < 0.001
P-value for difference in slopes (Age:ProfessionAthlete): < 0.001
These interpretations are in agreement with what we can easily see in the Interactive Model Plot.
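As a hedged check on where the t statistic and confidence limits in the output come from, the sketch below recomputes them for the Age:ProfessionAthlete term. The estimate and standard error are copied from the table. The critical value 2.179 is an assumption taken from a standard t-table (two-sided 5% level, 12 residual degrees of freedom, matching the Residual DF in the ANOVA table). The inputs are rounded, so the t statistic differs slightly from the printed -17.294.

```python
beta = -5.063        # estimate for Age:ProfessionAthlete (from the table)
std_error = 0.293    # its standard error (from the table)
T_CRIT = 2.179       # ASSUMPTION: t-table value, two-sided 5%, 12 df

t_stat = beta / std_error             # test statistic for H0: Beta = 0
lower = beta - T_CRIT * std_error     # lower 95% confidence limit
upper = beta + T_CRIT * std_error     # upper 95% confidence limit

print(round(t_stat, 2))               # about -17.28
print(round(lower, 3), round(upper, 3))

# If zero is outside the interval, the term is significant at the 0.05 level.
interval_excludes_zero = not (lower <= 0 <= upper)
print(interval_excludes_zero)         # True
```

A t statistic this far from zero goes with a very small P-value, which matches the "< 0.001" reported for the difference in slopes.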
Is length of a movie (Runtime) a good predictor of the movie budget?
Does the relationship between movie length and budget differ by movie genre?
| Movie | Genre | Budget | Runtime |
|---|---|---|---|
| Paranormal Activity 3 | Suspense / Horror | 5 | 83 |
| The Others | Suspense / Horror | 17 | 104 |
| The Lincoln Lawyer | Suspense / Horror | 40 | 118 |
| Fright Night | Suspense / Horror | 30 | 106 |

Number of movies by Genre:

| Genre | Action | Suspense / Horror |
|---|---|---|
| Count | 10 | 10 |

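Note: If categories have different slopes, correlations for the whole dataset will be misleading.

All movies (both genres):

| | Budget | Runtime |
|---|---|---|
| Budget | 1.00 | 0.76 |
| Runtime | 0.76 | 1.00 |

Within each genre (one matrix per genre):

| | Budget | Runtime |
|---|---|---|
| Budget | 1.00 | 0.95 |
| Runtime | 0.95 | 1.00 |

| | Budget | Runtime |
|---|---|---|
| Budget | 1.00 | 0.99 |
| Runtime | 0.99 | 1.00 |
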
Again, we can examine and interpret the regression model output.
What is the intercept term (estimated beta) for the Suspense / Horror SLR model?
GenreSuspense / Horror term: Difference from Action (baseline) model Intercept
Runtime:GenreSuspense / Horror term: Difference from Action model Slope
Budget = (-286.67 + 218.75) + (3.25 - 2.35)*Runtime
Suspense / Horror SLR Model: Budget = ____ + 0.9*Runtime
In this case, we don’t need to examine the P-values because the model differences between groups are so clear (but we will).
The two intercepts and two slopes are VERY different.
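One way to see what the interaction term means is to look at the estimated gap between the two genres at different runtimes. This Python sketch uses only the two difference terms, with values read off the slide arithmetic above (218.75 and -2.35), because the full regression output is not reproduced here. The function name and the three runtimes are made up for illustration.

```python
# Difference terms as they appear in the slide arithmetic.
d_intercept = 218.75   # GenreSuspense / Horror: difference in intercepts
d_slope = -2.35        # Runtime:GenreSuspense / Horror: difference in slopes


def genre_gap(runtime):
    """Estimated Budget difference (Suspense / Horror minus Action)."""
    return d_intercept + d_slope * runtime


for runtime in (90, 100, 120):     # arbitrary runtimes in minutes
    print(runtime, round(genre_gap(runtime), 2))
```

In a parallel lines model this gap would be one fixed number. With an interaction term, the gap depends on Runtime, which is exactly what "different slopes" means.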
Reminder of Hypothesis Testing concepts:
The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.
If this sentence is not clear to you, you are responsible for reviewing the Review Materials on Hypothesis Tests and Significance Tests (and other related topics):
P-value for diff. in intercepts (GenreSuspense / Horror): < 0.001
_____ (Next Question).
P-value for diff. in slopes (Runtime:GenreSuspense / Horror term): < 0.001
_____ (Next Question).
These interpretations are in agreement with what we can easily see in the Interactive Model Plot.
The difference in intercepts and the difference in slopes between the model for Action movies and the model for Suspense / Horror movies are both _____.
A. statistically significant
B. statistically insignificant
Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.
Fill in the blank:
The smaller the P-value, the _____ evidence there is that the Beta coefficient is non-zero and the term is useful to the model.
Dataset has three categories of Diamonds:
Colorless, Faint yellow, and Nearly colorless
Colorless is first alphabetically so that is the baseline category by default.
Each color category has a unique intercept AND a unique slope.
The interactive model plot and abridged regression output are provided.
All Blackboard questions can be answered by rendering .qmd file to examine .html output.
Helpful TIP: In addition to other recommended options, change preview option (see next slide).
For HW 6 you do not have to write any R code.
Instead you are expected to correctly interpret provided output.
Quiz 2 will have similar output WITHOUT the interactive plots.
I have provided the HTML file but you are encouraged to edit the HW 6 file with your own notes and render it.
You can publish your rendered file using Rpubs and save the link.
Recall the House Remodel Data from Lecture 13
These data were clearly modeled by two parallel lines.
What happens if we add an interaction term to test for different slopes?
Abridged Output
This week, ALL of the categorical models could be simplified to multiple SLRs, with same or different slopes.
ALL of the variables have had P-values less than 0.05 so the terms were all useful (except on the previous slide).
There are many, many model options where these two facts are not true.
Time permitting, here’s a brief look at a dataset with many explanatory variables:
| Charges | ln_charges | Age | Sex | BMI | Children | Smoker | Region |
|---|---|---|---|---|---|---|---|
| 16884.924 | 9.734176 | 19 | female | 27.90 | 0 | yes | southwest |
| 1725.552 | 7.453303 | 18 | male | 33.77 | 1 | no | southeast |
| 4449.462 | 8.400538 | 28 | male | 33.00 | 3 | no | southeast |
There are 3 quantitative explanatory variables: Age, BMI, and Children.
There are 3 categorical explanatory variables: Sex, Smoker, and Region.
There are literally hundreds of possible models including interaction terms.
Note that an interaction can also be between two quantitative variables.
You can also have interaction terms with three variables (but I try to avoid those).
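To get a feel for how quickly the options multiply, this Python sketch lists the possible interaction terms. It assumes the six explanatory variables are the table columns other than Charges and ln_charges. The `A:B` labels imitate how interaction terms are named in the regression output.

```python
from itertools import combinations

# ASSUMPTION: Charges / ln_charges are responses; the rest are explanatory.
explanatory = ["Age", "Sex", "BMI", "Children", "Smoker", "Region"]

# Every pair and every trio of variables is a candidate interaction term.
two_way = [":".join(pair) for pair in combinations(explanatory, 2)]
three_way = [":".join(trio) for trio in combinations(explanatory, 3)]

print(len(two_way))     # 15
print(len(three_way))   # 20
print(two_way[:3])      # ['Age:Sex', 'Age:BMI', 'Age:Children']
```

Each candidate term can be included or left out, so the number of distinct models grows very fast, which is why systematic selection methods are needed.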
How do we sort through all of the possible options?
Model and variable selection methods are the next set of topics.
Categorical Interaction Model
Next Topics
HW 6 is now available and is due on Wed. 3/4.
Quiz 2 is 3/26
To submit an Engagement Question or Comment about material from Lecture 14: Submit it by midnight today (day of lecture).