Based on the Parameter Estimates table for the specified categorical regression model, which category is the baseline category?
Model Summary
--------------------------------------------------------------------------
R                      0.924      RMSE             32079.481
R-Squared              0.854      MSE         1029093133.260
Adj. R-Squared         0.848      Coef. Var            8.223
Pred R-Squared         0.836      AIC               1352.620
MAE                28012.693      SBC               1360.792
--------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
----------------------------------------------------------------------------------
                Sum of Squares    DF         Mean Square         F      Sig.
----------------------------------------------------------------------------------
Regression    341892090457.686     2    170946045228.843    157.37    0.0000
Residual       58658308595.823    54      1086264973.997
Total         400550399053.509    56
----------------------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------------
model                Beta  Std. Error   Std. Beta           t         Sig       lower       upper
-----------------------------------------------------------------------------------------------------
(Intercept)    137549.093   17620.447                   7.806       0.000  102222.224  172875.963
Square_Feet       137.879      10.836       0.670      12.725       0.000     116.155     159.602
RemodeledYes    90917.216    8834.268       0.542      10.291       0.000   73205.575  108628.858
-----------------------------------------------------------------------------------------------------
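As a quick sanity check on how the pieces of this output fit together, here is a minimal Python sketch (the course itself uses R; Python is used here only for plain arithmetic). It assumes the standard definitions F = (Mean Square Regression) / (Mean Square Residual), R-Squared = SS Regression / SS Total, and t = Beta / Std. Error. The variable names are made up for illustration, and small differences from the table come from the rounding in the printed values.

```python
# Values copied from the ANOVA and Parameter Estimates tables above.
ss_regression = 341892090457.686
ss_residual = 58658308595.823
df_regression, df_residual = 2, 54

# Mean squares and the overall F statistic.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual

# R-Squared and Adjusted R-Squared (total DF = n - 1 = 56).
df_total = df_regression + df_residual
r_squared = ss_regression / (ss_regression + ss_residual)
adj_r_squared = 1 - (1 - r_squared) * df_total / df_residual

# term: (Beta, Std. Error)
estimates = {
    "(Intercept)": (137549.093, 17620.447),
    "Square_Feet": (137.879, 10.836),
    "RemodeledYes": (90917.216, 8834.268),
}
t_stats = {term: beta / se for term, (beta, se) in estimates.items()}

print(f"F = {f_stat:.2f}")
print(f"R-Squared = {r_squared:.3f}, Adj. R-Squared = {adj_r_squared:.3f}")
for term, t in t_stats.items():
    print(f"{term}: t = {t:.3f}")
```

The recomputed values should agree with the printed output up to rounding of the displayed Beta and Std. Error columns.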
A Comment About Formatted Output
In HW 6 and below, I use R code to format the output to make it easier to read.
The values are IDENTICAL to the unformatted output.
Note: Formatted output will differ in appearance depending on where it is viewed (slides, HTML file, or .qmd file).
Formatted Abridged Output (Similar to HW 6)
model                Beta    Std.Error        t    Sig
(Intercept)     137549.09     17620.45     7.81      0
Square_Feet        137.88        10.84    12.72      0
RemodeledYes     90917.22      8834.27    10.29      0
Quick Review of Categorical Regression
On Tuesday we covered the Parallel Lines Model:
A Parallel Lines model has two X variables, one quantitative and one categorical variable.
The model estimates a separate SLR model for each category of the categorical variable.
The model assumes all categories have the same SLOPE.
The model estimates a separate INTERCEPT for each category.
The model output shows the results of a hypothesis test of whether each non-baseline category’s intercept is significantly different from the baseline intercept.
Interactive Plot of House Remodel Data
Calculations from House Model
By default, R chooses the baseline category alphabetically.
"No" comes before "Yes", so un-remodeled houses (Remodeled = No) are the baseline.
Un-Remodeled (No) SLR Model:
Est. Price = 137549.093 + 137.879 * Square_Feet
Remodeled (Yes) SLR Model:
Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216
Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet
Est. Price = 228466.3 + 137.879 * Square_Feet
Interpretation:
Prices of remodeled houses are about 91 thousand dollars more than similar houses without remodeling, after accounting for square footage.
This difference is statistically significant (P-value < 0.001)
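The calculations above can be sketched in a few lines of Python (again, only as arithmetic; the lecture's own work is done in R). The function name `estimated_price` and the 2000 square foot house are hypothetical choices for illustration. The coefficients are copied from the Parameter Estimates table.

```python
# Coefficients copied from the Parameter Estimates table.
INTERCEPT = 137549.093      # baseline (Remodeled = No) intercept
SLOPE = 137.879             # shared slope for Square_Feet
REMODELED_YES = 90917.216   # intercept shift for Remodeled = Yes


def estimated_price(square_feet, remodeled):
    """Estimated price under the parallel lines model."""
    shift = REMODELED_YES if remodeled else 0.0
    return INTERCEPT + shift + SLOPE * square_feet


# 2000 sq ft is a hypothetical house size chosen only for illustration.
size = 2000
price_no = estimated_price(size, remodeled=False)
price_yes = estimated_price(size, remodeled=True)

print(f"Remodeled intercept: {INTERCEPT + REMODELED_YES:.1f}")
print(f"Un-remodeled estimate at {size} sq ft: {price_no:.2f}")
print(f"Remodeled estimate at {size} sq ft:    {price_yes:.2f}")
print(f"Difference: {price_yes - price_no:.3f}")
```

Because the slope is shared, the difference between the two estimates equals the RemodeledYes coefficient at every house size, which is what "parallel lines" means.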
HW 6 - Questions 1 - 6
This part of HW 6 examines data similar to the House-Remodel data examined in Lecture 13 and the review question.
The dataset is smaller and the numbers are different, but the questions are essentially the same.
Categorical Regression with Interactions
The categorical models covered so far assume that the SLR models for all categories have the same slope.
How do we examine that assumption?
For example:
In the celebrity data in Lecture 13, earnings decreased as celebrities got older.
The slope was assumed to be IDENTICAL for males and females.
That may not be true for all celebrities.
In the following small dataset, we will look at male celebrities only and examine whether actors' and athletes' salaries follow the same trend.
In this case, we don’t need to examine the P-values because the model differences between groups are so clear (but we will).
The two intercepts and two slopes are VERY different.
Reminder of Hypothesis Testing concepts:
The SMALLER the P-value, the more evidence there is that the true value of Beta (model term) is not zero.
If this sentence is not clear to you, you are responsible for reviewing the Review Materials on Hypothesis Tests and Significance Tests (and other related topics).
P-value for diff. in intercepts (GenreSuspense / Horror): < 0.001
Intercepts for these two distinct genre SLR models are _____ (Next Question).
P-value for diff. in slopes (Runtime:GenreSuspense / Horror term): < 0.001
Slopes for these two distinct genre SLR models are _____ (Next Question).
These interpretations are in agreement with what we can easily see in the Interactive Model Plot.
💥 Lecture 14 In-class Exercises - Q4 💥
Abridged Output
The difference in intercepts and the difference in slopes between the model for Action movies and the model for Suspense / Horror movies are both _____.
A. statistically significant
B. statistically insignificant
💥 Lecture 14 In-class Exercises - Q5 💥
Recall that the cutoff for determining significance of a regression model term based on its P-value is 0.05.
Fill in the blank:
The smaller the P-value, the _____ evidence there is that the Beta coefficient is non-zero and the term is useful to the model.
HW 6 - Questions 7 - 16
Dataset has three categories of Diamonds:
Colorless, Faint yellow, and Nearly colorless
Colorless is first alphabetically so that is the baseline category by default.
Each color category has a unique intercept AND a unique slope.
The interactive model plot and abridged regression output are provided.
All Blackboard questions can be answered by rendering the .qmd file and examining the .html output.
Helpful TIP: In addition to other recommended options, change preview option (see next slide).
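To illustrate how a unique intercept and slope for each category come out of an interaction model, here is a minimal Python sketch. Every coefficient value below is HYPOTHETICAL and is not taken from the HW 6 output. The variable names `x` (the quantitative variable) and `Color` are also assumptions. The term names only mimic the pattern R uses, such as the Runtime:GenreSuspense / Horror term seen earlier.

```python
# HYPOTHETICAL coefficients for illustration only (NOT the HW 6 values).
# "x" stands in for the quantitative variable; "Color" is an assumed name.
coefs = {
    "(Intercept)": 1000.0,            # baseline (Colorless) intercept
    "x": 500.0,                       # baseline (Colorless) slope
    "ColorFaint yellow": -200.0,      # intercept difference vs. baseline
    "ColorNearly colorless": -100.0,  # intercept difference vs. baseline
    "x:ColorFaint yellow": -50.0,     # slope difference vs. baseline
    "x:ColorNearly colorless": -25.0, # slope difference vs. baseline
}


def category_line(category, baseline="Colorless"):
    """Return (intercept, slope) of the SLR model for one category."""
    intercept = coefs["(Intercept)"]
    slope = coefs["x"]
    if category != baseline:
        # Non-baseline categories add their differences to the baseline.
        intercept += coefs[f"Color{category}"]
        slope += coefs[f"x:Color{category}"]
    return intercept, slope


for cat in ["Colorless", "Faint yellow", "Nearly colorless"]:
    b0, b1 = category_line(cat)
    print(f"{cat}: Est. y = {b0} + {b1} * x")
```

The same bookkeeping applies to the real output: the baseline line is read directly from the (Intercept) and slope rows, and each other category's line adds its own intercept and interaction terms.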
HW 6 - Change HTML Preview Option
For HW 6 you do not have to write any R code.
Instead you are expected to correctly interpret provided output.
Quiz 2 will have similar output WITHOUT the interactive plots.
Change the following option in the Basic tab of the R Markdown options:
Show output preview in Viewer Pane
Looking Ahead - What’s Next?
This week, ALL of the categorical models could be simplified to multiple SLRs, with same or different slopes.
ALL of the variables have had P-values less than 0.05 so the terms were all useful.
There are many, many modeling situations where these two facts are not true.
Time permitting, here’s a brief look at a dataset with many explanatory variables:
Charges      ln_charges   Age   Sex      BMI     Children   Smoker   Region
16884.924    9.734176     19    female   27.90   0          yes      southwest
1725.552     7.453303     18    male     33.77   1          no       southeast
4449.462     8.400538     28    male     33.00   3          no       southeast
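The column name ln_charges suggests it is the natural log of Charges. That is an assumption based on the name, but the three rows shown are consistent with it, as this small Python check illustrates:

```python
import math

# (Charges, ln_charges) pairs copied from the three rows shown above.
rows = [
    (16884.924, 9.734176),
    (1725.552, 7.453303),
    (4449.462, 8.400538),
]

for charges, ln_charges in rows:
    # math.log is the natural (base e) logarithm.
    print(f"log({charges}) = {math.log(charges):.6f}  (table: {ln_charges})")
```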
Insurance Data Model and Variable Selection
There are 3 quantitative variables:
Age, BMI, and Children
There are 3 categorical variables:
Sex, Smoker, Region
There are literally hundreds of possible models including interaction terms.
Note that an interaction can also be between two quantitative variables.
You can also have interaction terms with three variables (but I try to avoid those).
How do we sort through all of the possible options?
Software helps us pare down all the possible models to a few choices.
Analyst then uses critical thinking and examination of data to determine final model.
Model and variable selection methods are the next set of topics.
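To get a feel for why the number of candidate models grows so quickly, here is a rough Python sketch that counts only two building blocks: the main-effects-only models and the possible two-way interaction terms. It is not a count of every possible model, and it says nothing about how selection software actually searches.

```python
from itertools import combinations
from math import comb

variables = ["Age", "BMI", "Children", "Sex", "Smoker", "Region"]

# Every non-empty subset of the six variables is a candidate
# main-effects-only model.
main_effect_models = sum(
    comb(len(variables), k) for k in range(1, len(variables) + 1)
)

# Every pair of variables is a possible two-way interaction term.
two_way_interactions = list(combinations(variables, 2))

print(main_effect_models)         # 63
print(len(two_way_interactions))  # 15
```

Each of the 15 interaction terms can be added to the models that contain both of its variables, so the list of candidates expands well beyond the 63 main-effects-only models.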
Key Points from Today
Categorical Interaction Model
Separate SLR for each group.
BOTH slopes and intercepts can differ by category.
We can test whether the interaction term (slope difference) is significant.
Next Topics
Comparing model goodness of fit
Introduction to variable selection
HW 6 is now available and is due on Wed. 3/6.
Date of Quiz 2 has been changed to Tuesday, 4/1.
To submit an Engagement Question or Comment about material from Lecture 14: Submit it by midnight today (day of lecture).