Bedrooms | Bathrooms | Price | Year_built | Address | City | State |
---|---|---|---|---|---|---|
4 | 4.5 | 1365000 | 2006 | 5442 GRANNY WHITE PIKE | BRENTWOOD | TN |
3 | 3.0 | 300000 | 1987 | 5608 HEARTHSTONE LN | BRENTWOOD | TN |
3 | 4.5 | 566920 | 1968 | 1706 OLD HICKORY BLVD | BRENTWOOD | TN |
5 | 5.0 | 650000 | 1977 | 1428 OLD HICKORY BLVD | BRENTWOOD | TN |
BUA 345 - Lecture 13
Categorical Regression - Parallel Lines Model
Housekeeping
HW 5 is due 2/26/2025 - 2 day grace period
- Demo videos are posted
Today’s plan
- Review of
  - SLR and MLR
  - Hypothesis testing in Regression
- Categorical Parallel Lines Model
In-class Polling (Session ID: bua345s25)
Review Question - Import data
- Recall The Tennessee Real Estate data:
Review Question - Natural Log transformation
- We will build an MLR model using the natural log of Price (ln_Price)
- This transformation is needed because Price is RIGHT-SKEWED.
Price | ln_Price | Bedrooms | Bathrooms | Year_built | Address | City | State |
---|---|---|---|---|---|---|---|
1365000 | 14.12666 | 4 | 4.5 | 2006 | 5442 GRANNY WHITE PIKE | BRENTWOOD | TN |
300000 | 12.61154 | 3 | 3.0 | 1987 | 5608 HEARTHSTONE LN | BRENTWOOD | TN |
566920 | 13.24797 | 3 | 4.5 | 1968 | 1706 OLD HICKORY BLVD | BRENTWOOD | TN |
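The `ln_Price` values above can be reproduced with R's `log()`, which computes the natural log by default. This is a minimal sketch using only the three prices shown in the table; the data frame name `homes` is hypothetical.

```r
# Three prices copied from the table above; `homes` is a hypothetical name
homes <- data.frame(Price = c(1365000, 300000, 566920))

# log() is the natural log (base e) unless another base is supplied
homes$ln_Price <- log(homes$Price)
round(homes$ln_Price, 5)

# exp() reverses the transformation and recovers the original prices
exp(homes$ln_Price)
```

The same `exp()` step is what back-transforms a model estimate made on the log scale into dollars.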
Review Question - Histograms
- Histogram of `Price` shows the distribution of the raw data is right-skewed with high outliers.
- Histogram of `ln_Price` shows the distribution of the transformed data is symmetric and normally distributed.
Lecture 13 In-class Exercises - Q1 - Review
Session ID: bua345s25
Back-transforming Model Estimates
Based on the model output, what is the estimated price of a house with 4 bedrooms and 3 bathrooms (rounded to the closest $1000)?
```{r review question incomplete R code, eval=F, echo=T}
(y_est <- ___ + 0.056*4 + 0.375*3) # fill in intercept from R output
(est_dollars <- exp(y_est)) # back_transform y estimate
# -3 is correct input to round to closest thousand
round(est_dollars, -3) # without piping
est_dollars |> round(-3) # with piping
```
- NOTE: All 3 steps above could be done with one line but it is helpful to break it down when learning.
Regression Terms - \(R^2\) and Adjusted \(R^2\)
R is the correlation coefficient, \(R_{XY}\)
Regression Output only shows absolute value of R.
\(R^2\) is \(R_{XY}^2\), the square of the correlation coefficient.
\(R^2\) is also called the coefficient of determination.
Meaning of \(R^2\) in SLR: Proportion of variability in y explained by X
Adjusted \(R^2\) adjusts \(R^2\) for number of explanatory (X) variables in model.
Much more to come about this.
Meaning of Adjusted \(R^2\) in MLR is a little less specific but it is similar to \(R^2\).
Other values will be covered in upcoming lectures.
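In SLR, the link between R and \(R^2\) can be checked directly. The sketch below uses R's built-in `mtcars` data as a stand-in (an assumption; it is not the lecture's data) and base R's `summary()` rather than the output format shown in these slides.

```r
# Stand-in data: built-in mtcars (NOT the lecture's dataset)
slr <- lm(mpg ~ wt, data = mtcars)

r_xy     <- cor(mtcars$wt, mtcars$mpg)    # correlation coefficient (negative here)
r_sq     <- summary(slr)$r.squared        # R-squared reported for the model
adj_r_sq <- summary(slr)$adj.r.squared    # adjusted for the number of X variables

c(R = r_xy, R_xy_squared = r_xy^2, R_squared = r_sq, Adj_R_squared = adj_r_sq)
```

Because the correlation in this stand-in example is negative, it also illustrates why an output that reports only the absolute value of R hides the direction of the relationship.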
Review of Parameter Estimates Output
`model` column lists the intercept and X variables in the model.
`Beta` column shows the estimate of the \(\beta\) coefficient for each variable in the model.
`Std. Error` shows the variability of each estimated Beta coefficient.
`t` = `Beta`/`Std. Error`, the test statistic for each Beta coefficient estimate.
`Sig` is the P-value for the hypothesis test of each Beta coefficient (\(H_0: \beta = 0\) vs. \(H_A: \beta \neq 0\)).
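The relationship between these columns can be verified by hand. The sketch below again uses built-in `mtcars` as a stand-in (an assumption, not the lecture's data); base R's `summary()` labels the columns `Estimate`, `Std. Error`, `t value`, and `Pr(>|t|)` instead of `Beta` and `Sig`.

```r
# Stand-in model on built-in mtcars (NOT the lecture's data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
est <- coef(summary(fit))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# t statistic = estimate / standard error
t_by_hand <- est[, "Estimate"] / est[, "Std. Error"]

# Two-sided P-value from the t distribution with residual degrees of freedom
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(fit), lower.tail = FALSE)

cbind(t_by_hand, t_reported = est[, "t value"],
      p_by_hand, p_reported = est[, "Pr(>|t|)"])
```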
Review of Parameter Estimates Output
- Reminder of Example Output (`Sig` is the P-value column):
- Recall Interpretation guidelines for P-value:
Types of Data - Review
Types of Data - More on Categorical Data
Categorical variables are categories that describe data observations
- Gender, Location, Hair Color, Eye Color, etc.
Ordinal Categories have an OBJECTIVE order:
Grades: A, B, C, D
College year: Freshman, Sophomore, Junior, Senior
Nominal Categories don’t have an objective order:
Location
Hair color
Gender
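In R, the nominal/ordinal distinction maps onto unordered and ordered factors. This is a minimal sketch; the example values are made up for illustration.

```r
# Nominal: categories with no objective order (values are illustrative)
hair <- factor(c("brown", "blond", "black", "brown"))

# Ordinal: supply the order explicitly and set ordered = TRUE
year <- factor(c("Junior", "Freshman", "Senior", "Sophomore"),
               levels = c("Freshman", "Sophomore", "Junior", "Senior"),
               ordered = TRUE)

levels(hair)      # alphabetical by default
levels(year)      # the order we supplied
year < "Junior"   # comparisons are only meaningful for ordered factors
```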
Data Examples - R Star Wars Dataset
Dataset of characters from Star Wars franchise
Type `?starwars` in the console to review data documentation.
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Le…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 1…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brow…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "ligh…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "b…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 5…
$ sex <chr> "male", "none", "none", "male", "female", "male", "fem…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "f…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", …
$ films <list> <"A New Hope", "The Empire Strikes Back", "Return of …
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>,…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced…
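A listing in this layout is what `glimpse()` from the dplyr package prints. Assuming that is how the summary above was produced, a minimal sketch is:

```r
library(dplyr)      # the starwars data ships with dplyr

glimpse(starwars)   # one line per variable: name, type, first few values
```

Variables marked `<chr>` are text and are the natural candidates for categorical analysis.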
Examining Data
A good way to examine categorical data variables is to examine how many observations are in each category.
For example, we can examine the Star Wars character data by species and gender
- There are WAY TOO MANY species…
gender
species feminine masculine
Aleena 0 1
Besalisk 0 1
Cerean 0 1
Chagrian 0 1
Clawdite 1 0
Droid 1 5
Dug 0 1
Ewok 0 1
Geonosian 0 1
Gungan 0 3
Human 9 26
Hutt 0 1
Iktotchi 0 1
Kaleesh 0 1
Kaminoan 1 1
Kel Dor 0 1
Mirialan 2 0
Mon Calamari 0 1
Muun 0 1
Nautolan 0 1
Neimodian 0 1
Pau'an 0 1
Quermian 0 1
Rodian 0 1
Skakoan 0 1
Sullustan 0 1
Tholothian 1 0
Togruta 1 0
Toong 0 1
Toydarian 0 1
Trandoshan 0 1
Twi'lek 1 1
Vulptereen 0 1
Wookiee 0 2
Xexto 0 1
Yoda's species 0 1
Zabrak 0 2
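A two-way count like the one above can be built with base R's `table()`. The sketch below is one way to do it, not necessarily the exact call used for the slide; `species_gender` is a hypothetical name.

```r
library(dplyr)   # for the starwars data

# Rows = species, columns = gender; observations with NA are dropped by default
species_gender <- table(starwars$species, starwars$gender,
                        dnn = c("species", "gender"))
species_gender
```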
Lecture 13 In-class Exercises - Q2
Session ID: bua345s25
Is species a nominal or ordinal variable?
Star Wars Example - Examining categorical Data
- `Human` is the most common species.
- We can filter the data to look at those characters only.
- For example, we can examine prevalence of each gender and eye color among the human characters.
eye_color
gender blue blue-gray brown dark hazel unknown yellow
feminine 3 0 4 0 1 1 0
masculine 9 1 12 1 1 0 2
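The filtering step can be sketched as follows, assuming dplyr's `filter()` is used to keep the human characters before counting; `humans` and `human_tab` are hypothetical names.

```r
library(dplyr)

# Confirm which species is most common
sort(table(starwars$species), decreasing = TRUE) |> head(3)

# Keep only human characters, then count gender x eye_color combinations
humans <- starwars |> filter(species == "Human")
human_tab <- table(humans$gender, humans$eye_color,
                   dnn = c("gender", "eye_color"))
human_tab
```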
Lecture 13 In-class Exercises - Q3
Session ID: bua345s25
Which R command is used to summarize the number of observations in each gender x eye_color combination?
NOTE: This useful command will also be used in HW 6.
eye_color
gender blue blue-gray brown dark hazel unknown yellow
feminine 3 0 4 0 1 1 0
masculine 9 1 12 1 1 0 2
Data Examples - GT cars dataset
Deluxe automobiles from the 2014-2017 period
Type `?gt::gtcars` in the console to see data documentation.
Rows: 47
Columns: 15
$ mfr <chr> "Ford", "Ferrari", "Ferrari", "Ferrari", "Ferrari", "…
$ model <chr> "GT", "458 Speciale", "458 Spider", "458 Italia", "48…
$ year <dbl> 2017, 2015, 2015, 2014, 2016, 2015, 2017, 2015, 2015,…
$ trim <chr> "Base Coupe", "Base Coupe", "Base", "Base Coupe", "Ba…
$ bdy_style <chr> "coupe", "coupe", "convertible", "coupe", "coupe", "c…
$ hp <dbl> 647, 597, 562, 562, 661, 553, 680, 652, 731, 949, 573…
$ hp_rpm <dbl> 6250, 9000, 9000, 9000, 8000, 7500, 8250, 8000, 8250,…
$ trq <dbl> 550, 398, 398, 398, 561, 557, 514, 504, 509, 664, 476…
$ trq_rpm <dbl> 5900, 6000, 6000, 6000, 3000, 4750, 5750, 6000, 6000,…
$ mpg_c <dbl> 11, 13, 13, 13, 15, 16, 12, 11, 11, 12, 21, 16, 11, 1…
$ mpg_h <dbl> 18, 17, 17, 17, 22, 23, 17, 16, 16, 16, 22, 22, 18, 2…
$ drivetrain <chr> "rwd", "rwd", "rwd", "rwd", "rwd", "rwd", "awd", "awd…
$ trsmn <chr> "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a",…
$ ctry_origin <chr> "United States", "Italy", "Italy", "Italy", "Italy", …
$ msrp <dbl> 447000, 291744, 263553, 233509, 245400, 198973, 29800…
Lecture 13 In-class Exercises - Q4
Session ID: bua345s25
Which variable in gt_cars, body style (`bdy_style`) or `year`, could be treated as ordinal?
Categorical Regression
Categorical variables can (and should) be used in linear regression models
If categories exist in the data and we ignore them, then we assume that the linear relationship is the SAME FOR all categories.
The following two examples illustrate the importance of adding a categorical variable to a regression model when needed.
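Before turning to the lecture's examples, the general pattern can be sketched with built-in data. The example below uses `mtcars` as a stand-in (an assumption; it is not one of the lecture's datasets) and recodes `am` as text so that R treats it as a category.

```r
# Stand-in data: built-in mtcars, with am recoded as a text category
cars <- mtcars
cars$transmission <- ifelse(cars$am == 1, "manual", "automatic")

fit_slr <- lm(mpg ~ wt, data = cars)                  # ignores the category
fit_cat <- lm(mpg ~ wt + transmission, data = cars)   # adds the category

coef(fit_slr)
coef(fit_cat)   # "transmissionmanual" = shift in intercept vs. baseline "automatic"
```

Adding the categorical variable with `+` keeps one shared slope and lets each category have its own intercept.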
Data Example - Celebrity Salaries Data
Many (not all) celebrities see a decrease in their annual income as they age.
There is a negative relationship between earnings and age.
Is this relationship the same for males and females?
Celebrity | Earnings | Age | Gender |
---|---|---|---|
Taylor Swift | 67 | 27 | Female |
Lady Gaga | 59 | 31 | Female |
Gisele Bundchen | 54 | 32 | Female |
Beyonce | 54 | 35 | Female |
Kim Kardashian | 51 | 36 | Female |
Sofia Vergara | 28 | 44 | Female |
Celebrity Salaries
Examining categories and Correlations
Celebrity Salaries Data - Examining Model Options
Option 1: SLR
Model assumes no difference between males and females.
In this case we use the Base R command for regression, `lm`.
Model created with `lm` can be used to create an interactive plot.
The interactive plot shows the model equation when the cursor is on the line.
\(\hat{y} = 136.89 - 2.23\times Age\)
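The `lm` call for Option 1 can be sketched with the six rows displayed in the data table earlier. The full dataset has more rows than are shown, so the estimates from this sketch will not match the equation above; `celeb6` is a hypothetical name.

```r
# Only the six rows displayed in the data table (the full dataset is larger),
# so these estimates will NOT match the slide's equation
celeb6 <- data.frame(
  Celebrity = c("Taylor Swift", "Lady Gaga", "Gisele Bundchen",
                "Beyonce", "Kim Kardashian", "Sofia Vergara"),
  Earnings  = c(67, 59, 54, 54, 51, 28),
  Age       = c(27, 31, 32, 35, 36, 44)
)

slr_fit <- lm(Earnings ~ Age, data = celeb6)   # Option 1: no Gender term
coef(slr_fit)                                  # intercept and slope

# Estimated earnings at age 30 from the fitted line
predict(slr_fit, newdata = data.frame(Age = 30))
```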
Celebrity Salaries Data - Examining Model Options
Option 2: Categorical Regression Model
SLR model is okay, but we can do better.
It is logical to create a model that specifies a difference in earnings between males and females.
We add `Gender` to the model to test if this difference is significant.
The interactive plot shows each model equation when the cursor is on the line.
Females: \(\hat{y} = 134.11 - 2.37\times Age\)
Males: \(\hat{y} = 149.5 - 2.37\times Age\)
Celebrity Salaries Data - MLR Model Output
We see the model equation (poorly formatted) for each gender, in the plot.
We can also get these equations from the model output, but it requires a little work.
Examine the model output:
Model Summary
--------------------------------------------------------------
R 0.987 RMSE 2.545
R-Squared 0.975 MSE 6.477
Adj. R-Squared 0.971 Coef. Var 4.959
Pred R-Squared 0.962 AIC 83.299
MAE 2.197 SBC 86.389
--------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 4007.300 2 2003.650 251.332 0.0000
Residual 103.637 13 7.972
Total 4110.937 15
---------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
--------------------------------------------------------------------------------------------
(Intercept) 134.112 4.141 32.384 0.000 125.165 143.059
Age -2.370 0.114 -0.918 -20.710 0.000 -2.617 -2.123
GenderMale 15.383 1.420 0.480 10.830 0.000 12.315 18.452
--------------------------------------------------------------------------------------------
Getting Model Equations from Regression Output
By default R chooses baseline categories alphabetically.
`Female` is before `Male`, so `Female` is the baseline.
Female SLR Model:
Est. Earnings = 134.112 - 2.37 * Age
Male SLR Model:
Est. Earnings = 134.112 - 2.37 * Age + 15.383
Est. Earnings = 134.112 + 15.383 - 2.37 * Age
Est. Earnings = 149.495 - 2.37 * Age
The difference between the intercepts for Females and Males is shown in the model output.
Difference in intercepts is labeled with the name of the categorical variable and category.
Difference (increase) for Males is labeled `GenderMale` and equals `15.383`.
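The arithmetic above can be checked in R using the coefficients copied from the Parameter Estimates table. This is a minimal sketch; `b` and `est_earnings()` are hypothetical names introduced for illustration.

```r
# Coefficients copied from the Parameter Estimates table
b <- c(Intercept = 134.112, Age = -2.370, GenderMale = 15.383)

# Hypothetical helper: estimated earnings for a given age and gender
est_earnings <- function(age, gender) {
  is_male <- as.numeric(gender == "Male")   # 0 for baseline (Female), 1 for Male
  unname(b["Intercept"] + b["GenderMale"] * is_male + b["Age"] * age)
}

b[["Intercept"]]                        # Female intercept
b[["Intercept"]] + b[["GenderMale"]]    # Male intercept
est_earnings(40, c("Female", "Male"))   # same slope, estimates 15.383 apart
```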
Lecture 13 In-class Exercises - Q5
Session ID: bua345s25
Based on our categorical regression model, is the difference between male and female earnings (approx. $15M) statistically significant?
HINT: Look at the p-value for the `GenderMale` term in the model to answer this question.
Parameter Estimates Table
Data Example - House Remodeling Data
- What is the effect of remodeling on house selling price?
Price | Square_Feet | Remodeled |
---|---|---|
554000 | 2702 | No |
484000 | 2378 | No |
391000 | 1846 | No |
354000 | 1820 | No |
410000 | 1794 | No |
349000 | 1768 | No |
House Remodeling Data
Examine Categories and Correlations
- What is the effect of remodeling on house selling price?
Remodeled
No Yes
29 28
Price Square_Feet
Price 1.00 0.75
Square_Feet 0.75 1.00
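Counts and correlations like these come from `table()` and `cor()`. The sketch below uses only the six rows displayed in the data table earlier, all of which are un-remodeled, so its output will not match the full-data results above; `house6` is a hypothetical name.

```r
# Only the six displayed rows (all "No"); the full data has 57 houses,
# so this output will NOT match the counts and correlations above
house6 <- data.frame(
  Price       = c(554000, 484000, 391000, 354000, 410000, 349000),
  Square_Feet = c(2702, 2378, 1846, 1820, 1794, 1768),
  Remodeled   = rep("No", 6)
)

table(house6$Remodeled, dnn = "Remodeled")             # count per category
round(cor(house6[, c("Price", "Square_Feet")]), 2)     # correlation matrix
```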
House Remodeling Data - Examining Model Options
Option 1: SLR
SLR model assumes no difference due to remodeling
Again, we use the Base R command for regression, `lm`.
Model created with `lm` can be used to create an interactive plot.
Interactive plot shows the model equation when the cursor is on the line.
House Remodeling Data - Examining Model Options
Option 2: Categorical Regression Model
SLR model is okay, but there is probably a difference between `Remodeled` and un-`Remodeled` houses.
To test for that difference we add the categorical variable `Remodeled` to the model.
The interactive plot shows each model equation when the cursor is on the line.
House Remodel data - MLR Model Output
We can see the model equation (poorly formatted) for each category, in the plot.
We can also get these equations from the model output, but it requires a little work.
Examine the model output:
Model Summary
--------------------------------------------------------------------------
R 0.924 RMSE 32079.481
R-Squared 0.854 MSE 1029093133.260
Adj. R-Squared 0.848 Coef. Var 8.223
Pred R-Squared 0.836 AIC 1352.620
MAE 28012.693 SBC 1360.792
--------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
----------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
----------------------------------------------------------------------------------
Regression 341892090457.686 2 170946045228.843 157.37 0.0000
Residual 58658308595.823 54 1086264973.997
Total 400550399053.509 56
----------------------------------------------------------------------------------
Parameter Estimates
-----------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-----------------------------------------------------------------------------------------------------
(Intercept) 137549.093 17620.447 7.806 0.000 102222.224 172875.963
Square_Feet 137.879 10.836 0.670 12.725 0.000 116.155 159.602
RemodeledYes 90917.216 8834.268 0.542 10.291 0.000 73205.575 108628.858
-----------------------------------------------------------------------------------------------------
Getting Model Equations from Regression Output
By default R chooses baseline categories alphabetically.
`No` is before `Yes`, so un-`Remodeled` houses are the baseline.
un-Remodeled SLR Model:
Est. Price = 137549.093 + 137.879 * Square_Feet
Remodeled SLR Model:
Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216
Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet
Est. Price = _____ + 137.879 * Square_Feet
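The reason this is called a parallel lines model can be seen by comparing estimates at several house sizes. This is a minimal sketch using the coefficients from the Parameter Estimates table; `b`, `est_price()`, and the chosen square footages are hypothetical.

```r
# Coefficients copied from the Parameter Estimates table
b <- c(Intercept = 137549.093, Square_Feet = 137.879, RemodeledYes = 90917.216)

# Hypothetical helper: estimated price by size and remodel status
est_price <- function(sqft, remodeled) {
  is_yes <- as.numeric(remodeled == "Yes")   # 0 for baseline (No), 1 for Yes
  unname(b["Intercept"] + b["RemodeledYes"] * is_yes + b["Square_Feet"] * sqft)
}

sqft <- c(1500, 2000, 2500)   # illustrative sizes
gap <- est_price(sqft, "Yes") - est_price(sqft, "No")
gap   # the same value at every size, which is why the lines are parallel
```

The constant gap is the `RemodeledYes` coefficient: the model shifts the line up or down for a category but never changes its slope.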
Lecture 13 In-class Exercises - Q6
Session ID: bua345s25
The difference between the intercepts for `Remodeled` and un-`Remodeled` houses is shown in the model output.
Difference in intercepts is labeled with the name of the categorical variable and category.
Difference for remodeling is labeled `RemodeledYes` and equals `90917.216`.
What is the intercept for the prices of `Remodeled` houses in the Categorical Regression model (round to the closest thousand, $K)?
Lecture 13 In-class Exercises - Q7
Session ID: bua345s25
Based on our categorical regression model, is the difference in selling price between remodeled (`Remodeled = Yes`) and un-remodeled (`Remodeled = No`) homes statistically significant?
Parameter Estimates Table
Key Points from Today
Categorical Parallel Lines Model
- Separate SLR model for each category.
- Modeling categories simultaneously with one model is
- more efficient
- more accurate
Lecture 14: Similar model BUT each category has a different slope
HW 5 is due on 2/26
HW 6 will be posted on 2/27 and due on 3/5.
First set of data in HW 6 is almost identical to `house_remodel` data.
To submit an Engagement Question or Comment about material from Lecture 13: Submit it by midnight today (day of lecture).