2024-02-22
Today’s plan 📋
Review of SLR and MLR model assumptions
Review of Normal Distribution
Review of LN Transformation
SLR Model Output and Multiple Linear Regression (MLR)
Examining Regression Model Output
Understanding hypotheses being tested
Interpreting regression model output
Introduction to Multiple Linear Regression
Adding to a model
Interpreting model output
Working Through HW 5
Review: For simple linear regression, there must be a linear relationship between X, the explanatory variable, and Y, the response variable.
The response variable must be approximately normally distributed.
Recall that normally distributed means symmetric and bell-shaped.
What if it’s not?
One common solution is to transform the response variable.
Financial data such as real estate data, prices, etc. are commonly right-skewed.
A good transformation for right-skewed data is the Natural Log (LN) Transformation.
In HW 5 we work through:
How the LN transformation ‘normalizes’ the distribution of the response.
How to ‘back-transform’ model results to return to original scale of the data, e.g. US dollars.
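Both steps can be sketched in a few lines of R. This is a minimal illustration using simulated right-skewed values, not the HW 5 data; the names `price` and `ln_price` are made up for the example.

```r
# Simulated right-skewed "prices" (NOT the HW 5 data). rlnorm() draws from a
# log-normal distribution, which is right-skewed by construction.
set.seed(1)
price <- rlnorm(500, meanlog = 12, sdlog = 0.5)

# LN transformation: log() in R is the natural log.
ln_price <- log(price)

# Compare the two distributions side by side.
par(mfrow = c(1, 2))
hist(price, main = "Original scale", xlab = "price (dollars)")
hist(ln_price, main = "LN scale", xlab = "ln(price)")

# Back-transform: exp() undoes log(), returning values to dollars.
back <- exp(ln_price)
all.equal(back, price)
```

The same idea applies to model results: a prediction made on the LN scale is passed through `exp()` to express it in the original units.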
Histograms are an effective tool for examining the distribution of the data.
LEFT SKEWED
Tail pulled out to LEFT
Low outliers
e.g. Human Lifespan
NORMAL/SYMMETRIC
Data appear in a symmetric bell-shaped curve
No graphic evidence of outliers
e.g. Test scores
RIGHT SKEWED
Tail pulled out to RIGHT
High outliers
e.g. Real Estate Data
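The three shapes above can be reproduced with simulated data. The sketch below is an illustration only: the samples are generated, not taken from any course dataset, and the variable names are hypothetical.

```r
# Three simulated samples, one for each shape (illustration only).
set.seed(2)
n <- 1000
right_skewed <- rlnorm(n, meanlog = 0, sdlog = 0.75)        # long right tail
symmetric    <- rnorm(n, mean = 75, sd = 8)                 # bell-shaped
left_skewed  <- 100 - rlnorm(n, meanlog = 2, sdlog = 0.75)  # mirrored: long left tail

# Histograms in the same order as the slide.
par(mfrow = c(1, 3))
hist(left_skewed,  main = "LEFT SKEWED")
hist(symmetric,    main = "NORMAL/SYMMETRIC")
hist(right_skewed, main = "RIGHT SKEWED")

# A rough numeric check: a long tail tends to pull the mean toward it,
# so the sign of (mean - median) hints at the direction of the skew.
c(left      = mean(left_skewed)  - median(left_skewed),
  symmetric = mean(symmetric)    - median(symmetric),
  right     = mean(right_skewed) - median(right_skewed))
```

The mean-versus-median comparison is only a rule of thumb; the histogram remains the primary tool.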
Below is the model output for a regression model relating the size of the living area of a house to its selling price.
What is the estimated selling price of a 2300 sq. ft. house, based on this model?
Round your answer to a whole dollar amount.
real_estate <- read_csv("data/Real_Estate.csv", show_col_types = FALSE)
Model Summary
----------------------------------------------------------------------
R                       0.772       RMSE                 45655.479
R-Squared               0.596       Coef. Var               27.670
Adj. R-Squared          0.594       MSE             2084422772.677
Pred R-Squared          0.579       MAE                  31692.288
----------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
------------------------------------------------------------------------------------
                        Sum of
                       Squares        DF           Mean Square          F      Sig.
------------------------------------------------------------------------------------
Regression    609852999259.857         1      609852999259.857    292.576    0.0000
Residual      412715708990.143       198        2084422772.677
Total        1022568708250.000       199
------------------------------------------------------------------------------------
Parameter Estimates
-------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta         t      Sig        lower        upper
-------------------------------------------------------------------------------------------------
(Intercept)    16505.199      9262.237                  1.782    0.076    -1760.095    34770.493
Living_Area       82.588         4.828        0.772    17.105    0.000       73.066       92.110
-------------------------------------------------------------------------------------------------
Focus on the Parameter Estimates table to answer this question:
Simple Linear Regression - One X variable
In this case, X is the size of the living area.
This model says that, regardless of other factors, a 2500 sq. ft. house has an estimated selling price of $222,975.
The model ignores number of bathrooms, age of house, etc.
These factors may also be helpful in explaining selling price.
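The estimate comes from plugging the living area into the fitted line. A minimal sketch in R, with the coefficients copied by hand from the Parameter Estimates table above; `est_price()` is a hypothetical helper written for this example, not a function from any package.

```r
# Coefficients copied from the SLR Parameter Estimates table.
b0 <- 16505.199   # (Intercept)
b1 <- 82.588      # Living_Area: dollars per square foot

# Hypothetical helper: estimated selling price for a given living area.
est_price <- function(living_area) b0 + b1 * living_area

round(est_price(2500))  # 222975
round(est_price(2300))  # 206458
```

With a fitted model object available, `predict()` would give the same values without copying coefficients by hand.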
The correlation between living area and selling price is 0.77.
This is a strong correlation, but maybe we can explain more of the variability in the data.
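For a model with one X variable, R-Squared is the square of the correlation, which ties the correlation here to the R-Squared value in the Model Summary:
\[ R^2 = r^2 = 0.772^2 \approx 0.596 \]
So living area alone explains about 60% of the variability in selling price, and about 40% is left unexplained.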
This linear regression model format can also be used if there are multiple explanatory (X) variables.
If a model has more than one X variable, it is a MULTIPLE LINEAR REGRESSION model.
We will examine one more dataset today to introduce this concept.
First let’s import and examine the data:
In R, as in most software, adding a variable to our model is as simple as addition.
The challenge is interpretation because we can no longer visualize the model.
There are 3-D visualization tools in R, BUT they are not always helpful.
Instead I recommend extending the SLR model output interpretation to the variables in the model.
On the next slide we’ll add number of bathrooms.
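As a sketch of what "as simple as addition" looks like in an R model formula, the example below fits an SLR and then an MLR with `lm()`. The data are simulated for illustration; the data frame `houses` and its column names are hypothetical and are not the course dataset.

```r
# Simulated data for illustration only (NOT Real_Estate.csv).
set.seed(3)
n <- 200
houses <- data.frame(
  living_area = round(runif(n, 800, 3500)),
  bathrooms   = sample(1:4, n, replace = TRUE)
)
houses$price <- 10000 + 60 * houses$living_area +
  35000 * houses$bathrooms + rnorm(n, sd = 40000)

# SLR: one X variable.
slr <- lm(price ~ living_area, data = houses)

# MLR: add another X variable with `+` in the model formula.
mlr <- lm(price ~ living_area + bathrooms, data = houses)

coef(slr)                # intercept and one slope
coef(mlr)                # intercept and two slopes
summary(slr)$r.squared
summary(mlr)$r.squared   # never lower than the SLR value
```

Each slope in the MLR is read the same way as in SLR, with the added condition that the other X variables are held fixed.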
Model Summary
----------------------------------------------------------------------
R                       0.815       RMSE                 41726.448
R-Squared               0.665       Coef. Var               25.289
Adj. R-Squared          0.661       MSE             1741096458.349
Pred R-Squared          0.640       MAE                  30629.922
----------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
------------------------------------------------------------------------------------
                        Sum of
                       Squares        DF           Mean Square          F      Sig.
------------------------------------------------------------------------------------
Regression    679572705955.336         2      339786352977.668    195.157    0.0000
Residual      342996002294.664       197        1741096458.349
Total        1022568708250.000       199
------------------------------------------------------------------------------------
Parameter Estimates
---------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta         t      Sig        lower        upper
---------------------------------------------------------------------------------------------------
(Intercept)   -11553.295      9556.111                 -1.209    0.228   -30398.701     7292.110
Living_Area       58.047         5.875        0.543     9.881    0.000       46.462       69.633
  Bathrooms    38141.447      6027.411        0.348     6.328    0.000    26254.916    50027.977
---------------------------------------------------------------------------------------------------
Model: \[ \text{Est. Selling Price} = -11553.295 + 58.047 \times \text{Living Area} + 38141.447 \times \text{Bathrooms} \]
Interpretation:
If number of bathrooms remains unchanged, each additional square foot is estimated to raise the selling price by about 58 dollars.
If living area remains unchanged, each additional bathroom will raise the estimated selling price by about 38 THOUSAND dollars.
Based on this model, if a house is renovated to increase the square footage by 1000 square feet and two bathrooms are added, what would be the estimated change in price?
Round your answer to a whole dollar amount.
Model: \[ \text{Est. Selling Price} = -11553.295 + 58.047 \times \text{Living Area} + 38141.447 \times \text{Bathrooms} \]
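One way to set up this calculation in R is sketched below, with the slopes copied by hand from the two-variable Parameter Estimates table. The variable names are made up for the example.

```r
# Slopes copied from the Parameter Estimates table for the two-variable model.
b_living_area <- 58.047      # dollars per additional square foot
b_bathrooms   <- 38141.447   # dollars per additional bathroom

# The intercept cancels when the same house is compared before and after
# the renovation, so only the changes in the X variables matter.
change <- b_living_area * 1000 + b_bathrooms * 2
round(change)  # 134330
```

Because the question asks for a change rather than a price, the original size of the house is not needed.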
Next, we add age of the house to the model:
Model Summary
----------------------------------------------------------------------
R                       0.821       RMSE                 41279.100
R-Squared               0.673       Coef. Var               25.018
Adj. R-Squared          0.668       MSE             1703964107.727
Pred R-Squared          0.641       MAE                  30119.407
----------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
------------------------------------------------------------------------------------
                        Sum of
                       Squares        DF           Mean Square          F      Sig.
------------------------------------------------------------------------------------
Regression    688591743135.442         3      229530581045.147    134.704    0.0000
Residual      333976965114.558       196        1703964107.727
Total        1022568708250.000       199
------------------------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta         t      Sig        lower        upper
--------------------------------------------------------------------------------------------------
(Intercept)     5775.299     12087.330                  0.478    0.633   -18062.622    29613.220
Living_Area       60.614         5.918        0.567    10.243    0.000       48.943       72.285
  Bathrooms    30089.928      6913.944        0.274     4.352    0.000    16454.654    43725.201
  House_Age     -235.721       102.458       -0.112    -2.301    0.022     -437.783      -33.658
--------------------------------------------------------------------------------------------------
Hopefully, the interpretation will seem redundant at this point…
Model: \[ \text{Est. Selling Price} = 5775.299 + 60.614 \times \text{Living Area} + 30089.928 \times \text{Bathrooms} - 235.721 \times \text{House Age} \]
Interpretation:
If number of bathrooms and age of the house remain unchanged, each additional square foot is estimated to raise the selling price by about 61 dollars.
If living area and age of the house remain unchanged, each additional bathroom will raise the estimated selling price by about 30 THOUSAND dollars.
If living area and number of bathrooms remain unchanged, each additional year will LOWER the estimated selling price by about 236 dollars.
What is the estimated price of a house that is 2500 square feet, has 4 bathrooms, and is 20 years old?
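A sketch of this calculation in R, with the coefficients copied by hand from the three-variable Parameter Estimates table. The vector names `b` and `x` are made up for the example.

```r
# Coefficients copied from the three-variable Parameter Estimates table.
b <- c(intercept   = 5775.299,
       living_area = 60.614,
       bathrooms   = 30089.928,
       house_age   = -235.721)

# Values for the house in the question; the leading 1 multiplies the intercept.
x <- c(1, 2500, 4, 20)

# Estimated price = intercept + sum of (slope * value) over the X variables.
est <- sum(b * x)
round(est)  # 272956
```

Note that the house age term lowers the estimate, since its coefficient is negative.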
Today we will focus on
How to navigate AND edit Quarto (.qmd) files
Getting started on HW 5.
Demonstration and Explanation of a Natural Log transformation
Two more In-class Exercises
In HW 5, you create a new variable, ln_Charges, the natural log of Charges.
Charges are the medical insurance charges for people in the dataset.
Based on the summary values and the histograms of these two variables, answer the following questions.
Question 4. The variable Charges is
left-skewed
normally distributed
right-skewed
Question 5. The transformed variable ln_Charges is
left-skewed
normally distributed
right-skewed
Multiple Linear Regression (MLR) is an extension of SLR where we ADD more variables to the model.
A key assumption of SLR and MLR is that the response, Y, is approximately normally distributed.
If the response is right-skewed, which is common in data having to do with money, a good strategy is to use a natural log transformation.
This process is illustrated in HW 5.
To submit an Engagement Question or Comment about material from Lecture 12: Submit by midnight today (day of lecture). Click on Link next to the ❓ under Lecture 12