More about Linear Regression Models in R
2025-02-13
HW 4 is due 2/12/2025
Quiz 1 will take place on Thursday 2/20 in class.
Continue discussion of Linear Regression Models in R.
Reading and interpreting regression output
Introduction to Multiple Linear Regression
New Skills from this week will not be on Quiz 1.
In-class Polling (Session ID: bua345s25)
In Lecture 8, we discussed the difference between the function of a line, f(x), and a simple linear regression model.
We discussed how a simple linear regression model looks just like a function,
BUT we interpret models differently.
Models are a simplification of real-world data.
We’ll start today with a model whose estimate is straightforward to interpret.
In this course we will use R and RStudio for the predictive analytics lectures.
You will access R and RStudio through Posit Cloud.
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
We will also use Posit Cloud for quiz questions on predictive analytics skills.
For those who want to download R and RStudio (not required):
In Lecture 8, we also discussed residuals.
Residual: vertical distance between the model line (red line) and the observed y value for an individual observation.
Residuals indicate the strength of the overall relationship and whether there are outliers.
The smaller the residuals are, the stronger the relationship is.
\(Residual = Y_{observed} - \hat{Y}\)
\(\hat{Y}\) is the estimated regression value of Y.
\[\hat{y} = 23.9316 - 0.01653247x\]
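As a minimal sketch of the residual calculation, the R snippet below applies the estimated model above to a single observation. The observation (x = 100, y = 25) is a hypothetical value chosen only for illustration; it is not taken from the lecture data.

```r
# Estimated model from above: y-hat = 23.9316 - 0.01653247 * x
b0 <- 23.9316
b1 <- -0.01653247

# Hypothetical observation (NOT from the lecture data)
x_obs <- 100
y_obs <- 25

y_hat <- b0 + b1 * x_obs      # model estimate of y at x_obs
residual <- y_obs - y_hat     # Residual = Y_observed - Y-hat

residual
```

A positive residual means the observed value sits above the model line; a negative residual means it sits below.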
True Population Model
\[y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\]
\(\beta_{0}\) is the y-intercept
\(\beta_{1}\) is the slope
\(e\) is the unexplained variability in Y
Estimated Sample Data Model
\[\hat{y} = b_{0} + b_{1}x\]
\(\hat{y}\) is model estimate of y from x
\(b_{0}\) is model estimate of y-intercept
\(b_{1}\) is model estimate of slope
An SLR model is only valid for straight-line relationships between X and Y.
If model is valid:
Model CAN be used to interpolate Y within the range of X used to build model.
Model CANNOT be used to extrapolate Y for an X outside of this range.
Why? … Because we don’t know if relationship is the same outside of this range.
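One way to enforce this rule in code is to return an estimate only when X falls inside the range used to build the model. The sketch below is illustrative: the helper name `predict_in_range` and the range of 100 to 200 are assumptions, not part of the lecture materials.

```r
# Estimate y only when x lies inside the range used to build the model.
# b0, b1, x_min and x_max are supplied by the user (hypothetical helper).
predict_in_range <- function(x, b0, b1, x_min, x_max) {
  inside <- x >= x_min & x <= x_max
  y_hat <- b0 + b1 * x
  y_hat[!inside] <- NA        # refuse to extrapolate outside the range
  y_hat
}

# Hypothetical X range of 100 to 200, used only for illustration
est <- predict_in_range(c(150, 250), b0 = -31.25, b1 = 0.613,
                        x_min = 100, x_max = 200)
est
```

The first value is an interpolation and gets an estimate; the second is outside the assumed range and is returned as `NA`.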
Correlation:
Specify Model:
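The two steps above can be sketched in base R with `cor()` and `lm()`. Because the lecture's `sw` data is not reproduced here, this sketch substitutes the `women` dataset that ships with base R (columns `height` and `weight`); that substitution is an assumption made only so the code runs on its own.

```r
# 'women' ships with base R; it stands in for the lecture's 'sw' data.
r <- cor(women$height, women$weight)       # correlation between X and Y

fit <- lm(weight ~ height, data = women)   # specify model: Y ~ X

r
summary(fit)   # coefficients, t tests, R-squared, F-statistic
```

In a simple linear regression, the square of the correlation equals the Multiple R-squared reported by `summary()`.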
Full Model Output Summary: Each line of the model table is a hypothesis test.
Call:
lm(formula = mass ~ height, data = sw)
Residuals:
Min 1Q Median 3Q Max
-39.006 -7.804 0.508 4.007 57.901
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.25047 12.81488 -2.439 0.0179 *
height 0.61273 0.07202 8.508 0.0000000000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.49 on 56 degrees of freedom
Multiple R-squared: 0.5638, Adjusted R-squared: 0.556
F-statistic: 72.38 on 1 and 56 DF, p-value: 0.00000000001138
The Sig column in the tables below is the P-value for each term.
\(\widehat{Mass} = -31.25 + 0.613 \times Height\)
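As a short worked example, the estimated model can be evaluated in R using the unrounded coefficients from the output above. The height of 180 is a hypothetical input chosen for illustration.

```r
# Coefficients from the lm() output above
b0 <- -31.25047   # (Intercept)
b1 <- 0.61273     # height

height_in <- 180                  # hypothetical height (assumption)
mass_hat <- b0 + b1 * height_in   # estimated mass at that height

round(mass_hat, 2)
```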
Model Summary
----------------------------------------------------------------
R 0.751 RMSE 19.153
R-Squared 0.564 MSE 366.835
Adj. R-Squared 0.556 Coef. Var 25.791
Pred R-Squared 0.537 AIC 513.082
MAE 12.868 SBC 519.263
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 27499.012 1 27499.012 72.378 0.0000
Residual 21276.434 56 379.936
Total 48775.446 57
---------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------
model         Beta         Std. Error   Std. Beta   t         Sig     lower        upper
--------------------------------------------------------------------------------------------
(Intercept)   -31.250      12.815                   -2.439    0.018   -56.922      -5.579
height        0.613        0.072        0.751       8.508     0.000   0.468        0.757
--------------------------------------------------------------------------------------------
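The summary statistics in this output are linked to the ANOVA table. As a rough check, the sketch below recomputes R-Squared, R, and F from the sums of squares and mean squares reported above; it uses only numbers already shown in the output.

```r
# Values copied from the ANOVA table above
ss_reg <- 27499.012    # Regression sum of squares
ss_tot <- 48775.446    # Total sum of squares
ms_reg <- 27499.012    # Regression mean square
ms_res <- 379.936      # Residual mean square

r_squared <- ss_reg / ss_tot   # share of variability in Y explained by X
r <- sqrt(r_squared)           # in SLR, |r| is the square root of R-squared
f_stat <- ms_reg / ms_res      # F statistic for the overall model

round(c(R2 = r_squared, R = r, F = f_stat), 3)
```

Taking the square root only recovers the size of the correlation; its sign matches the sign of the slope.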
Each line of the Parameter Estimates table is a two-sided hypothesis test:
(Intercept) line:
\(H_{0}: \beta_{0} = 0\)
\(H_{A}: \beta_{0} \neq 0\)
height line:
\(H_{0}: \beta_{1} = 0\)
\(H_{A}: \beta_{1} \neq 0\)
If the P-value < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{1} \neq 0\)
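The test for the slope can be reproduced by hand from the estimate and its standard error. This sketch uses the height line of the `lm()` output above and assumes \(\alpha = 0.05\); `pt()` is the base R cumulative distribution function for the t distribution.

```r
# Values from the 'height' line of the lm() output above
estimate  <- 0.61273
std_error <- 0.07202
df_resid  <- 56        # residual degrees of freedom
alpha     <- 0.05      # significance level (assumed)

t_value <- estimate / std_error                   # test statistic for H0: beta1 = 0
p_value <- 2 * pt(-abs(t_value), df = df_resid)   # two-sided P-value
reject_h0 <- p_value < alpha

t_value
p_value
reject_h0
```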
If the slope term is non-zero, and the correlation is moderate to strong:
Now let’s make a change to the Star Wars data and examine how it changes the correlation and the model.
Let’s limit the data to humans only
[1] 0.5363839
Model Summary
---------------------------------------------------------------
R 0.536 RMSE 15.903
R-Squared 0.288 MSE 252.896
Adj. R-Squared 0.248 Coef. Var 20.616
Pred R-Squared 0.083 AIC 173.417
MAE 9.758 SBC 176.404
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
-------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-------------------------------------------------------------------
Regression 2042.989 1 2042.989 7.271 0.0148
Residual 5057.929 18 280.996
Total 7100.918 19
-------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------
model         Beta         Std. Error   Std. Beta   t         Sig     lower        upper
--------------------------------------------------------------------------------------------
(Intercept)   -81.773      60.598                   -1.349    0.194   -209.084     45.539
height        0.905        0.336        0.536       2.696     0.015   0.200        1.610
--------------------------------------------------------------------------------------------
Use the regression output and the data to answer the following questions.
Question 5: What is the correlation, \(r_{xy}\) between mass and height in the Star Wars Humans data?
Question 6: How many outliers are there in this data subset model, i.e. observations far from the others?
Question 7: Is the slope term, \(\beta_{1}\), significant? Assume \(\alpha = 0.05\).
Question 8: If a human character is 190 cm tall, what is their estimated mass?
This regression model format can also be used if there are multiple explanatory (X) variables.
If a model has more than one X variable, it is a MULTIPLE LINEAR REGRESSION model.
We will examine one more dataset today to introduce this concept.
First let’s import and examine the data:
Below is the model output for a regression model relating the size of the living area of a house to its selling price.
What is the estimated selling price of a 2000 sq. ft. house, based on this model?
Round your answer to a whole dollar amount.
Model Summary
--------------------------------------------------------------------------
R 0.772 RMSE 45426.628
R-Squared 0.596 MSE 2063578544.951
Adj. R-Squared 0.594 Coef. Var 27.670
Pred R-Squared 0.579 AIC 4863.117
MAE 31692.288 SBC 4873.012
--------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------------------------
Regression 609852999259.857 1 609852999259.857 292.576 0.0000
Residual 412715708990.143 198 2084422772.677
Total 1022568708250.000 199
------------------------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------
model         Beta         Std. Error   Std. Beta   t         Sig     lower        upper
--------------------------------------------------------------------------------------------
(Intercept)   16505.199    9262.237                 1.782     0.076   -1760.095    34770.493
Living_Area   82.588       4.828        0.772       17.105    0.000   73.066       92.110
--------------------------------------------------------------------------------------------
Focus on the Parameter Estimates table to answer this question:
Simple Linear Regression - One X variable
In this case, X is the size of the living area.
This model says that, regardless of other factors, a 2500 sq. ft. house has an estimated selling price of $222,975.
The model ignores number of bathrooms, age of house, etc.
These factors may also be helpful in explaining selling price.
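The 2500 sq. ft. estimate quoted above can be reproduced directly from the Parameter Estimates table. This is a minimal sketch using only the two reported coefficients.

```r
# Coefficients from the simple linear regression output above
b0 <- 16505.199   # (Intercept)
b1 <- 82.588      # Living_Area

living_area <- 2500                  # square feet
price_hat <- b0 + b1 * living_area   # estimated selling price

round(price_hat)
```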
In R and most software, adding a variable to our model is as simple as addition.
The challenge is interpretation because we can no longer visualize the model.
There are 3-D visualization tools in R, BUT they are not always helpful.
Instead I recommend extending the SLR model output interpretation to the variables in the model.
On the next slide we’ll add number of bathrooms.
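In an R model formula, "addition" means literally writing `+` between X variables. The housing data is not reproduced here, so this sketch uses the `mtcars` dataset that ships with base R as a stand-in; that substitution is an assumption. For the housing data the call might look like `lm(Selling_Price ~ Living_Area + Bathrooms, data = houses)`, where `Selling_Price` and `houses` are hypothetical names.

```r
# 'mtcars' ships with base R; it stands in for the housing data.
slr <- lm(mpg ~ wt, data = mtcars)        # simple: one X variable
mlr <- lm(mpg ~ wt + hp, data = mtcars)   # multiple: add a second X with +

coef(slr)   # intercept and one slope
coef(mlr)   # intercept and two slopes
```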
Model Summary
--------------------------------------------------------------------------
R 0.815 RMSE 41412.317
R-Squared 0.665 MSE 1714980011.473
Adj. R-Squared 0.661 Coef. Var 25.289
Pred R-Squared 0.640 AIC 4828.109
MAE 30629.922 SBC 4841.302
--------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------------------------
Regression 679572705955.336 2 339786352977.668 195.157 0.0000
Residual 342996002294.664 197 1741096458.349
Total 1022568708250.000 199
------------------------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------
model         Beta         Std. Error   Std. Beta   t         Sig     lower        upper
--------------------------------------------------------------------------------------------
(Intercept)   -11553.295   9556.111                 -1.209    0.228   -30398.701   7292.110
Living_Area   58.047       5.875        0.543       9.881     0.000   46.462       69.633
Bathrooms     38141.447    6027.411     0.348       6.328     0.000   26254.916    50027.977
--------------------------------------------------------------------------------------------
Model: \[ \text{Est. Selling Price} = -11553.295 + 58.047 \times \text{Living Area} + 38141.447 \times \text{Bathrooms} \]
Interpretation:
If number of bathrooms remains unchanged, each additional square foot is estimated to raise the selling price by about 58 dollars.
If living area remains unchanged, each additional bathroom will raise the estimated selling price by about 38 THOUSAND dollars.
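The "holding the other variable unchanged" reading can be checked numerically: change one input by one unit, keep the other fixed, and the estimate moves by exactly that variable's coefficient. The sketch below assumes a hypothetical starting house of 2000 sq. ft. with 2 bathrooms.

```r
# Two-variable model from the output above
est_price <- function(living_area, bathrooms) {
  -11553.295 + 58.047 * living_area + 38141.447 * bathrooms
}

base <- est_price(2000, 2)    # hypothetical starting house (assumption)

est_price(2001, 2) - base     # one more sq. ft., bathrooms unchanged
est_price(2000, 3) - base     # one more bathroom, living area unchanged
```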
Based on this model, if a house is renovated to increase the square footage by 1000 square feet and two bathrooms are added, what would be the estimated change in price?
Round your answer to a whole dollar amount.
Model: \[ \text{Est. Selling Price} = -11553.295 + 58.047 \times \text{Living Area} + 38141.447 \times \text{Bathrooms} \]
Next, we add age of the house to the model:
Model Summary
--------------------------------------------------------------------------
R 0.821 RMSE 40864.224
R-Squared 0.673 MSE 1669884825.573
Adj. R-Squared 0.668 Coef. Var 25.018
Pred R-Squared 0.641 AIC 4824.780
MAE 30119.407 SBC 4841.271
--------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
------------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
------------------------------------------------------------------------------------
Regression 688591743135.442 3 229530581045.147 134.704 0.0000
Residual 333976965114.558 196 1703964107.727
Total 1022568708250.000 199
------------------------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------
model         Beta         Std. Error   Std. Beta   t         Sig     lower        upper
--------------------------------------------------------------------------------------------
(Intercept)   5775.299     12087.330                0.478     0.633   -18062.622   29613.220
Living_Area   60.614       5.918        0.567       10.243    0.000   48.943       72.285
Bathrooms     30089.928    6913.944     0.274       4.352     0.000   16454.654    43725.201
House_Age     -235.721     102.458      -0.112      -2.301    0.022   -437.783     -33.658
--------------------------------------------------------------------------------------------
Hopefully, the interpretation will seem redundant at this point…
Model: \[ \text{Est. Selling Price} = 5775.299 + 60.614 \times \text{Living Area} + 30089.928 \times \text{Bathrooms} - 235.721 \times \text{House Age} \]
Interpretation:
If number of bathrooms and age of the house remain unchanged, each additional square foot is estimated to raise the selling price by about 61 dollars.
If living area and age of the house remain unchanged, each additional bathroom will raise the estimated selling price by about 30 THOUSAND dollars.
If living area and number of bathrooms remain unchanged, each additional year of age will LOWER the estimated selling price by about 236 dollars.
What is the estimated price of a house that is 2500 square feet with 4 bathrooms and is 20 years old?
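With three or more X variables, it can be convenient to store the coefficients in a named vector and compute the estimate as a sum of products. The sketch below assumes a hypothetical house (2000 sq. ft., 2 bathrooms) and compares ages of 10 and 11 years; the function name `est_price` is illustrative.

```r
# Coefficients from the three-variable output above
coefs <- c(intercept = 5775.299, Living_Area = 60.614,
           Bathrooms = 30089.928, House_Age = -235.721)

est_price <- function(living_area, bathrooms, house_age) {
  x <- c(1, living_area, bathrooms, house_age)  # leading 1 multiplies the intercept
  sum(coefs * x)
}

newer <- est_price(2000, 2, 10)   # hypothetical house, 10 years old
older <- est_price(2000, 2, 11)   # same house, one year older

older - newer                     # change from one additional year of age
```

The difference equals the House_Age coefficient because living area and bathrooms are held fixed.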
Simple linear regression (SLR) models are similar in format to the function of a line.
The interpretation is different because SLR models are a simplification of the real world.
A model is only valid for the range of data used to create it.
Regression model output includes hypothesis tests of each model coefficient.
Multiple Linear Regression (MLR) is an extension of SLR where we ADD more variables to the model.
HW 4 is due 2/12/2025
To submit an Engagement Question or Comment about material from Lecture 9: Submit it by midnight today (day of lecture).