MAS 261 - Lecture 25
Simple Linear Regression Continued
Housekeeping
Today’s plan
Review of Simple Linear Regression (SLR) Concepts from Lecture 24
Function vs. Model
Examining Real Data
Creating a Model
Interpreting a Regression Model
Simple Linear Regression Continued
- More about Extrapolation
Examining Regression Model Output
Understanding hypotheses being tested
Interpreting regression model output
More Housekeeping and Upcoming Dates
Quiz 2 Scores and Solutions are posted.
Please go through your test carefully
If you missed a question due to a typo, please let me know.
I would be happy to go through any questions you missed with you.
HW 8 is now available and is due on Thursday, 12/6.
There will be no lecture on Thursday 11/22.
In-person Final Exam is on 12/16/24 at 5:15 PM
- Timed Remote option will be available at 8:30 PM on 12/16 and must be completed before 10:00 PM on 12/17.
R and RStudio
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
- I have added a new page to the MAS 261 website, Installing R and RStudio
Lecture 25 In-class Exercise - Q1
In lecture 24, we discussed the difference between a line function, f(x), and a simple linear regression model.
We use functions and models to do very similar mathematical calculations.
- We interpret them very differently
We’ll start today with a couple of calculations.
- Then review the concept of model vs. function.
Lecture 25 In-class Exercises - Q2
In lecture 24, we also discussed residuals.
Residual: vertical distance between the model line (red line) and the observed y value for an individual observation.
Residuals indicate the strength of the overall relationship and whether there are outliers.
The smaller the residuals are, the stronger the relationship is.
\(\text{Residual} = Y_{\text{observed}} - \hat{Y}\)
\(\hat{Y}\) is the estimated regression value of Y.
\[\hat{y} = 23.9316 - 0.01653247x\]
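As a minimal sketch of the residual calculation, the R lines below plug one observation into the estimated model above. The coefficients come from the equation shown; the observation (`x_obs`, `y_obs`) is hypothetical and is NOT taken from the lecture data.

```r
# Coefficients of the estimated model shown above.
b0 <- 23.9316
b1 <- -0.01653247

# Hypothetical observation (NOT from the lecture data), for illustration only.
x_obs <- 100
y_obs <- 25

y_hat <- b0 + b1 * x_obs    # estimated regression value of Y
residual <- y_obs - y_hat   # observed Y minus estimated Y

y_hat
residual
```

A positive residual means the observation lies above the model line; a negative residual means it lies below.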
Simple Linear Regression Model
True Population Model
\[y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\]
\(\beta_{0}\) is the y-intercept
\(\beta_{1}\) is the slope
\(e\) is the unexplained variability in Y
Estimated Sample Data Model
\[\hat{y} = b_{0} + b_{1}x\]
\(\hat{y}\) is model estimate of y from x
\(b_{0}\) is model estimate of y-intercept
\(b_{1}\) is model estimate of slope
Each \(e_{i}\) is a residual: the observed y minus the regression estimate of y.
\(e_{i} = y_{i} - \hat{y}_{i}\)
Software estimates the model with the smallest sum of squared residuals
- minimizes \(\sum_{i=1}^{n} e_{i}^{2}\)
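The idea of minimizing the sum of squared residuals can be sketched in base R. The data below are hypothetical toy values (not from the lecture), and `sse()` is a helper name invented for this illustration. The sketch compares the textbook closed-form least-squares estimates with what `lm()` returns, and then checks that nudging the coefficients makes the sum of squared residuals larger.

```r
# Hypothetical toy data (NOT from the lecture), used only to illustrate least squares.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for a candidate intercept b0 and slope b1.
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Closed-form least-squares estimates of the slope and intercept.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() fits the same line.
fit <- lm(y ~ x)
coef(fit)

# Any other line has a larger sum of squared residuals.
sse(b0, b1)
sse(b0 + 0.1, b1)
sse(b0, b1 - 0.1)
```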
Function of a Line vs. Regression Model
Function of a Line
\[y = mx + b\]
Exact, precise mathematical relationship with NO NOISE
Regression Model Equation
\[\hat{y} = b_{0} + b_{1}x\]
Estimated line that is simultaneously as close as possible to all observations.
Models ARE NOT Functions
Favorite Quote attributed to George Box:
“All models are wrong, but some are useful.”
Common student query:
If all models are wrong, why do we bother modeling?
Models are considered ‘wrong’ because they simplify the ‘messiness’ of the real world to a mathematical relationship.
Models can’t (and shouldn’t) include all the noise of real world data
- BUT models are still useful in understanding how variables are related to each other.
Yummy Example from Lecture 24
To make Russian Tea Cake Cookies, you need 6 tablespoons of powdered sugar to make 3 dozen cookies.
Here is the full recipe.
Here is the equation (y-intercept = 0):
\(y = 6x\)
Is this a function or a model?
Model Example from Lecture 24
The plot and model show the relationship between height and mass for all Star Wars characters for whom data were available.
How can we tell that this plot depicts a model and not a linear function?
Lecture 25 In-class Exercises - Q3-Q4
Question 3 - Extrapolation
- Can we use the Star Wars model to estimate the mass of a character that is 260 cm (8.5 feet) tall?
Question 4 - Interpolation
- There are no characters in this dataset that are exactly 140 cm tall. Can we use this model to estimate the mass of a 140 cm (4.6 feet) character?
Model Assumptions and Limitations
An SLR model is only valid for straight-line relationships between X and Y.
Correlation should also be moderate to strong
Next week: What to do if the relationship is curvilinear.
If model is valid:
Model CAN be used to interpolate Y within the range of X used to build model.
Model CANNOT be used to extrapolate Y for an X outside of this range.
Why? … Because we don’t know if relationship is the same outside of this range.
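One way to make the interpolation rule concrete is a small guard function, sketched below in base R. The coefficients are the Star Wars model coefficients reported in this lecture's output. The helper name `estimate_mass()` and the range `c(100, 200)` are assumptions made for illustration; with the real data the range would come from something like `range(sw$height)`.

```r
# Coefficients copied from the lecture's Star Wars model output (mass ~ height).
b0 <- -31.2504692
b1 <- 0.6127301

# Hypothetical helper: returns an estimate only when `height` lies inside
# `x_range`, the range of heights used to build the model.
estimate_mass <- function(height, x_range) {
  if (height < x_range[1] || height > x_range[2]) {
    return(NA_real_)  # outside the range: this would be extrapolation
  }
  b0 + b1 * height
}

# PLACEHOLDER range, assumed for illustration only. It is NOT the real range
# of the Star Wars heights; replace it with range(sw$height) for the real data.
x_range <- c(100, 200)

estimate_mass(140, x_range)  # inside the placeholder range: interpolation
estimate_mass(260, x_range)  # outside the placeholder range: returns NA
```

Returning `NA` rather than a number is a design choice here: it makes it hard to use an extrapolated value by accident.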
Examining Regression Model Output
Correlation:
[1] 0.7508582
Model Coefficients:
(Intercept)      height
-31.2504692   0.6127301
Full Model Output Summary: Each line of model table is a hypothesis test.
Call:
lm(formula = mass ~ height, data = sw)
Residuals:
    Min      1Q  Median      3Q     Max
-39.006  -7.804   0.508   4.007  57.901
Coefficients:
             Estimate Std. Error t value        Pr(>|t|)
(Intercept) -31.25047   12.81488  -2.439          0.0179 *
height        0.61273    0.07202   8.508 0.0000000000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.49 on 56 degrees of freedom
Multiple R-squared: 0.5638, Adjusted R-squared: 0.556
F-statistic: 72.38 on 1 and 56 DF, p-value: 0.00000000001138
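Several numbers in this output are tied together by simple arithmetic. The sketch below, using only values copied from the output above, checks three of those links: each t value is the estimate divided by its standard error, Multiple R-squared is the squared correlation (true for SLR), and the F-statistic is the squared t value of the slope (also specific to SLR). The variable names are chosen for this illustration.

```r
# Values copied from the correlation and summary output above.
r   <- 0.7508582
est <- c(intercept = -31.25047, height = 0.61273)
se  <- c(intercept = 12.81488,  height = 0.07202)

t_values  <- est / se                  # the "t value" column
r_squared <- r^2                       # "Multiple R-squared" in SLR
f_stat    <- t_values[["height"]]^2    # in SLR, F equals the squared slope t

round(t_values, 3)
round(r_squared, 4)
round(f_stat, 2)
```

Small differences in the last digit can appear because the inputs are already rounded in the printed output.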
SLR Model Output - More Readable
Sig below is the P-value for each term
\(\widehat{Mass} = -31.25 + 0.613 \times Height\)
Model Summary
----------------------------------------------------------------
R                       0.751       RMSE                 19.153
R-Squared               0.564       MSE                 366.835
Adj. R-Squared          0.556       Coef. Var            25.791
Pred R-Squared          0.537       AIC                 513.082
MAE                    12.868       SBC                 519.263
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
---------------------------------------------------------------------
                 Sum of
                Squares        DF    Mean Square         F      Sig.
---------------------------------------------------------------------
Regression    27499.012         1      27499.012    72.378    0.0000
Residual      21276.434        56        379.936
Total         48775.446        57
---------------------------------------------------------------------
Parameter Estimates
------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta         t      Sig      lower     upper
------------------------------------------------------------------------------------------
(Intercept)    -31.250        12.815                 -2.439    0.018    -56.922    -5.579
     height      0.613         0.072        0.751     8.508    0.000      0.468     0.757
------------------------------------------------------------------------------------------
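The ANOVA and Parameter Estimates tables can be checked by hand. The sketch below uses only numbers from the output in this lecture: sums of squares and degrees of freedom from the ANOVA table, and the unrounded slope estimate and standard error from the `lm()` summary. The 95% confidence level is an assumption (it is not stated in the table), but it reproduces the printed `lower` and `upper` values for height. The variable names are chosen for this illustration.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_reg <- 27499.012; df_reg <- 1
ss_res <- 21276.434; df_res <- 56
ss_tot <- 48775.446; df_tot <- 57

ms_reg    <- ss_reg / df_reg    # Mean Square, Regression row
ms_res    <- ss_res / df_res    # Mean Square, Residual row
f_stat    <- ms_reg / ms_res    # F
r_squared <- ss_reg / ss_tot    # R-Squared
resid_se  <- sqrt(ms_res)       # "Residual standard error" in the lm() summary

# The MSE and RMSE in the Model Summary appear to divide by n = df_tot + 1
# rather than by the residual degrees of freedom (an observation from the
# numbers, not something stated in the output).
mse_n  <- ss_res / (df_tot + 1)
rmse_n <- sqrt(mse_n)

# Confidence interval for the slope, ASSUMING a 95% level.
b1     <- 0.61273
se_b1  <- 0.07202
t_crit <- qt(0.975, df = df_res)
ci_b1  <- b1 + c(-1, 1) * t_crit * se_b1

round(c(ms_res = ms_res, f_stat = f_stat, r_squared = r_squared), 3)
round(c(resid_se = resid_se, mse_n = mse_n, rmse_n = rmse_n), 3)
round(ci_b1, 3)
```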
Two-sided Hypothesis Tests in Regression Output
(Intercept) line: If the P-value (Sig) < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{0} \neq 0\)
\(H_{0}: \beta_{0} = 0\)
\(H_{A}: \beta_{0} \neq 0\)
height line: If the P-value (Sig) < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{1} \neq 0\)
\(H_{0}: \beta_{1} = 0\)
\(H_{A}: \beta_{1} \neq 0\)
- If the slope term (\(\beta_{1}\)) is non-zero, and the correlation is moderate to strong, there is a significant relationship between x and y.
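The decision rule for these two-sided tests can be sketched in base R using the t values and residual degrees of freedom from the all-characters model output. The value \(\alpha = 0.05\) is an assumption for this sketch, and the variable names are chosen for illustration. The two-sided P-value is twice the tail area beyond \(|t|\) under a t distribution with 56 degrees of freedom.

```r
# Assumed significance level for this sketch.
alpha <- 0.05

# t values and residual degrees of freedom copied from the model output.
t_values <- c(intercept = -2.439, height = 8.508)
df_res   <- 56

# Two-sided P-value: area in both tails beyond |t|.
p_values <- 2 * pt(-abs(t_values), df = df_res)

# Reject H0 (coefficient equals 0) when the P-value is below alpha.
reject_h0 <- p_values < alpha

p_values
reject_h0
```

The `Sig` column shows 0.000 for height only because of rounding to three decimals; the P-value is very small, not exactly zero.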
Model of Star Wars Human Characters
- Now we filter the Star Wars Data to ‘Humans’ and examine how it changes the correlation and the model.
Star Wars Human Character Regression Model Output
Model Summary
---------------------------------------------------------------
R                       0.536       RMSE                 15.903
R-Squared               0.288       MSE                 252.896
Adj. R-Squared          0.248       Coef. Var            20.616
Pred R-Squared          0.083       AIC                 173.417
MAE                     9.758       SBC                 176.404
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
-------------------------------------------------------------------
                Sum of
               Squares        DF    Mean Square         F      Sig.
-------------------------------------------------------------------
Regression    2042.989         1       2042.989     7.271    0.0148
Residual      5057.929        18        280.996
Total         7100.918        19
-------------------------------------------------------------------
Parameter Estimates
-------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta         t      Sig       lower     upper
-------------------------------------------------------------------------------------------
(Intercept)    -81.773        60.598                 -1.349    0.194    -209.084    45.539
     height      0.905         0.336        0.536     2.696    0.015       0.200     1.610
-------------------------------------------------------------------------------------------
Lecture 25 In-class Exercises - Q5-Q8
Use the regression output and the data to answer the following questions.
Question 5: What is the correlation, \(r_{xy}\), between mass and height in the Star Wars Humans data?
Question 6: How many outliers are there in this data subset model, i.e. observations far from the others?
Follow-up: What is a reasonable explanation for these outliers?
Hint: Hollywood culture
Question 7: Is the slope term, \(\beta_{1}\), significant? Assume \(\alpha = 0.05\).
Question 8: If a human character is 190 cm tall, what is their estimated mass?
Key Points from Today
Simple linear regression (SLR) models are similar in format to the function of a line.
The interpretation is very different because SLR models are simplifications of the real world.
Box said “All models are wrong, but some are useful”
- Box is referring to the inherent simplification of modeling that leaves out the noise of the real world.
A model is only valid for the range of data used to create it.
- Outside of that range we are extrapolating, which is invalid.
Regression model output includes hypothesis tests of each model coefficient.
- For SLR, the hypothesis test of \(\beta_{1}\) is a primary indication of the validity of the model.
To submit an Engagement Question or Comment about material from Lecture 25: Submit it by midnight today (day of lecture).