Simple Linear Regression Continued
2024-11-18
Today’s plan 📋
Review of Simple Linear Regression (SLR) Concepts from Lecture 24
Function vs. Model
Examining Real Data
Creating a Model
Interpreting a Regression Model
Simple Linear Regression Continued
Examining Regression Model Output
Understanding hypotheses being tested
Interpreting regression model output
Quiz 2 Scores and Solutions are posted.
Please go through your test carefully
If you missed a question due to a typo, please let me know.
I would be happy to go through any questions you missed with you.
HW 8 is now available and is due on Thursday, 12/6.
There will be no lecture on Thursday 11/22.
In-person Final Exam is on 12/16/24 at 5:15 PM
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
In lecture 24, we discussed the difference between a line function, f(x), and a simple linear regression model.
We use functions and models to do very similar mathematical calculations.
We’ll start today with a couple of calculations.
In lecture 24, we also discussed residuals.
Residual: vertical distance between the model line (red line) and the observed y value for an individual observation.
Residuals indicate the strength of the overall relationship and whether there are outliers.
The smaller the residuals are, the stronger the relationship is.
\(Residual = Y_{observed} - \hat{Y}\)
\(\hat{Y}\) is the estimated regression value of Y.
\[\hat{y} = 23.9316 - 0.01653247x\]
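As a minimal sketch in R, the residual for one observation can be computed from the fitted equation above. The observation (x = 100, y = 25) is hypothetical and chosen only for illustration; it is not from the lecture data.

```r
# Coefficients taken from the fitted equation above
b0 <- 23.9316
b1 <- -0.01653247

# Hypothetical observation (not from the lecture data)
x_obs <- 100
y_obs <- 25

y_hat <- b0 + b1 * x_obs     # estimated regression value of Y
residual <- y_obs - y_hat    # observed Y minus estimated Y

y_hat
residual
```

A positive residual means the observation sits above the model line; a negative residual means it sits below.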
True Population Model
\[y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\]
\(\beta_{0}\) is the y-intercept
\(\beta_{1}\) is the slope
\(e_{i}\) is the unexplained variability in Y
Estimated Sample Data Model
\[\hat{y} = b_{0} + b_{1}x\]
\(\hat{y}\) is model estimate of y from x
\(b_{0}\) is model estimate of y-intercept
\(b_{1}\) is model estimate of slope
Each \(e_{i}\) is a residual.
Observed y minus the regression estimate of y:
\(e_{i} = y_{i} - \hat{y}_{i}\)
Software estimates the model with the smallest sum of squared residuals.
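The least-squares idea can be checked with a small sketch. The data below are hypothetical (made up for illustration), and `ssr()` is a helper defined here, not a built-in R function. The line fitted by `lm()` should have a smaller sum of squared residuals than any line we get by nudging its intercept or slope.

```r
# Hypothetical data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)
b <- coef(fit)                       # b[1] = intercept, b[2] = slope

# Sum of squared residuals for any candidate line
ssr <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

ssr_fit   <- ssr(b[1], b[2])         # least-squares line from lm()
ssr_shift <- ssr(b[1] + 0.5, b[2])   # same slope, intercept moved up
ssr_tilt  <- ssr(b[1], b[2] + 0.1)   # same intercept, steeper slope

c(fit = ssr_fit, shift = ssr_shift, tilt = ssr_tilt)
```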
Function of a Line
\[y = mx + b\]
Exact, precise mathematical relationship with NO NOISE
Regression Model Equation
\[\hat{y} = b_{0} + b_{1}x\]
Estimated line that is simultaneously as close as possible to all observations.
Favorite Quote attributed to George Box:
“All models are wrong, but some are useful.”
Common student query:
If all models are wrong, why do we bother modeling?
Models are considered ‘wrong’ because they simplify the ‘messiness’ of the real world to a mathematical relationship.
Models can’t (and shouldn’t) include all the noise of real world data
To make Russian Tea Cake Cookies, you need 6 tablespoons of powdered sugar to make 3 dozen cookies.
Here is the full recipe.
Here is the equation (y-intercept = 0):
\(y = 6x\)
Is this a function or a model?
The plot and model show the relationship between height and mass for all Star Wars characters for whom data were available.
How can we tell that this plot depicts a model and not a linear function?
Question 3 - Extrapolation
Question 4 - Interpolation
An SLR model is only valid for straight-line relationships between X and Y.
Correlation should also be moderate to strong.
Next week: What to do if the relationship is curvilinear.
If model is valid:
Model CAN be used to interpolate Y within the range of X used to build model.
Model CANNOT be used to extrapolate Y for an X outside of this range.
Why? … Because we don’t know if relationship is the same outside of this range.
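One way to enforce this rule in practice is sketched below. The helper `predict_in_range()` and the data are hypothetical (written for this example, not part of R or the course files). The helper returns an estimate only when the new X falls inside the range of X used to build the model.

```r
# Hypothetical helper: predict only inside the range of x used to fit the model
predict_in_range <- function(fit, x_new, x_range) {
  if (x_new < x_range[1] || x_new > x_range[2]) {
    return(NA_real_)   # extrapolation: no trustworthy estimate
  }
  unname(predict(fit, newdata = data.frame(x = x_new)))
}

# Hypothetical data, for illustration only
d <- data.frame(x = c(10, 20, 30, 40, 50),
                y = c(12, 19, 33, 41, 48))
fit <- lm(y ~ x, data = d)
x_range <- range(d$x)                # 10 to 50

predict_in_range(fit, 35, x_range)   # interpolation: returns an estimate
predict_in_range(fit, 80, x_range)   # extrapolation: returns NA
```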
Correlation:
[1] 0.7508582
Model Coefficients:
(Intercept) height
-31.2504692 0.6127301
Full Model Output Summary: Each line of model table is a hypothesis test.
Call:
lm(formula = mass ~ height, data = sw)
Residuals:
Min 1Q Median 3Q Max
-39.006 -7.804 0.508 4.007 57.901
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.25047 12.81488 -2.439 0.0179 *
height 0.61273 0.07202 8.508 0.0000000000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.49 on 56 degrees of freedom
Multiple R-squared: 0.5638, Adjusted R-squared: 0.556
F-statistic: 72.38 on 1 and 56 DF, p-value: 0.00000000001138
In the output below, Sig is the P-value for each term.
\(\widehat{Mass} = -31.25 + 0.613 \times Height\)
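As a quick sketch, the fitted equation above can be used to estimate mass for a given height. The height of 180 cm is a hypothetical input chosen for illustration, and it is assumed to fall within the range of heights used to build the model.

```r
# Rounded coefficients from the fitted equation above
b0 <- -31.25
b1 <- 0.613

height <- 180                  # hypothetical height in cm (assumed in range)
mass_hat <- b0 + b1 * height   # estimated mass

mass_hat
```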
Model Summary
----------------------------------------------------------------
R 0.751 RMSE 19.153
R-Squared 0.564 MSE 366.835
Adj. R-Squared 0.556 Coef. Var 25.791
Pred R-Squared 0.537 AIC 513.082
MAE 12.868 SBC 519.263
----------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
---------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------
Regression 27499.012 1 27499.012 72.378 0.0000
Residual 21276.434 56 379.936
Total 48775.446 57
---------------------------------------------------------------------
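The ANOVA table connects to the model summary through simple arithmetic. The sketch below uses only the numbers printed in the table above; the variable names are chosen here for readability.

```r
# Values copied from the ANOVA table above
ss_reg   <- 27499.012   # Regression sum of squares
ss_resid <- 21276.434   # Residual sum of squares
ss_total <- 48775.446   # Total sum of squares
df_reg   <- 1
df_resid <- 56

ms_reg   <- ss_reg / df_reg       # Mean Square for Regression
ms_resid <- ss_resid / df_resid   # Mean Square for Residual

r_squared <- ss_reg / ss_total    # proportion of variability in Y explained
f_stat    <- ms_reg / ms_resid    # F statistic

round(c(R2 = r_squared, R = sqrt(r_squared), F = f_stat), 3)
```

The results should agree, up to rounding, with the R-Squared, R, and F values reported in the output.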
Parameter Estimates
------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
------------------------------------------------------------------------------------------
(Intercept) -31.250 12.815 -2.439 0.018 -56.922 -5.579
height 0.613 0.072 0.751 8.508 0.000 0.468 0.757
------------------------------------------------------------------------------------------
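The lower and upper columns can be reproduced from the estimates and standard errors, assuming they are 95% confidence limits based on the t distribution with 56 residual degrees of freedom. A minimal sketch using the values printed in the output:

```r
# Estimates and standard errors from the model output
est <- c(intercept = -31.25047, height = 0.61273)
se  <- c(intercept = 12.81488,  height = 0.07202)
df_resid <- 56

t_crit <- qt(0.975, df = df_resid)   # critical t value for a 95% interval

lower <- est - t_crit * se
upper <- est + t_crit * se

round(cbind(lower, upper), 3)
```

Because the interval for height does not contain 0, it agrees with the hypothesis test result for the slope.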
(Intercept) line: If the P-value (Sig) < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{0} \neq 0\).
\(H_{0}: \beta_{0} = 0\)
\(H_{A}: \beta_{0} \neq 0\)
height line: If the P-value (Sig) < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{1} \neq 0\).
\(H_{0}: \beta_{1} = 0\)
\(H_{A}: \beta_{1} \neq 0\)
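Each test statistic in the coefficient table is the estimate divided by its standard error, and the P-value is the two-sided tail area of the t distribution. The sketch below recomputes both for the first model (all characters) from its printed values, assuming \(\alpha = 0.05\).

```r
# Estimates and standard errors from the first model's output
est <- c(intercept = -31.25047, height = 0.61273)
se  <- c(intercept = 12.81488,  height = 0.07202)
df_resid <- 56
alpha <- 0.05

t_value <- est / se                           # t = estimate / std. error
p_value <- 2 * pt(-abs(t_value), df_resid)    # two-sided P-value

reject_h0 <- p_value < alpha                  # TRUE means reject H0

data.frame(t_value = round(t_value, 3),
           p_value = signif(p_value, 3),
           reject_h0)
```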
Model Summary
---------------------------------------------------------------
R 0.536 RMSE 15.903
R-Squared 0.288 MSE 252.896
Adj. R-Squared 0.248 Coef. Var 20.616
Pred R-Squared 0.083 AIC 173.417
MAE 9.758 SBC 176.404
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
-------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-------------------------------------------------------------------
Regression 2042.989 1 2042.989 7.271 0.0148
Residual 5057.929 18 280.996
Total 7100.918 19
-------------------------------------------------------------------
Parameter Estimates
-------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
-------------------------------------------------------------------------------------------
(Intercept) -81.773 60.598 -1.349 0.194 -209.084 45.539
height 0.905 0.336 0.536 2.696 0.015 0.200 1.610
-------------------------------------------------------------------------------------------
Use the regression output and the data to answer the following questions.
Question 5: What is the correlation, \(r_{xy}\), between mass and height in the Star Wars Humans data?
Question 6: How many outliers are there in this data subset model, i.e. observations far from the others?
Follow-up: What is a reasonable explanation for these outliers?
Hint: Hollywood culture
Question 7: Is the slope term, \(\beta_{1}\), significant? Assume \(\alpha = 0.05\).
Question 8: If a human character is 190 cm tall, what is their estimated mass?
Simple linear regression (SLR) models are similar in format to the function of a line.
The interpretation is very different because SLR models are simplifications of the real world.
Box said “All models are wrong, but some are useful.”
A model is only valid for the range of data used to create it.
Regression model output includes hypothesis tests of each model coefficient.
To submit an Engagement Question or Comment about material from Lecture 25: Submit it by midnight today (day of lecture).