2023-11-28
Today’s plan 📋
Review of Simple Linear Regression (SLR) Concepts from Lecture 24
Function vs. Model
Examining Real Data
Creating a Model
Interpreting a Regression Model
Simple Linear Regression Continued
Examining Regression Model Output
Understanding hypotheses being tested
Interpreting regression model output
Review: You have two options to facilitate your introduction to R and RStudio:
If you are comfortable with coding: Start with Option 1, but still sign up for a Posit Cloud account.
If you are nervous about coding: Choose Option 2.
For both options: I can help with download/install issues during office hours.
What I do: I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.
NOTE: We will use R and RStudio in class during MOST lectures
Quick Note about Lecture 24
Typos in lecture and Student R lecture files have been fixed
Apologies for lack of recording
All new material from that lecture will be quickly reviewed today
HW 8 will be posted by Thursday and will include:
Portfolio calculations
Simple Linear Regression
Interpretation of Coefficients
Interpretation of Regression output
Estimating a residual
Conceptual/Calculation questions about extrapolation
Final Exam is on 12/19/23
In lecture 24, we discussed the difference between a linear function, f(x), and a simple linear regression model.
We use functions and models to do very similar mathematical calculations.
We’ll start today with a couple calculations.
In lecture 24, we also discussed residuals.
Residual: vertical distance between the model line (red line) and the observed y value for an individual observation.
Residuals indicate the strength of the overall relationship and whether there are outliers.
The smaller the residuals are, the stronger the relationship is.
\(Residual = Y_{observed} - \hat{Y}\)
\(\hat{Y}\) is the estimated regression value of Y.
\[\hat{y} = 23.9316 - 0.01653247x\]
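As a worked illustration of the residual formula, assume a hypothetical observation (not taken from the lecture data) with x = 100 and an observed Y of 24:

\[\hat{Y} = 23.9316 - 0.01653247(100) = 23.9316 - 1.653247 = 22.278353\]
\[Residual = 24 - 22.278353 = 1.721647\]

A positive residual means the observed value lies above the model line; a negative residual means it lies below.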
True Population Model
\[y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\]
\(\beta_{0}\) is the y-intercept
\(\beta_{1}\) is the slope
\(e_{i}\) is the unexplained variability in Y for observation i
Estimated Sample Data Model
\[\hat{y} = b_{0} + b_{1}x\]
\(\hat{y}\) is the model estimate of y from x
\(b_{0}\) is the model estimate of the y-intercept
\(b_{1}\) is the model estimate of the slope
Each \(e_{i}\) is a residual.
Residual = observed y minus the regression estimate of y:
\(e_{i} = y_{i} - \hat{y}_{i}\)
Software estimates the model with the smallest sum of squared residuals.
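A minimal R sketch of the "smallest sum of squared residuals" idea. The toy data and the helper name `ssr` are hypothetical, made up for illustration only. The sketch compares the sum of squared residuals for the line fitted by `lm()` with that of a line whose slope is slightly different.

```r
# Hypothetical toy data (not from the lecture), used only to illustrate least squares
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for any candidate line y-hat = b0 + b1 * x
ssr <- function(b0, b1) {
  y_hat <- b0 + b1 * x
  sum((y - y_hat)^2)
}

fit <- lm(y ~ x)           # least-squares fit
b <- unname(coef(fit))     # b[1] = intercept estimate, b[2] = slope estimate

ssr_fit   <- ssr(b[1], b[2])        # SSR of the fitted line
ssr_other <- ssr(b[1], b[2] + 0.1)  # SSR of a line with a slightly different slope

print(c(fitted = ssr_fit, other = ssr_other))
```

Any other intercept or slope you try should give a sum of squared residuals at least as large as the fitted line's.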
Function of a Line
\[y = mx + b\]
Exact, precise mathematical relationship with NO NOISE
Regression Model Equation
\[\hat{y} = b_{0} + b_{1}x\]
Estimated line that is simultaneously as close as possible to all observations.
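The contrast between a function and a model can be sketched in R. Everything here is simulated under assumed values: the slope 2, intercept 1, noise level, and seed are all hypothetical. When y is an exact function of x, the fitted line passes through every point; once noise is added, residuals appear.

```r
set.seed(25)  # arbitrary seed so the sketch is reproducible

x <- 1:20

# Function of a line: y is determined exactly by x (hypothetical m = 2, b = 1)
y_exact <- 2 * x + 1

# Model setting: the same underlying line plus random noise (hypothetical noise level)
y_noisy <- 2 * x + 1 + rnorm(length(x), mean = 0, sd = 3)

fit_exact <- lm(y_exact ~ x)
fit_noisy <- lm(y_noisy ~ x)

# Largest absolute residual in each case
max_resid_exact <- max(abs(resid(fit_exact)))
max_resid_noisy <- max(abs(resid(fit_noisy)))

print(c(exact = max_resid_exact, noisy = max_resid_noisy))
```

The exact case should show residuals that are zero up to rounding error, while the noisy case should not.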
Favorite Quote attributed to George Box:
“All models are wrong, but some are useful.”
Common student query:
If all models are wrong, why do we bother modeling?
Models are considered ‘wrong’ because they simplify the ‘messiness’ of the real world to a mathematical relationship.
Models can’t (and shouldn’t) include all the noise of real-world data.
To make 3 dozen Russian Tea Cake Cookies, you need 6 tablespoons of powdered sugar.
Here is the full recipe.
Here is the equation (y-intercept = 0):
\(y = 6x\)
Is this a function or a model?
The plot and model show the relationship between height and mass for all Star Wars characters for whom data were available.
How can we tell that this plot depicts a model and not a linear function?
Question 3 - Extrapolation
Question 4 - Interpolation
An SLR model is only valid for straight-line relationships between X and Y.
The correlation should also be moderate to strong.
Next week: What to do if the relationship is curvilinear.
If model is valid:
Model CAN be used to interpolate Y within the range of X used to build model.
Model CANNOT be used to extrapolate Y for an X outside of this range.
Why? … Because we don’t know if relationship is the same outside of this range.
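One way to make the interpolation rule concrete is a small guard function. This is a sketch only: `predict_in_range` is a hypothetical helper (not part of base R), and the coefficients and X range in the usage lines are made-up values.

```r
# Return y-hat only when x is inside the range of X used to build the model.
# b0, b1, x_min, and x_max are supplied by the user; nothing here is tied to a dataset.
predict_in_range <- function(x, b0, b1, x_min, x_max) {
  if (x < x_min || x > x_max) {
    return(NA_real_)  # extrapolation: decline to give an estimate
  }
  b0 + b1 * x         # interpolation: return the model estimate
}

# Hypothetical model y-hat = 10 + 2x built from X values between 0 and 50
inside  <- predict_in_range(20, b0 = 10, b1 = 2, x_min = 0, x_max = 50)
outside <- predict_in_range(80, b0 = 10, b1 = 2, x_min = 0, x_max = 50)
print(c(inside = inside, outside = outside))
```

Returning `NA` for out-of-range X is one possible design choice; a warning or an error would serve the same purpose.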
Correlation:
[1] 0.7508582
Model Coefficients:
(Intercept)      height
-31.2504692   0.6127301
Full Model Output Summary: Each line of the model's Coefficients table is a hypothesis test.
Call:
lm(formula = mass ~ height, data = sw)
Residuals:
    Min      1Q  Median      3Q     Max
-39.006  -7.804   0.508   4.007  57.901
Coefficients:
             Estimate Std. Error t value        Pr(>|t|)
(Intercept) -31.25047   12.81488  -2.439          0.0179 *
height        0.61273    0.07202   8.508 0.0000000000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.49 on 56 degrees of freedom
Multiple R-squared: 0.5638, Adjusted R-squared: 0.556
F-statistic: 72.38 on 1 and 56 DF, p-value: 0.00000000001138
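The columns of the Coefficients table are related to each other. As a sketch, the t values and P-values can be recomputed from the reported estimates and standard errors, assuming the usual two-sided t test with the 56 residual degrees of freedom shown in the output. The variable names are illustrative.

```r
# Values copied from the Coefficients table in the model output
est <- c(intercept = -31.25047, height = 0.61273)
se  <- c(intercept = 12.81488,  height = 0.07202)
df  <- 56  # residual degrees of freedom reported in the output

t_value <- est / se                    # t value = Estimate / Std. Error
p_value <- 2 * pt(-abs(t_value), df)   # two-sided P-value from the t distribution

print(round(t_value, 3))
print(signif(p_value, 3))
```

Small differences from the printed output can occur because the inputs here are already rounded.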
In the tables below, Sig is the P-value for each term.
\(\widehat{Mass} = -31.25 + 0.613 \times Height\)
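To illustrate how the fitted equation is used, take a hypothetical character who is 180 cm tall (a value chosen for illustration and assumed to fall within the range of observed heights):

\[\widehat{Mass} = -31.25 + 0.613(180) = -31.25 + 110.34 = 79.09\]

The slope says that each additional unit of height is associated with an estimated 0.613-unit increase in mass.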
Model Summary
---------------------------------------------------------------
R                     0.751      RMSE           19.492
R-Squared             0.564      Coef. Var      25.791
Adj. R-Squared        0.556      MSE           379.936
Pred R-Squared        0.537      MAE            12.868
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
---------------------------------------------------------------------
                 Sum of
                Squares    DF    Mean Square         F      Sig.
---------------------------------------------------------------------
Regression    27499.012     1      27499.012    72.378    0.0000
Residual      21276.434    56        379.936
Total         48775.446    57
---------------------------------------------------------------------
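The summary statistics reported earlier can be recovered from the ANOVA table using the standard definitions. Worked out with the values above:

\[R^{2} = \frac{SS_{Regression}}{SS_{Total}} = \frac{27499.012}{48775.446} \approx 0.564\]
\[MSE = \frac{SS_{Residual}}{DF_{Residual}} = \frac{21276.434}{56} \approx 379.936\]
\[F = \frac{MS_{Regression}}{MSE} = \frac{27499.012}{379.936} \approx 72.378\]
\[RMSE = \sqrt{MSE} = \sqrt{379.936} \approx 19.492\]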
Parameter Estimates
------------------------------------------------------------------------------------------
model            Beta  Std. Error  Std. Beta        t     Sig     lower    upper
------------------------------------------------------------------------------------------
(Intercept)   -31.250      12.815               -2.439   0.018   -56.922   -5.579
height          0.613       0.072      0.751     8.508   0.000     0.468    0.757
------------------------------------------------------------------------------------------
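The lower and upper columns appear to be 95% confidence limits for each coefficient. That is an assumption: the output does not state the confidence level. Under that assumption they can be sketched in R as estimate ± critical t × standard error, using the unrounded values from the `lm()` summary.

```r
# Estimates and standard errors copied from the lm() Coefficients table
est <- c(intercept = -31.25047, height = 0.61273)
se  <- c(intercept = 12.81488,  height = 0.07202)
df  <- 56                      # residual degrees of freedom

t_crit <- qt(0.975, df)        # critical t for an assumed 95% two-sided interval
lower <- est - t_crit * se
upper <- est + t_crit * se

print(round(cbind(lower, upper), 3))
```

If the interval for a coefficient does not contain 0, that agrees with rejecting \(H_{0}\) for that coefficient at \(\alpha = 0.05\).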
Each line of the Parameter Estimates table is a two-sided hypothesis test.
(Intercept) line:
\(H_{0}: \beta_{0} = 0\)
\(H_{A}: \beta_{0} \neq 0\)
height line:
\(H_{0}: \beta_{1} = 0\)
\(H_{A}: \beta_{1} \neq 0\)
If the P-value < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{1} \neq 0\)
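The decision rule can be written as a one-line comparison in R. This sketch assumes \(\alpha = 0.05\) and uses the P-values from the full-data model output; the variable names are illustrative.

```r
alpha <- 0.05  # assumed significance level

# P-values copied from the Pr(>|t|) column of the full-data model output
p_values <- c(intercept = 0.0179, height = 0.0000000000114)

# TRUE means: reject H0 for that coefficient
reject_h0 <- p_values < alpha
print(reject_h0)
```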
If the slope term is non-zero and the correlation is moderate to strong:
Now let's make a change to the Star Wars data and examine how it changes the correlation and the model.
Let’s limit the data to humans only
Model Summary
---------------------------------------------------------------
R                     0.536      RMSE           16.763
R-Squared             0.288      Coef. Var      20.616
Adj. R-Squared        0.248      MSE           280.996
Pred R-Squared        0.083      MAE             9.758
---------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
ANOVA
-------------------------------------------------------------------
                 Sum of
                Squares    DF    Mean Square         F      Sig.
-------------------------------------------------------------------
Regression     2042.989     1       2042.989     7.271    0.0148
Residual       5057.929    18        280.996
Total          7100.918    19
-------------------------------------------------------------------
Parameter Estimates
-------------------------------------------------------------------------------------------
model            Beta  Std. Error  Std. Beta        t     Sig     lower    upper
-------------------------------------------------------------------------------------------
(Intercept)   -81.773      60.598               -1.349   0.194  -209.084   45.539
height          0.905       0.336      0.536     2.696   0.015     0.200    1.610
-------------------------------------------------------------------------------------------
Use the regression output and the data to answer the following questions.
Question 5: What is the correlation, \(r_{xy}\), between mass and height in the Star Wars Humans data?
Question 6: How many outliers are there in this data subset model, i.e. observations far from the others?
Follow-up: What is a reasonable explanation for these outliers?
Hint: Hollywood culture
Question 7: Is the slope term, \(\beta_{1}\), significant? Assume \(\alpha = 0.05\).
Question 8: If a human character is 190 cm tall, what is their estimated mass?
Simple linear regression (SLR) models are similar in format to the function of a line.
The interpretation is very different because SLR models are simplifications of the real world.
Box said “All models are wrong, but some are useful.”
A model is only valid for the range of data used to create it.
Regression model output includes hypothesis tests of each model coefficient.
To submit an Engagement Question or Comment about material from Lecture 25: Submit by midnight today (day of lecture). Click on Link next to the ❓ under Lecture 25