Lecture 25 - Simple Linear Regression Continued

Penelope Pooler Eisenbies
MAS 261

2023-11-28

Housekeeping

  • Today’s plan 📋

    • Review of Simple Linear Regression (SLR) Concepts from Lecture 24

      • Function vs. Model

      • Examining Real Data

      • Creating a Model

      • Interpreting a Regression Model

    • Simple Linear Regression Continued

      • More about Extrapolation
    • Examining Regression Model Output

      • Understanding hypotheses being tested

      • Interpreting regression model output

Review: R and RStudio 🪄

  • Review: You have two options to facilitate your introduction to R and RStudio:

  • If you are comfortable with coding: Start with Option 1, but still sign up for Posit Cloud account.

    • We will use Posit Cloud for Quizzes.
  • If you are nervous about coding: Choose Option 2.

  • For both options: I can help with download/install issues during office hours.

  • What I do: I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.

  • NOTE: We will use R and RStudio in class during MOST lectures

    • You can use either Posit Cloud or your laptop.

Upcoming Dates

  • Quick Note about Lecture 24

    • Typos in lecture and Student R lecture files have been fixed

    • Apologies for lack of recording

    • All new material from that lecture will be quickly reviewed today

  • HW 8 will be posted by Thursday and will include:

    • Portfolio calculations

    • Simple Linear Regression

      • Interpretation of Coefficients

      • Interpretation of Regression output

      • Estimating a residual

      • Conceptual/Calculation questions about extrapolation

  • Final Exam is on 12/19/23

💥 Lecture 25 In-class Exercises - Q1 💥

  • In lecture 24, we discussed the difference between a line function, f(x), and a simple linear regression model.

  • We use functions and models to do very similar mathematical calculations.

    • We interpret them very differently
  • We’ll start today with a couple calculations.

    • Then review the concept of model vs. function.

Model in R

hp_cty_mod <- lm(mpg_c ~ hp, data=gt_cars)
hp_cty_mod$coefficients
(Intercept)          hp 
23.93159715 -0.01653247 

\[\hat{y} = 23.9316 - 0.01653247x\]

  • Question 1: What is the estimated City MPG for a vehicle with 800 horsepower?
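
A minimal sketch of the Question 1 arithmetic, assuming we plug the horsepower value straight into the fitted equation. The coefficients are copied from the output above; `estimate_city_mpg()` is a hypothetical helper name, not part of the lecture code.

```r
# Coefficients copied from the hp_cty_mod output shown above
b0 <- 23.93159715   # intercept
b1 <- -0.01653247   # slope: change in estimated City MPG per 1 hp

# Hypothetical helper (not in the lecture files): y-hat = b0 + b1 * x
estimate_city_mpg <- function(hp) {
  b0 + b1 * hp
}

mpg_800 <- estimate_city_mpg(800)
round(mpg_800, 2)   # 10.71
```

If the fitted model object is available, `predict(hp_cty_mod, newdata = data.frame(hp = 800))` should give the same estimate.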

💥 Lecture 25 In-class Exercises - Q2 💥

  • In lecture 24, we also discussed residuals.

  • Residual: vertical distance between the model line (red line) and the observed y value for an individual observation.

  • Residuals indicate the strength of the overall relationship and whether there are outliers.

    • The smaller the residuals are, the stronger the relationship is.

    • \(Residual = Y_{observed} - \hat{Y}\)

    • \(\hat{Y}\) is the estimated regression value of Y.

\[\hat{y} = 23.9316 - 0.01653247x\]

Model in R

hp_cty_mod <- lm(mpg_c ~ hp, data=gt_cars)
hp_cty_mod$coefficients
(Intercept)          hp 
23.93159715 -0.01653247 
  • Question 2: What is the City MPG residual for the BMW i8?

    • City MPG = 28
    • hp = 357
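
A short sketch of the residual calculation for Question 2, assuming the coefficients printed above and the two BMW i8 values given in the question. The variable names are illustrative only.

```r
# Coefficients copied from the hp_cty_mod output shown above
b0 <- 23.93159715
b1 <- -0.01653247

# Values for the BMW i8 given in the question
hp_i8      <- 357
mpg_obs_i8 <- 28

# Estimated City MPG from the model, then residual = observed - estimated
mpg_hat_i8 <- b0 + b1 * hp_i8
resid_i8   <- mpg_obs_i8 - mpg_hat_i8

round(c(estimate = mpg_hat_i8, residual = resid_i8), 2)
```

A positive residual means the observed value sits above the model line.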

Simple Linear Regression Model

True Population Model

\[y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\]

  • \(\beta_{0}\) is the y-intercept

  • \(\beta_{1}\) is the slope

  • \(e\) is the unexplained variability in Y

Estimated Sample Data Model

\[\hat{y} = b_{0} + b_{1}x\]

  • \(\hat{y}\) is model estimate of y from x

  • \(b_{0}\) is model estimate of y-intercept

  • \(b_{1}\) is model estimate of slope

  • Each \(e_{i}\) is a residual.

    • y obs. - reg. estimate of y

    • \(e_{i} = y_{i} - \hat{y}_{i}\)

  • Software estimates model with smallest sum of all squared residuals

    • minimizes \(\sum_{i=1}^ne_{i}^2\)
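
To illustrate what "smallest sum of squared residuals" means, here is a sketch on a small made-up dataset (the `x` and `y` values are hypothetical, not lecture data). It computes the slope and intercept with the standard least-squares formulas, compares them with `lm()`, and checks that nudging either coefficient makes the sum of squared residuals larger.

```r
# Hypothetical toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Standard least-squares formulas for slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same model estimated by R
fit <- lm(y ~ x)

# Sum of squared residuals for any candidate line
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

coef(fit)             # matches c(b0, b1)
sse(b0, b1)           # smallest possible value for these data
sse(b0 + 0.1, b1)     # larger: shifted intercept fits worse
sse(b0, b1 - 0.05)    # larger: changed slope fits worse
```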

Function of a Line vs. Regression Model

Function of a Line

\[y = mx + b\]

Exact, precise mathematical relationship with NO NOISE

Regression Model Equation

\[\hat{y} = b_{0} + b_{1}x\]

Estimated line that is simultaneously as close as possible to all observations.

Models ARE NOT Functions

Favorite Quote attributed to George Box:

“All models are wrong, but some are useful.”


Common student query:

If all models are wrong, why do we bother modeling?

Models are considered ‘wrong’ because they simplify the ‘messiness’ of the real world to a mathematical relationship.

Models can’t (and shouldn’t) include all the noise of real world data

  • BUT models are still useful in understanding how variables are related to each other.

Yummy Example from Lecture 24

To make Russian Tea Cake Cookies, you need 6 tablespoons of powdered sugar to make 3 dozen cookies.

Here is the full recipe.


Here is the equation (y-intercept = 0):

\(y = 6x\)


Is this a function or a model?

Model Example from Lecture 24

The plot and model show the relationship between height and mass for all Star Wars characters for whom data were available.


How can we tell that this plot depicts a model and not a linear function?

💥 Lecture 25 In-class Exercises - Q3 & Q4 💥

  • Question 3 - Extrapolation

    • Can we use the Star Wars model to estimate the mass of a character that is 260 cm (8.5 feet) tall?


  • Question 4 - Interpolation

    • There are no characters in this dataset that are exactly 140 cm tall. Can we use this model to estimate the mass of a 140 cm (4.6 feet) character?

Model Assumptions and Limitations

  • An SLR model is only valid for straight-line relationships between X and Y.

    • Correlation should also be moderate to strong

    • Next week: What to do if the relationship is curvilinear.

  • If model is valid:

    • Model CAN be used to interpolate Y within the range of X used to build model.

    • Model CANNOT be used to extrapolate Y for an X outside of this range.

    • Why? … Because we don’t know if relationship is the same outside of this range.
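
One way to make the interpolation rule concrete is a small guard that refuses to estimate outside the observed range of X. This is only a sketch: the `toy` data frame is made up for illustration (it is not the Star Wars data), and `estimate_mass()` is a hypothetical helper.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(height = c(150, 160, 170, 180, 190),
                  mass   = c(50, 58, 66, 75, 83))
toy_mod <- lm(mass ~ height, data = toy)

# Hypothetical helper: estimate only when new_height is inside the
# range of heights used to build the model; otherwise return NA
estimate_mass <- function(model, data, new_height) {
  rng <- range(data$height)
  if (new_height < rng[1] || new_height > rng[2]) {
    return(NA_real_)   # extrapolation: relationship unknown out here
  }
  unname(predict(model, newdata = data.frame(height = new_height)))
}

estimate_mass(toy_mod, toy, 175)   # inside 150-190: interpolation
estimate_mass(toy_mod, toy, 260)   # outside the range: NA
```

Note that `predict()` by itself will return a number for any X, so the range check has to be done by the analyst.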

Examining Regression Model Output

Correlation:

[1] 0.7508582

Model Coefficients:

(Intercept)      height 
-31.2504692   0.6127301 

Full Model Output Summary: Each line of the Coefficients table is a hypothesis test.


Call:
lm(formula = mass ~ height, data = sw)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.006  -7.804   0.508   4.007  57.901 

Coefficients:
             Estimate Std. Error t value        Pr(>|t|)    
(Intercept) -31.25047   12.81488  -2.439          0.0179 *  
height        0.61273    0.07202   8.508 0.0000000000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.49 on 56 degrees of freedom
Multiple R-squared:  0.5638,    Adjusted R-squared:  0.556 
F-statistic: 72.38 on 1 and 56 DF,  p-value: 0.00000000001138
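
The `t value` and `Pr(>|t|)` columns can be reproduced from the `Estimate` and `Std. Error` columns. The sketch below assumes the usual two-sided t-test with the residual degrees of freedom (56) and uses the rounded values printed above, so results agree with the table only up to rounding.

```r
# Values copied from the Coefficients table above
est <- c(intercept = -31.25047, height = 0.61273)
se  <- c(intercept = 12.81488,  height = 0.07202)
df_resid <- 56   # residual degrees of freedom from the output

# t value = estimate divided by its standard error
t_val <- est / se

# Two-sided P-value from the t distribution
p_val <- 2 * pt(-abs(t_val), df = df_resid)

round(t_val, 3)
signif(p_val, 3)
```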

SLR Model Output - More Readable

Sig below is the P-value for each term

\(\widehat{Mass} = -31.25 + 0.613 \times Height\)

                         Model Summary                          
---------------------------------------------------------------
R                       0.751       RMSE                19.492 
R-Squared               0.564       Coef. Var           25.791 
Adj. R-Squared          0.556       MSE                379.936 
Pred R-Squared          0.537       MAE                 12.868 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                ANOVA                                 
---------------------------------------------------------------------
                 Sum of                                              
                Squares        DF    Mean Square      F         Sig. 
---------------------------------------------------------------------
Regression    27499.012         1      27499.012    72.378    0.0000 
Residual      21276.434        56        379.936                     
Total         48775.446        57                                    
---------------------------------------------------------------------

                                   Parameter Estimates                                     
------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
------------------------------------------------------------------------------------------
(Intercept)    -31.250        12.815                 -2.439    0.018    -56.922    -5.579 
     height      0.613         0.072        0.751     8.508    0.000      0.468     0.757 
------------------------------------------------------------------------------------------
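
Several numbers in this output are connected by simple arithmetic. The sketch below recomputes them from the ANOVA table and the height row, assuming the standard SLR definitions and a 95% confidence level for `lower` and `upper`. Inputs are the rounded values printed above, so small rounding differences are expected.

```r
# Values copied from the ANOVA table above
ss_reg <- 27499.012   # Regression sum of squares
ss_res <- 21276.434   # Residual sum of squares
ss_tot <- 48775.446   # Total sum of squares
df_res <- 56          # Residual degrees of freedom

r_squared <- ss_reg / ss_tot      # R-Squared
mse       <- ss_res / df_res      # Mean Square Error
rmse      <- sqrt(mse)            # Root Mean Square Error
f_stat    <- (ss_reg / 1) / mse   # F = Regression MS / Residual MS

# Confidence interval for the slope (95% level is an assumption)
b1    <- 0.61273
se_b1 <- 0.07202
ci_slope <- b1 + c(-1, 1) * qt(0.975, df_res) * se_b1

round(c(r_squared = r_squared, mse = mse, rmse = rmse, f_stat = f_stat), 3)
round(ci_slope, 3)
```

In SLR, R-Squared is also the square of the correlation, and F is the square of the slope's t value.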

Hypothesis Test on Each Line of Regression Output

  • Each line of the Regression Parameter Estimates table is a two-sided hypothesis test:

(Intercept) Line:

\(H_{0}: \beta_{0} = 0\)

\(H_{A}: \beta_{0} \neq 0\)

  • If the P-value < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{0} \neq 0\)

height Line:

\(H_{0}: \beta_{1} = 0\)

\(H_{A}: \beta_{1} \neq 0\)

  • If the P-value < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{1} \neq 0\)

  • If the slope term is non-zero, and the correlation is moderate to strong:

    • there is a relationship between x and y

Model of Star Wars Human Characters

  • Now let’s make a change to the Star Wars data and examine how it changes the correlation and the model.

  • Let’s limit the data to humans only

Star Wars Human Character Regression Model Output

                         Model Summary                          
---------------------------------------------------------------
R                       0.536       RMSE                16.763 
R-Squared               0.288       Coef. Var           20.616 
Adj. R-Squared          0.248       MSE                280.996 
Pred R-Squared          0.083       MAE                  9.758 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                               ANOVA                                
-------------------------------------------------------------------
                Sum of                                             
               Squares        DF    Mean Square      F        Sig. 
-------------------------------------------------------------------
Regression    2042.989         1       2042.989    7.271    0.0148 
Residual      5057.929        18        280.996                    
Total         7100.918        19                                   
-------------------------------------------------------------------

                                    Parameter Estimates                                     
-------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta      t        Sig        lower     upper 
-------------------------------------------------------------------------------------------
(Intercept)    -81.773        60.598                 -1.349    0.194    -209.084    45.539 
     height      0.905         0.336        0.536     2.696    0.015       0.200     1.610 
-------------------------------------------------------------------------------------------

💥 Lecture 25 In-class Exercises - Q5 - Q8 💥


Use the regression output and the data to answer the following questions.


  • Question 5: What is the correlation, \(r_{xy}\), between mass and height in the Star Wars Humans data?

  • Question 6: How many outliers are there in this data subset model, i.e. observations far from the others?

    • Follow-up: What is a reasonable explanation for these outliers?

    • Hint: Hollywood culture

  • Question 7: Is the slope term, \(\beta_{1}\), significant? Assume \(\alpha = 0.05\).

  • Question 8: If a human character is 190 cm tall, what is their estimated mass?
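
A sketch of the calculations behind Questions 7 and 8, assuming the rounded values from the height row of the humans-only Parameter Estimates table. The model estimates mass from height; the variable names are illustrative only.

```r
# Values copied from the humans-only Parameter Estimates table
b0_h    <- -81.773   # intercept
b1_h    <- 0.905     # slope for height
p_slope <- 0.015     # Sig (P-value) for the height line
alpha   <- 0.05

# Question 7: reject H0 (slope = 0) when the P-value is below alpha
slope_significant <- p_slope < alpha

# Question 8: estimated mass for a height of 190 cm
mass_hat_190 <- b0_h + b1_h * 190

slope_significant
round(mass_hat_190, 1)
```

Before trusting the estimate, check that 190 cm falls inside the range of heights in the humans-only data.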

Key Points from Today and from Lecture 24

  • Simple linear regression (SLR) models are similar in format to the function of a line.

  • The interpretation is very different because SLR models are simplifications of the real world.

  • Box said “All models are wrong, but some are useful”

    • Box is referring to the inherent simplification of modeling that leaves out the noise of the real world.
  • A model is only valid for the range of data used to create it.

    • Outside of that range we are extrapolating, which is invalid.
  • Regression model output includes hypothesis tests of each model coefficient.

    • For SLR, the hypothesis test of \(\beta_{1}\) is a primary indication of the validity of the model.

To submit an Engagement Question or Comment about material from Lecture 25: Submit by midnight today (day of lecture). Click on the link under Lecture 25.