Housekeeping

  • Today’s plan 📋

  • Review of Simple Linear Regression (SLR) Concepts from Lecture 24

    • Function vs. Model

    • Examining Real Data

    • Creating a Model

    • Interpreting a Regression Model

  • Simple Linear Regression Continued

    • More about Extrapolation

  • Examining Regression Model Output

    • Understanding hypotheses being tested

    • Interpreting regression model output

More Housekeeping and Upcoming Dates

  • Quiz 2 Scores and Solutions are posted.

    • Please go through your quiz carefully.

    • If you missed a question due to a typo, please let me know.

    • I would be happy to go through any questions you missed with you.

  • HW 8 is now available and is due on Thursday, 12/6.

  • There will be no lecture on Thursday 11/22.

  • In-person Final Exam is on 12/16/24 at 5:15 PM

    • Timed Remote option will be available at 8:30 PM on 12/16 and must be completed before 10:00 PM on 12/17.

R and RStudio

  • In this course we will use R and RStudio to understand statistical concepts.

  • You will access R and RStudio through Posit Cloud.

  • I will post R/RStudio files on Posit Cloud that you can access in provided links.

  • I will also provide demo videos that show how to access files and complete exercises.

  • NOTE: The free Posit Cloud account is limited to 25 hours per month.

    • I demo how to download completed work so that you can use this allotment efficiently.

    • For those who want to go further with R/RStudio:

💥 Lecture 25 In-class Exercise - Q1 💥

  • In lecture 24, we discussed the difference between the function of a line, f(x), and a simple linear regression model.

  • We use functions and models to do very similar mathematical calculations.

    • We interpret them very differently.

  • We’ll start today with a couple of calculations.

    • Then review the concept of model vs. function.

Model in R

hp_cty_mod <- lm(mpg_c ~ hp, data=gt_cars)
hp_cty_mod$coefficients
(Intercept)          hp 
23.93159715 -0.01653247 

\[\hat{y} = 23.9316 - 0.01653247x\]

  • Question 1: What is the estimated City MPG for a vehicle with 800 horsepower?
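
As a minimal sketch of how an estimate comes out of the fitted equation, the coefficients printed above can be plugged in by hand. The 300 hp input is an illustrative value, not a car from gt_cars, and the coefficients are typed in from the output because the dataset is not loaded here.

```r
# Coefficients copied from hp_cty_mod$coefficients above
b0 <- 23.93159715   # intercept
b1 <- -0.01653247   # slope: change in estimated city MPG per 1 hp

# Illustrative input (an assumption, not a car from gt_cars)
hp_new <- 300

# Estimated city MPG: y-hat = b0 + b1 * x
mpg_hat <- b0 + b1 * hp_new
round(mpg_hat, 2)   # 18.97
```

With the fitted model object available, predict(hp_cty_mod, newdata = data.frame(hp = 300)) should give the same estimate.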

💥 Lecture 25 In-class Exercises - Q2 💥

  • In lecture 24, we also discussed residuals.

  • Residual: vertical distance between the model line (red line) and the observed y value for an individual observation.

  • Residuals indicate the strength of the overall relationship and whether there are outliers.

    • The smaller the residuals are, the stronger the relationship is.

    • \(Residual = Y_{observed} - \hat{Y}\)

    • \(\hat{Y}\) is the estimated regression value of Y.

\[\hat{y} = 23.9316 - 0.01653247x\]

Model in R

hp_cty_mod <- lm(mpg_c ~ hp, data=gt_cars)
hp_cty_mod$coefficients
(Intercept)          hp 
23.93159715 -0.01653247 

  • Question 2: What is the City MPG residual for the BMW i8?

    • City MPG = 28
    • hp = 357
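
The residual formula above can be sketched with the same coefficients. The observation below (200 hp, 22 city MPG) is hypothetical, made up for illustration, and is not a car from gt_cars.

```r
# Coefficients copied from hp_cty_mod$coefficients above
b0 <- 23.93159715
b1 <- -0.01653247

# Hypothetical observation (made up for illustration)
hp_obs  <- 200
mpg_obs <- 22

mpg_hat  <- b0 + b1 * hp_obs    # estimated city MPG, about 20.63
residual <- mpg_obs - mpg_hat   # observed minus estimated, about 1.37
residual > 0                    # TRUE: the point sits above the model line
```

A positive residual means the model underestimates that observation; a negative residual means it overestimates.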

Simple Linear Regression Model

True Population Model

\[y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\]

  • \(\beta_{0}\) is the y-intercept

  • \(\beta_{1}\) is the slope

  • \(e_{i}\) is the unexplained variability in Y

Estimated Sample Data Model

\[\hat{y} = b_{0} + b_{1}x\]

  • \(\hat{y}\) is the model estimate of y from x

  • \(b_{0}\) is the model estimate of the y-intercept

  • \(b_{1}\) is the model estimate of the slope

  • Each \(e_{i}\) is a residual.

    • observed y minus the regression estimate of y

    • \(e_{i} = y_{i} - \hat{y}_{i}\)

  • Software estimates the model with the smallest sum of squared residuals

    • minimizes \(\sum_{i=1}^{n} e_{i}^{2}\)
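
A minimal sketch of what "smallest sum of squared residuals" means, using simulated toy data (not course data). The closed-form least-squares formulas used here, slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x), are standard results rather than something stated on these slides.

```r
set.seed(25)                          # reproducible toy data (simulated)
x <- runif(40, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(40, sd = 1.5)

# Least-squares estimates from the closed-form formulas
b1 <- cov(x, y) / var(x)              # slope
b0 <- mean(y) - b1 * mean(x)          # intercept

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))   # TRUE: lm() finds the same line

# Sum of squared residuals for any candidate intercept a and slope b
sse <- function(a, b) sum((y - (a + b * x))^2)

sse(b0, b1) < sse(b0 + 0.5, b1)       # TRUE: shifting the intercept increases SSE
sse(b0, b1) < sse(b0, b1 + 0.1)       # TRUE: changing the slope increases SSE
```

Any other line through the same data has a larger sum of squared residuals than the one lm() reports.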

Function of a Line vs. Regression Model

Function of a Line

\[y = mx + b\]

Exact mathematical relationship with NO NOISE

Regression Model Equation

\[\hat{y} = b_{0} + b_{1}x\]

Estimated line that is simultaneously as close as possible to all observations.
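
The contrast can be sketched in R with made-up numbers: points generated by a function sit exactly on the line, while noisy data only scatter around an estimated line. The line y = 2x + 3 and the noise level are assumptions chosen for illustration.

```r
x <- 1:10

# Function of a line: every point falls exactly on y = 2x + 3
y_exact <- 2 * x + 3

# Model setting: the same trend plus random noise (simulated)
set.seed(24)
y_noisy <- 2 * x + 3 + rnorm(length(x), sd = 2)

fit_noisy <- lm(y_noisy ~ x)

max(abs(y_exact - (2 * x + 3)))   # 0: no noise, so no residuals
round(resid(fit_noisy), 2)        # nonzero residuals around the fitted line
coef(fit_noisy)                   # estimated intercept and slope, not exactly 3 and 2
```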

Models ARE NOT Functions

Favorite Quote attributed to George Box:

“All models are wrong, but some are useful.”


Common student query:

If all models are wrong, why do we bother modeling?

Models are considered ‘wrong’ because they simplify the ‘messiness’ of the real world to a mathematical relationship.

Models can’t (and shouldn’t) include all the noise of real-world data.

  • BUT models are still useful in understanding how variables are related to each other.

Yummy Example from Lecture 24

To make Russian Tea Cake Cookies, you need 6 tablespoons of powdered sugar to make 3 dozen cookies.

Here is the full recipe.


Here is the equation (y-intercept = 0):

\(y = 6x\)


Is this a function or a model?

Model Example from Lecture 24

The plot and model show the relationship between height and mass for all Star Wars characters for whom data were available.


How can we tell that this plot depicts a model and not a linear function?

💥 Lecture 25 In-class Exercises - Q3-Q4 💥

  • Question 3 - Extrapolation

    • Can we use the Star Wars model to estimate the mass of a character that is 260 cm (8.5 feet) tall?


  • Question 4 - Interpolation

    • There are no characters in this dataset that are exactly 140 cm tall. Can we use this model to estimate the mass of a 140 cm (4.6 feet) character?

Model Assumptions and Limitations

  • An SLR model is only valid for straight-line relationships between X and Y.

    • Correlation should also be moderate to strong.

    • Next week: What to do if the relationship is curvilinear.

  • If the model is valid:

    • Model CAN be used to interpolate Y within the range of X used to build the model.

    • Model CANNOT be used to extrapolate Y for an X outside of this range.

    • Why? … Because we don’t know if the relationship is the same outside of this range.
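
One way to make the range rule concrete is a small check run before estimating. The helper name is_interpolation and the x values are hypothetical, made up for illustration.

```r
# TRUE when x_new lies inside the range of x values used to build the model
is_interpolation <- function(x_new, x_used) {
  x_new >= min(x_used, na.rm = TRUE) & x_new <= max(x_used, na.rm = TRUE)
}

# Hypothetical x values used to fit a model (made up for illustration)
x_used <- c(12, 18, 25, 31, 40, 47)

is_interpolation(30, x_used)   # TRUE: 30 is between 12 and 47
is_interpolation(75, x_used)   # FALSE: 75 is above the largest x used
```

A value that passes this check can still be a poor estimate if the relationship is not linear, so the check complements the validity conditions above rather than replacing them.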

Examining Regression Model Output

Correlation:

[1] 0.7508582

Model Coefficients:

(Intercept)      height 
-31.2504692   0.6127301 

Full Model Output Summary: Each line of model table is a hypothesis test.


Call:
lm(formula = mass ~ height, data = sw)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.006  -7.804   0.508   4.007  57.901 

Coefficients:
             Estimate Std. Error t value        Pr(>|t|)    
(Intercept) -31.25047   12.81488  -2.439          0.0179 *  
height        0.61273    0.07202   8.508 0.0000000000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.49 on 56 degrees of freedom
Multiple R-squared:  0.5638,    Adjusted R-squared:  0.556 
F-statistic: 72.38 on 1 and 56 DF,  p-value: 0.00000000001138
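
The columns of the Coefficients table are linked by simple arithmetic, sketched here with values typed in from the output above (the sw data are not loaded). Because the inputs are rounded, the results match the table only up to rounding.

```r
# Values copied from the Coefficients table above
est <- c(intercept = -31.25047, height = 0.61273)
se  <- c(intercept = 12.81488,  height = 0.07202)
df  <- 56                                 # residual degrees of freedom

t_val <- est / se                         # t value = Estimate / Std. Error
p_val <- 2 * pt(-abs(t_val), df = df)     # two-sided p-value from the t distribution

round(t_val, 3)     # intercept -2.439, height 8.508
signif(p_val, 3)    # compare with the Pr(>|t|) column

# In simple linear regression, Multiple R-squared is the squared correlation
round(0.7508582^2, 4)   # 0.5638
```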

SLR Model Output - More Readable

Sig below is the P-value for each term

\(\widehat{Mass} = -31.25 + 0.613 \times Height\)

                         Model Summary                           
----------------------------------------------------------------
R                        0.751       RMSE                19.153 
R-Squared                0.564       MSE                366.835 
Adj. R-Squared           0.556       Coef. Var           25.791 
Pred R-Squared           0.537       AIC                513.082 
MAE                     12.868       SBC                519.263 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                 Sum of                                              
                Squares        DF    Mean Square      F         Sig. 
---------------------------------------------------------------------
Regression    27499.012         1      27499.012    72.378    0.0000 
Residual      21276.434        56        379.936                     
Total         48775.446        57                                    
---------------------------------------------------------------------

                                   Parameter Estimates                                     
------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
------------------------------------------------------------------------------------------
(Intercept)    -31.250        12.815                 -2.439    0.018    -56.922    -5.579 
     height      0.613         0.072        0.751     8.508    0.000      0.468     0.757 
------------------------------------------------------------------------------------------
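
Several numbers in these tables can be rebuilt from the ANOVA sums of squares, which helps show how the pieces fit together. This is a sketch using values typed in from the output above. The note about the Model Summary MSE is an inference from the arithmetic, not something the output states.

```r
# Values copied from the ANOVA table above
ss_reg <- 27499.012; ss_res <- 21276.434; ss_tot <- 48775.446
df_reg <- 1;         df_res <- 56

ms_res <- ss_res / df_res             # about 379.936 (Residual Mean Square)
f_stat <- (ss_reg / df_reg) / ms_res  # about 72.38 (F)
r_sq   <- ss_reg / ss_tot             # about 0.564 (R-Squared)
sqrt(ms_res)                          # about 19.49, the residual standard error

# The Model Summary MSE (366.835) appears to divide by n = 58 rather than 56
ss_res / 58

# 95% confidence interval for the slope: estimate +/- t* x SE
b1 <- 0.61273; se_b1 <- 0.07202       # from the lm() summary
t_star <- qt(0.975, df = df_res)      # about 2.003
ci <- b1 + c(-1, 1) * t_star * se_b1
round(ci, 3)                          # 0.468 0.757, the lower and upper columns
```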

Two-sided Hypothesis Tests in Regression Output

(Intercept) Line: If the P-value (Sig) < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{0} \neq 0\)

\(H_{0}: \beta_{0} = 0\)

\(H_{A}: \beta_{0} \neq 0\)

height Line: If the P-value (Sig) < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{1} \neq 0\)

\(H_{0}: \beta_{1} = 0\)

\(H_{A}: \beta_{1} \neq 0\)

  • If we conclude that the slope term (\(\beta_{1}\)) is non-zero, and the correlation is moderate to strong, there is a significant linear relationship between x and y.
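
The decision rule can be written as a one-line comparison. This sketch uses the P-values from the all-characters lm() output earlier in the lecture. The second comparison with alpha = 0.01 is added only to show that the conclusion depends on the chosen alpha.

```r
# P-values copied from the all-characters Coefficients table
p_values <- c(intercept = 0.0179, height = 0.0000000000114)

# Reject H0 for a term when its P-value is below alpha
alpha <- 0.05
p_values < alpha    # intercept TRUE, height TRUE: reject H0 for both terms

# A stricter alpha changes the conclusion for the intercept only
p_values < 0.01     # intercept FALSE, height TRUE
```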

Model of Star Wars Human Characters

  • Now we filter the Star Wars Data to ‘Humans’ and examine how it changes the correlation and the model.

Star Wars Human Character Regression Model Output

                         Model Summary                          
---------------------------------------------------------------
R                       0.536       RMSE                15.903 
R-Squared               0.288       MSE                252.896 
Adj. R-Squared          0.248       Coef. Var           20.616 
Pred R-Squared          0.083       AIC                173.417 
MAE                     9.758       SBC                176.404 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                
-------------------------------------------------------------------
                Sum of                                             
               Squares        DF    Mean Square      F        Sig. 
-------------------------------------------------------------------
Regression    2042.989         1       2042.989    7.271    0.0148 
Residual      5057.929        18        280.996                    
Total         7100.918        19                                   
-------------------------------------------------------------------

                                    Parameter Estimates                                     
-------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta      t        Sig        lower     upper 
-------------------------------------------------------------------------------------------
(Intercept)    -81.773        60.598                 -1.349    0.194    -209.084    45.539 
     height      0.905         0.336        0.536     2.696    0.015       0.200     1.610 
-------------------------------------------------------------------------------------------

💥 Lecture 25 In-class Exercises - Q5-Q8 💥


Use the regression output and the data to answer the following questions.


  • Question 5: What is the correlation, \(r_{xy}\), between mass and height in the Star Wars Humans data?

  • Question 6: How many outliers are there in this data subset model, i.e. observations far from the others?

    • Follow-up: What is a reasonable explanation for these outliers?

    • Hint: Hollywood culture

  • Question 7: Is the slope term, \(\beta_{1}\), significant? Assume \(\alpha = 0.05\).

  • Question 8: If a human character is 190 cm tall, what is their estimated mass?