Housekeeping

HW 4 is due 2/12/2025

Quiz 1 Will Take Place on Thursday 2/20 in class

  • There is an asynchronous option.

FREE Posit Cloud Account

Today’s plan 📋

  • Continue discussion of Linear Regression Models in R.

    • Reading and interpreting regression output

    • Introduction to Multiple Linear Regression

  • New Skills from this week will not be on Quiz 1.

In-class Polling (Session ID: bua345s25)

💥 Lecture 9 In-class Exercise - Q1 💥

  • In lecture 8, we discussed the difference between a line function, f(x), and a simple linear regression model.

  • We discussed how a simple linear regression model looks just like a function,

    • BUT we interpret models differently.

    • Models are a simplification of real-world data.

  • We’ll start today with a model that has a straightforward estimate.

Model in R

hp_cty_mod <- lm(mpg_c ~ hp, data=gt_cars)
hp_cty_mod$coefficients
(Intercept)          hp 
23.93159715 -0.01653247 

\[\hat{y} = 23.9316 - 0.01653247x\]

  • Question 1: What is the City MPG for a vehicle with 800 horsepower?
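
One way to check the answer in R, using the coefficients from the fitted model above (a minimal sketch; the exact value depends on the unrounded coefficients):

b <- hp_cty_mod$coefficients            # (Intercept) and hp slope
unname(b[1] + b[2] * 800)               # estimated City MPG at hp = 800 (roughly 10.7)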

R and RStudio

  • In this course we will use R and RStudio for the predictive analytics lectures.

  • You will access R and RStudio through Posit Cloud.

  • I will post R/RStudio files on Posit Cloud that you can access through the provided links.

  • I will also provide demo videos that show how to access files and complete exercises.

  • NOTE: The free Posit Cloud account is limited to 25 hours per month.

    • I demo how to download completed work so that you can use this allotment efficiently.

  • We will also use Posit Cloud for quiz questions on predictive analytics skills.

  • For those who want to download R and RStudio (not required):

💥 Lecture 9 In-class Exercises - Q2 💥

  • In Lecture 8, we also discussed residuals.

  • Residual: vertical distance between the model line (red line) and the observed y value for an individual observation.

  • Residuals indicate the strength of the overall relationship and whether there are outliers.

    • The smaller the residuals are, the stronger the relationship is.

    • \(Residual = Y_{observed} - \hat{Y}\)

    • \(\hat{Y}\) is the estimated regression value of Y.

\[\hat{y} = 23.9316 - 0.01653247x\]

Model in R

hp_cty_mod <- lm(mpg_c ~ hp, data=gt_cars)
hp_cty_mod$coefficients
(Intercept)          hp 
23.93159715 -0.01653247 
  • Question 2: What is the City MPG residual for the BMW i8?

    • City MPG = 28
    • hp = 357
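
A quick way to check this in R, reusing the coefficients above (a minimal sketch):

b <- hp_cty_mod$coefficients
y_hat <- unname(b[1] + b[2] * 357)      # estimated City MPG for hp = 357 (about 18.03)
28 - y_hat                              # residual = observed - estimated (about 9.97)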

Simple Linear Regression Model

True Population Model

\[y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\]

  • \(\beta_{0}\) is the y-intercept

  • \(\beta_{1}\) is the slope

  • \(e_{i}\) is the unexplained variability in Y

Estimated Sample Data Model

\[\hat{y} = b_{0} + b_{1}x\]

  • \(\hat{y}\) is model estimate of y from x

  • \(b_{0}\) is model estimate of y-intercept

  • \(b_{1}\) is model estimate of slope

  • Each \(e_{i}\) is a residual.

    • y obs. - reg. estimate of y

    • \(e_{i} = y_{i} - \hat{y}_{i}\)

  • Software estimates the model with the smallest sum of squared residuals.

    • minimizes \(\sum_{i=1}^ne_{i}^2\)
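
In R, the residuals and the quantity being minimized can be pulled straight from a fitted model object (a minimal sketch using the hp model from earlier):

e <- residuals(hp_cty_mod)   # one residual per observation
sum(e^2)                     # sum of squared residuals, the quantity lm() minimizes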

Star Wars Example from Lecture 8

The plot and model show the relationship between height and mass for all Star Wars characters for whom data were available.

💥 Lecture 9 In-class Exercises - Q3-Q4 💥

  • Question 3 - Extrapolation

    • Can we use the Star Wars model to estimate the mass of a character that is 260 cm (8.5 feet) tall?


  • Question 4 - Interpolation

    • There are no characters in this dataset that are exactly 140 cm tall. Can we use this model to estimate the mass of a 140 cm (4.6 feet) character?

Model Assumptions and Limitations

  • An SLR model is only valid for straight-line relationships between X and Y.

    • The correlation should also be moderate to strong.

  • If the model is valid:

    • The model CAN be used to interpolate Y within the range of X used to build the model.

    • The model CANNOT be used to extrapolate Y for an X outside of this range.

    • Why? … Because we don’t know if the relationship is the same outside of this range.
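
A quick way to check the valid range and to interpolate in R (a sketch; sw and sw_mod1 are the Star Wars data and model fit on the next slides):

range(sw$height)                                       # range of X used to build the model
predict(sw_mod1, newdata = data.frame(height = 140))   # interpolation: 140 cm is inside that range
# 260 cm is outside that range, so the model should not be used to extrapolate there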

Examining Regression Model Output

Correlation:

cor(sw$height, sw$mass)                # correlation
[1] 0.7508582

Specify Model:

sw_mod1 <- lm(mass ~ height, data=sw)  # specify model

Full Model Output Summary: Each line of the model’s coefficient table is a hypothesis test.

summary(sw_mod1)                       # full model summary

Call:
lm(formula = mass ~ height, data = sw)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.006  -7.804   0.508   4.007  57.901 

Coefficients:
             Estimate Std. Error t value        Pr(>|t|)    
(Intercept) -31.25047   12.81488  -2.439          0.0179 *  
height        0.61273    0.07202   8.508 0.0000000000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.49 on 56 degrees of freedom
Multiple R-squared:  0.5638,    Adjusted R-squared:  0.556 
F-statistic: 72.38 on 1 and 56 DF,  p-value: 0.00000000001138

SLR Model Output - More Readable

“Sig” in the output below is the P-value for each term

\(\hat{Mass}=-31.25+0.613*Height\)

(sw_mod2 <- ols_regress(mass ~ height, data=sw))   # ols_regress() is from the olsrr package
                         Model Summary                           
----------------------------------------------------------------
R                        0.751       RMSE                19.153 
R-Squared                0.564       MSE                366.835 
Adj. R-Squared           0.556       Coef. Var           25.791 
Pred R-Squared           0.537       AIC                513.082 
MAE                     12.868       SBC                519.263 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                 Sum of                                              
                Squares        DF    Mean Square      F         Sig. 
---------------------------------------------------------------------
Regression    27499.012         1      27499.012    72.378    0.0000 
Residual      21276.434        56        379.936                     
Total         48775.446        57                                    
---------------------------------------------------------------------

                                   Parameter Estimates                                     
------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
------------------------------------------------------------------------------------------
(Intercept)    -31.250        12.815                 -2.439    0.018    -56.922    -5.579 
     height      0.613         0.072        0.751     8.508    0.000      0.468     0.757 
------------------------------------------------------------------------------------------

Hypothesis Test on Each Line of Regression Output

  • Each line of the Regression Parameter Estimates table is a two-sided hypothesis test:

(Intercept) Line:

\(H_{0}: \beta_{0} = 0\)

\(H_{A}: \beta_{0} \neq 0\)

  • If the P-value < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{0} \neq 0\)

height Line:

\(H_{0}: \beta_{1} = 0\)

\(H_{A}: \beta_{1} \neq 0\)

  • If the P-value < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{1} \neq 0\)

  • If the slope term is non-zero, and the correlation is moderate to strong:

    • there is a relationship between x and y
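
These P-values can be read directly off the fitted model in R (a minimal sketch using sw_mod1 from above):

summary(sw_mod1)$coefficients          # Estimate, Std. Error, t value, Pr(>|t|) for each line
summary(sw_mod1)$coefficients[, 4]     # just the P-values for the intercept and slope tests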

Model of Star Wars Human Characters

  • Now let’s make a change to the Star Wars data and examine how it changes the correlation and the model.

  • Let’s limit the data to humans only.

Star Wars Humans Regression Model Output

sw1 <- sw |> filter(species=="Human")
cor(sw1$height, sw1$mass)
[1] 0.5363839
(sw1_mod2 <- ols_regress(mass ~ height, data=sw1))
                         Model Summary                          
---------------------------------------------------------------
R                       0.536       RMSE                15.903 
R-Squared               0.288       MSE                252.896 
Adj. R-Squared          0.248       Coef. Var           20.616 
Pred R-Squared          0.083       AIC                173.417 
MAE                     9.758       SBC                176.404 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                
-------------------------------------------------------------------
                Sum of                                             
               Squares        DF    Mean Square      F        Sig. 
-------------------------------------------------------------------
Regression    2042.989         1       2042.989    7.271    0.0148 
Residual      5057.929        18        280.996                    
Total         7100.918        19                                   
-------------------------------------------------------------------

                                    Parameter Estimates                                     
-------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta      t        Sig        lower     upper 
-------------------------------------------------------------------------------------------
(Intercept)    -81.773        60.598                 -1.349    0.194    -209.084    45.539 
     height      0.905         0.336        0.536     2.696    0.015       0.200     1.610 
-------------------------------------------------------------------------------------------

💥 Lecture 9 In-class Exercises - Q5-Q8 💥


Use the regression output and the data to answer the following questions.


  • Question 5: What is the correlation, \(r_{xy}\), between mass and height in the Star Wars Humans data?

  • Question 6: How many outliers, i.e., observations far from the others, are there in this data subset?

  • Question 7: Is the slope term, \(\beta_{1}\), significant? Assume \(\alpha = 0.05\).

  • Question 8: If a human character is 190 cm tall, what is their estimated mass?
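
For Question 8, one way to plug into the humans-only model in R (a sketch; sw1_mod2$betas is the same betas component used later for the house models):

b <- sw1_mod2$betas                 # coefficients from the ols_regress() fit above
unname(b[1] + b[2] * 190)           # estimated mass (kg) for a 190 cm human, roughly 90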

Introduction to Multiple Linear Regression

  • This regression model format can also be used if there are multiple explanatory (X) variables.

  • If a model has more than one X variable, it is a MULTIPLE LINEAR REGRESSION model.

  • We will examine one more dataset today to introduce this concept.

  • First let’s import and examine the data:


real_estate <- read_csv("data/Real_Estate.csv", show_col_types = F)   # read_csv() is from the readr package
head(real_estate)
# A tibble: 6 × 4
   Price Living_Area Bathrooms House_Age
   <dbl>       <dbl>     <dbl>     <dbl>
1 217314        2498       2.5        14
2 238792        2250       2.5        10
3 222330        2712       3           1
4 206688        2284       2.5        17
5  88207        1480       1.5        14
6 236936        2300       2.5        16

💥 Lecture 9 In-class Exercises - Q9 💥

Below is the model output for a regression model relating the size of the living area of a house to its selling price.

What is the estimated selling price of a 2000 sq. ft. house, based on this model?

Round your answer to a whole dollar amount.

(house_mod1 <- ols_regress(Price ~ Living_Area, data=real_estate))
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.772       RMSE                    45426.628 
R-Squared                   0.596       MSE                2063578544.951 
Adj. R-Squared              0.594       Coef. Var                  27.670 
Pred R-Squared              0.579       AIC                      4863.117 
MAE                     31692.288       SBC                      4873.012 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     609852999259.857          1    609852999259.857    292.576    0.0000 
Residual       412715708990.143        198      2084422772.677                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                       Parameter Estimates                                        
-------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta      t        Sig         lower        upper 
-------------------------------------------------------------------------------------------------
(Intercept)    16505.199      9262.237                  1.782    0.076    -1760.095    34770.493 
Living_Area       82.588         4.828        0.772    17.105    0.000       73.066       92.110 
-------------------------------------------------------------------------------------------------

💥 Lecture 9 In-class Exercises - Q9 (cont’d) 💥

Focus on the Parameter Estimates table to answer this question:

house_mod1$betas |> round(3)
(Intercept) Living_Area 
  16505.199      82.588 

Regression Output Interpretation

  • \(Est. Selling Price = 16505.199 + 82.588\times Living Area\)
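
Plugging 2000 sq. ft. into this equation in R (a minimal sketch using the betas extracted above):

b <- house_mod1$betas
unname(b[1] + b[2] * 2000)     # estimated selling price for a 2000 sq. ft. house (about $181,681)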

Limitations of Simple Linear Regression

Simple Linear Regression - One X variable

In this case, X is the size of the living area.

This model says that, regardless of other factors,

  • a 2500 sq. ft. house has an estimated selling price of about $222,975.

  • The model ignores number of bathrooms, age of house, etc.

  • These factors may also be helpful in explaining selling price.


  • The correlation between Living Area and Selling Price is 0.77.

  • This is a strong correlation, but maybe we can explain more of the variability in the data.

Simple Linear Regression vs. Multiple Linear Regression

Transitioning from SLR to MLR is Straightforward

  • In R and most other software, adding a variable to our model is as simple as addition (see the sketch after this list).

  • The challenge is interpretation because we can no longer visualize the model.

  • There are 3-D visualization tools in R, BUT they are not always helpful.

  • Instead, I recommend extending the SLR model output interpretation to each variable in the model.

  • On the next slide we’ll add the number of bathrooms.

    • Spoiler: Number of bathrooms is a huge deal when buying a house.
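
In base R, the change is literally one + in the model formula (a minimal sketch; the olsrr version with full output is on the next slide):

lm(Price ~ Living_Area, data = real_estate)                 # SLR: one X variable
lm(Price ~ Living_Area + Bathrooms, data = real_estate)     # MLR: add a second X with "+"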

MLR Model with Two Variables

(house_mod2 <- ols_regress(Price ~ Living_Area + Bathrooms, data=real_estate))
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.815       RMSE                    41412.317 
R-Squared                   0.665       MSE                1714980011.473 
Adj. R-Squared              0.661       Coef. Var                  25.289 
Pred R-Squared              0.640       AIC                      4828.109 
MAE                     30629.922       SBC                      4841.302 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     679572705955.336          2    339786352977.668    195.157    0.0000 
Residual       342996002294.664        197      1741096458.349                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                        Parameter Estimates                                         
---------------------------------------------------------------------------------------------------
      model          Beta    Std. Error    Std. Beta      t        Sig          lower        upper 
---------------------------------------------------------------------------------------------------
(Intercept)    -11553.295      9556.111                 -1.209    0.228    -30398.701     7292.110 
Living_Area        58.047         5.875        0.543     9.881    0.000        46.462       69.633 
  Bathrooms     38141.447      6027.411        0.348     6.328    0.000     26254.916    50027.977 
---------------------------------------------------------------------------------------------------

A closer look at the Parameter Estimates

Interpreting the New Model

Model: \[ Est. Selling Price = -11553.295 + 58.047\times Living Area + 38141.447 \times Bathrooms \]

Interpretation:

  • If number of bathrooms remains unchanged, each additional square foot is estimated to raise the selling price by about 58 dollars.

  • If living area remains unchanged, each additional bathroom is estimated to raise the selling price by about 38 THOUSAND dollars.
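
These interpretations come straight from the individual coefficients: holding the other variable fixed, the estimated change is just the coefficient of the variable being changed (a minimal sketch using the betas component of the olsrr fit):

b <- house_mod2$betas
b["Living_Area"]    # estimated price change per extra sq. ft., bathrooms held fixed (about 58)
b["Bathrooms"]      # estimated price change per extra bathroom, living area held fixed (about 38,141)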

💥 Preview of Next Lecture 💥

Based on this model, if a house is renovated to increase the square footage by 1000 square feet and two bathrooms are added, what would be the estimated change in price?

Round your answer to a whole dollar amount.


Model: \[ Est. Selling Price = -11553.295 + 58.047\times Living Area + 38141.447 \times Bathrooms \]


house_mod2$betas |> round(3)
(Intercept) Living_Area   Bathrooms 
 -11553.295      58.047   38141.447 
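
One way to set up this calculation in R (a sketch; only the coefficients of the variables that change are needed):

b <- house_mod2$betas
unname(1000 * b["Living_Area"] + 2 * b["Bathrooms"])   # estimated change in price for +1000 sq. ft. and +2 bathrooms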

Adding ANOTHER Term to our MLR

Next, we add age of the house to the model:

(house_mod3 <- ols_regress(Price ~ Living_Area + Bathrooms + House_Age, data=real_estate))
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.821       RMSE                    40864.224 
R-Squared                   0.673       MSE                1669884825.573 
Adj. R-Squared              0.668       Coef. Var                  25.018 
Pred R-Squared              0.641       AIC                      4824.780 
MAE                     30119.407       SBC                      4841.271 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     688591743135.442          3    229530581045.147    134.704    0.0000 
Residual       333976965114.558        196      1703964107.727                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta      t        Sig          lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)     5775.299     12087.330                  0.478    0.633    -18062.622    29613.220 
Living_Area       60.614         5.918        0.567    10.243    0.000        48.943       72.285 
  Bathrooms    30089.928      6913.944        0.274     4.352    0.000     16454.654    43725.201 
  House_Age     -235.721       102.458       -0.112    -2.301    0.022      -437.783      -33.658 
--------------------------------------------------------------------------------------------------

Examining the new model

Hopefully, the interpretation will seem redundant at this point…

The New Model

Model: \[ Est. Selling Price = 5775.299 + 60.614\times Living Area + 30089.928 \times Bathrooms - 235.721\times House Age \]

Interpretation:

  • If number of bathrooms and age of the house remain unchanged, each additional square foot is estimated to raise the selling price by about 61 dollars.

  • If living area and age of the house remain unchanged, each additional bathroom is estimated to raise the selling price by about 30 THOUSAND dollars.

  • If living area and number of bathrooms remain unchanged, each additional year of age is estimated to LOWER the selling price by about 236 dollars.

💥 Preview of Next Lecture 💥


What is the estimated price of a house that is 2500 square feet, has 4 bathrooms, and is 20 years old?


house_mod3$betas |> round(3)
(Intercept) Living_Area   Bathrooms   House_Age 
   5775.299      60.614   30089.928    -235.721
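
One way to set up this estimate in R (a sketch using the betas above; the same result could come from predict() on an equivalent lm() fit):

b <- house_mod3$betas
unname(b["(Intercept)"] + b["Living_Area"] * 2500 + b["Bathrooms"] * 4 + b["House_Age"] * 20)
# estimated selling price for a 2500 sq. ft., 4-bathroom, 20-year-old house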