BUA 345 - Lecture 9

More about Linear Regression Models in R

Penelope Pooler Eisenbies

2025-02-13

Housekeeping

HW 4 is due 2/12/2025

Quiz 1 Will Take Place on Thursday 2/20 in class

There is an asynchronous option.

FREE Posit Cloud Account

Today’s plan 📋

Continue discussion of Linear Regression Models in R.
- Reading and interpreting regression output
- Introduction to Multiple Linear Regression
New Skills from this week will not be on Quiz 1.

In-class Polling (Session ID: bua345s25)

💥 Lecture 9 In-class Exercise - Q1 💥

In lecture 8, we discussed the difference between a line function, f(x), and a simple linear regression model.
We discussed how a simple linear regression model looks just like a function,
- BUT we interpret models differently.
- Models are a simplification of real-world data.
We’ll start today with a straightforward model estimate.

Model in R

hp_cty_mod <- lm(mpg_c ~ hp, data=gt_cars)
hp_cty_mod$coefficients

(Intercept)          hp 
23.93159715 -0.01653247

\[\hat{y} = 23.9316 - 0.01653247x\]

Question 1: What is the estimated City MPG for a vehicle with 800 horsepower?
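
A minimal sketch of how a fitted line turns an x value into an estimate. The coefficients are copied from the output above; the 300 hp vehicle is a made-up input for illustration, not a row from gt_cars.

```r
# Coefficients copied from the hp_cty_mod output above
b0 <- 23.93159715   # intercept
b1 <- -0.01653247   # slope for hp

# Hypothetical vehicle (assumption, not taken from the data)
hp_new <- 300

# Plug x into y-hat = b0 + b1 * x
mpg_est <- b0 + b1 * hp_new
round(mpg_est, 2)
```

If the fitted model object is available in your session, `predict(hp_cty_mod, newdata = data.frame(hp = 300))` should return the same estimate.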

R and RStudio

In this course we will use R and RStudio for the predictive analytics lectures.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I demo how to download completed work so that you can use this allotment efficiently.
We will also use Posit Cloud for quiz questions on predictive analytics skills.
For those who want to download R and RStudio (not required):
- There is an information page on my course website, Installing R and RStudio

💥 Lecture 9 In-class Exercises - Q2 💥

In Lecture 8, we also discussed residuals.
Residual: vertical distance between the model line (red line) and the observed y value for an individual observation.
Residuals indicate strength of the overall relationship and if there are outliers.
- The smaller the residuals are, the stronger the relationship is.
- \(Residual = Y_{observed} - \hat{Y}\)
- \(\hat{Y}\) is the estimated regression value of Y.

\[\hat{y} = 23.9316 - 0.01653247x\]

Model in R

hp_cty_mod <- lm(mpg_c ~ hp, data=gt_cars)
hp_cty_mod$coefficients

(Intercept)          hp 
23.93159715 -0.01653247

Question 2: What is the City MPG residual for the BMW i8?
- City MPG = 28
- hp = 357
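
The residual formula can be sketched in a few lines of R. The coefficients come from the output above; the observed values below belong to a hypothetical vehicle chosen only to show the steps.

```r
# Coefficients copied from the hp_cty_mod output above
b0 <- 23.93159715
b1 <- -0.01653247

# Hypothetical observation (assumption, not a real row of gt_cars)
mpg_obs <- 20    # observed City MPG
hp_obs  <- 400   # horsepower

# Step 1: model estimate of Y at this x
mpg_hat <- b0 + b1 * hp_obs

# Step 2: residual = observed Y - estimated Y
resid_val <- mpg_obs - mpg_hat
round(c(estimate = mpg_hat, residual = resid_val), 3)
```

A positive residual means the observed point sits above the model line; a negative residual means it sits below.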

Simple Linear Regression Model

True Population Model

\[y_{i} = \beta_{0} + \beta_{1}x_{i} + e_{i}\]

\(\beta_{0}\) is the y-intercept
\(\beta_{1}\) is the slope
\(e\) is the unexplained variability in Y

Estimated Sample Data Model

\[\hat{y} = b_{0} + b_{1}x\]

\(\hat{y}\) is model estimate of y from x
\(b_{0}\) is model estimate of y-intercept
\(b_{1}\) is model estimate of slope

Each \(e_{i}\) is a residual.
- y obs. - reg. estimate of y
- \(e_{i} = y_{i} - \hat{y}_{i}\)
Software estimates the model with the smallest sum of squared residuals
- minimizes \(\sum_{i=1}^ne_{i}^2\)
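
One way to see what "smallest sum of squared residuals" means is to compare the fitted line with slightly shifted lines. The data below are made up for illustration; only `lm()`, `coef()`, and `resid()` from base R are used.

```r
# Made-up data for illustration only
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

fit <- lm(y ~ x)
b <- unname(coef(fit))   # b[1] = intercept, b[2] = slope

# Sum of squared residuals for any candidate line
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

sse_fit             <- sse(b[1], b[2])        # least-squares line
sse_shift_intercept <- sse(b[1] + 0.5, b[2])  # same slope, shifted up
sse_shift_slope     <- sse(b[1], b[2] + 0.1)  # same intercept, steeper

c(fitted = sse_fit,
  shifted_intercept = sse_shift_intercept,
  shifted_slope = sse_shift_slope)
```

Any line other than the fitted one gives a larger sum, which is why the software reports the fitted coefficients.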

Star Wars Example from Lecture 8

The plot and model show the relationship between height and mass for all Star Wars characters for whom data were available.

💥 Lecture 9 In-class Exercises - Q3-Q4 💥

Question 3 - Extrapolation
- Can we use the Star Wars model to estimate the mass of a character that is 260 cm (8.5 feet) tall?

Question 4 - Interpolation
- There are no characters in this dataset that are exactly 140 cm tall. Can we use this model to estimate the mass of a 140 cm (4.6 feet) character?

Model Assumptions and Limitations

An SLR model is only valid for straight-line relationships between X and Y.
- Correlation should also be moderate to strong
If model is valid:
- Model CAN be used to interpolate Y within the range of X used to build model.
- Model CANNOT be used to extrapolate Y for an X outside of this range.
- Why? … Because we don’t know if relationship is the same outside of this range.
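
A small sketch of the range check behind the interpolation rule. The x values and the helper name `in_model_range()` are hypothetical; they stand in for whatever X values were used to build a model.

```r
# Made-up X values standing in for the data used to fit a model
x_used <- c(12, 18, 25, 31, 40, 47)

# Hypothetical helper: TRUE if x_new lies within the range of x_fit
in_model_range <- function(x_new, x_fit) {
  x_new >= min(x_fit) & x_new <= max(x_fit)
}

in_model_range(30, x_used)   # TRUE: interpolation, even though 30 is not in the data
in_model_range(75, x_used)   # FALSE: extrapolation
```

With real data, `range(x)` on the model's X variable gives the limits to compare against.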

Examining Regression Model Output

Correlation:

cor(sw$height, sw$mass)                # correlation

[1] 0.7508582

Specify Model:

sw_mod1 <- lm(mass ~ height, data=sw)  # specify model

Full Model Output Summary: Each line of model table is a hypothesis test.

summary(sw_mod1)                       # full model summary


Call:
lm(formula = mass ~ height, data = sw)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.006  -7.804   0.508   4.007  57.901 

Coefficients:
             Estimate Std. Error t value        Pr(>|t|)    
(Intercept) -31.25047   12.81488  -2.439          0.0179 *  
height        0.61273    0.07202   8.508 0.0000000000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.49 on 56 degrees of freedom
Multiple R-squared:  0.5638,    Adjusted R-squared:  0.556 
F-statistic: 72.38 on 1 and 56 DF,  p-value: 0.00000000001138
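
The pieces of a `summary()` table can also be pulled out individually. The sketch below uses made-up data (not the Star Wars data) and shows that, for a simple linear regression, Multiple R-squared equals the squared correlation. The last line checks the same relationship with the values reported above.

```r
# Made-up data for illustration only
x <- c(150, 160, 165, 172, 180, 188, 195, 201)
y <- c(48, 60, 55, 70, 77, 80, 95, 92)

fit <- lm(y ~ x)
s <- summary(fit)

s$coefficients   # Estimate, Std. Error, t value, Pr(>|t|) for each term
s$sigma          # residual standard error
s$r.squared      # Multiple R-squared
cor(x, y)^2      # same value as s$r.squared in SLR

# Values reported above: correlation 0.7508582, R-squared 0.5638
round(0.7508582^2, 4)
```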

SLR Model Output - More Readable

Sig below is the P-value for each term

\(\widehat{Mass} = -31.25 + 0.613 \times Height\)

(sw_mod2 <- ols_regress(mass ~ height, data=sw))

                         Model Summary                           
----------------------------------------------------------------
R                        0.751       RMSE                19.153 
R-Squared                0.564       MSE                366.835 
Adj. R-Squared           0.556       Coef. Var           25.791 
Pred R-Squared           0.537       AIC                513.082 
MAE                     12.868       SBC                519.263 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                 Sum of                                              
                Squares        DF    Mean Square      F         Sig. 
---------------------------------------------------------------------
Regression    27499.012         1      27499.012    72.378    0.0000 
Residual      21276.434        56        379.936                     
Total         48775.446        57                                    
---------------------------------------------------------------------

                                   Parameter Estimates                                     
------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta      t        Sig       lower     upper 
------------------------------------------------------------------------------------------
(Intercept)    -31.250        12.815                 -2.439    0.018    -56.922    -5.579 
     height      0.613         0.072        0.751     8.508    0.000      0.468     0.757 
------------------------------------------------------------------------------------------
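
The lower and upper columns in the Parameter Estimates table appear to be 95% confidence limits for each coefficient. That is an assumption based on the numbers, not something stated in the output. A sketch that approximately reproduces the height row from the estimate, standard error, and residual degrees of freedom reported by `summary(sw_mod1)`:

```r
# Values copied from the summary(sw_mod1) output
est <- 0.61273   # slope estimate for height
se  <- 0.07202   # its standard error
df  <- 56        # residual degrees of freedom

# Assumption: limits are estimate +/- t critical value * SE at 95% confidence
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 3)
```

If the fitted `lm` object is available, `confint(sw_mod1)` computes the same kind of interval directly.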

Hypothesis Test on Each Line of Regression Output

Each line of the Regression Parameter Estimates table is a two-sided hypothesis test:

(Intercept) Line:

\(H_{0}: \beta_{0} = 0\)

\(H_{A}: \beta_{0} \neq 0\)

If the P-value < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{0} \neq 0\)

height Line:

\(H_{0}: \beta_{1} = 0\)

\(H_{A}: \beta_{1} \neq 0\)

If the P-value < \(\alpha\), we reject \(H_{0}\) and conclude that \(\beta_{1} \neq 0\)
If the slope term is non-zero, and the correlation is moderate to strong:
- there is a relationship between x and y
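
The t value and P-value on each line can be recomputed from the estimate and its standard error. This sketch uses the numbers printed by `summary(sw_mod1)`; small differences from the table are expected because the inputs are rounded.

```r
df <- 56   # residual degrees of freedom from the summary output

# height line: H0 is beta_1 = 0
t_height <- 0.61273 / 0.07202
p_height <- 2 * pt(-abs(t_height), df)   # two-sided P-value

# (Intercept) line: H0 is beta_0 = 0
t_int <- -31.25047 / 12.81488
p_int <- 2 * pt(-abs(t_int), df)

alpha <- 0.05
c(t_height = t_height, t_int = t_int)
c(reject_slope = p_height < alpha, reject_intercept = p_int < alpha)
```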

Model of Star Wars Human Characters

Now let’s make a change to the Star Wars data and examine how it changes the correlation and the model.
Let’s limit the data to humans only.

Star Wars Humans Regression Model Output

sw1 <- sw |> filter(species=="Human")
cor(sw1$height, sw1$mass)

[1] 0.5363839

(sw1_mod2 <- ols_regress(mass ~ height, data=sw1))

                         Model Summary                          
---------------------------------------------------------------
R                       0.536       RMSE                15.903 
R-Squared               0.288       MSE                252.896 
Adj. R-Squared          0.248       Coef. Var           20.616 
Pred R-Squared          0.083       AIC                173.417 
MAE                     9.758       SBC                176.404 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                
-------------------------------------------------------------------
                Sum of                                             
               Squares        DF    Mean Square      F        Sig. 
-------------------------------------------------------------------
Regression    2042.989         1       2042.989    7.271    0.0148 
Residual      5057.929        18        280.996                    
Total         7100.918        19                                   
-------------------------------------------------------------------

                                    Parameter Estimates                                     
-------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta      t        Sig        lower     upper 
-------------------------------------------------------------------------------------------
(Intercept)    -81.773        60.598                 -1.349    0.194    -209.084    45.539 
     height      0.905         0.336        0.536     2.696    0.015       0.200     1.610 
-------------------------------------------------------------------------------------------

💥 Lecture 9 In-class Exercises - Q5-Q8 💥

Use the regression output and the data to answer the following questions.

Question 5: What is the correlation, \(r_{xy}\) between mass and height in the Star Wars Humans data?
Question 6: How many outliers, i.e. observations far from the others, are there in this data subset?
Question 7: Is the slope term, \(\beta_{1}\), significant? Assume \(\alpha = 0.05\).
Question 8: If a human character is 190 cm tall, what is their estimated mass?

Introduction to Multiple Linear Regression

This regression model format can also be used if there are multiple explanatory (X) variables.
If a model has more than one X variable, it is a MULTIPLE LINEAR REGRESSION model.
We will examine one more dataset today to introduce this concept.
First let’s import and examine the data:

real_estate <- read_csv("data/Real_Estate.csv", show_col_types = FALSE)  # read_csv() is from readr
head(real_estate)

# A tibble: 6 × 4
   Price Living_Area Bathrooms House_Age
   <dbl>       <dbl>     <dbl>     <dbl>
1 217314        2498       2.5        14
2 238792        2250       2.5        10
3 222330        2712       3           1
4 206688        2284       2.5        17
5  88207        1480       1.5        14
6 236936        2300       2.5        16

💥 Lecture 9 In-class Exercises - Q9 💥

Below is the model output for a regression model relating the size of the living area of a house to its selling price.

What is the estimated selling price of a 2000 sq. ft. house, based on this model?

Round your answer to a whole dollar amount.

(house_mod1 <- ols_regress(Price ~ Living_Area, data=real_estate))

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.772       RMSE                    45426.628 
R-Squared                   0.596       MSE                2063578544.951 
Adj. R-Squared              0.594       Coef. Var                  27.670 
Pred R-Squared              0.579       AIC                      4863.117 
MAE                     31692.288       SBC                      4873.012 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     609852999259.857          1    609852999259.857    292.576    0.0000 
Residual       412715708990.143        198      2084422772.677                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                       Parameter Estimates                                        
-------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta      t        Sig         lower        upper 
-------------------------------------------------------------------------------------------------
(Intercept)    16505.199      9262.237                  1.782    0.076    -1760.095    34770.493 
Living_Area       82.588         4.828        0.772    17.105    0.000       73.066       92.110 
-------------------------------------------------------------------------------------------------

💥 Lecture 9 In-class Exercises - Q9 cont’d 💥

Focus on the Parameter Estimates table to answer this question:

house_mod1$betas |> round(3)

(Intercept) Living_Area 
  16505.199      82.588

Regression Output Interpretation

\(\text{Est. Selling Price} = 16505.199 + 82.588 \times \text{Living Area}\)

Limitations of Simple Linear Regression

Simple Linear Regression - One X variable

In this case, X is the size of the living area.

This model says that, regardless of other factors, a 2500 sq. ft. house has an estimated selling price of $222,975.
The model ignores number of bathrooms, age of house, etc.
These factors may also be helpful in explaining selling price.
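
The 2500 sq. ft. figure can be checked directly from the coefficients reported by `house_mod1$betas`. A minimal sketch:

```r
# Coefficients copied from the house_mod1 output
b <- c(intercept = 16505.199, living_area = 82.588)

# Estimated selling price for a 2500 sq. ft. house
est_price <- b[["intercept"]] + b[["living_area"]] * 2500
round(est_price)
```

Every 2500 sq. ft. house gets this same estimate, whatever its age or number of bathrooms, which is the limitation this slide describes.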

Correlation between Living Area and Selling Price is 0.77.
This is a strong correlation, but maybe we can explain more of the variability in the data.

Simple Linear Regression vs. Multiple Linear Regression

Transitioning from SLR to MLR is Straightforward

In R and most software, adding a variable to our model is as simple as addition.
The challenge is interpretation because we can no longer visualize the model.
There are 3-D visualization tools in R, BUT they are not always helpful.
Instead, I recommend extending the SLR model output interpretation to the variables in the model.
On the next slide we’ll add number of bathrooms.
- Spoiler: Number of bathrooms is a huge deal when buying a house.
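
A sketch of what "as simple as addition" looks like in a model formula. To stay self-contained, it uses base R `lm()` and only the six rows shown by `head(real_estate)`, so the estimates will NOT match the full-data output in the slides. The point is the formula syntax, which is the same one passed to `ols_regress()`.

```r
# First six rows of real_estate, copied from the head() output
homes <- data.frame(
  Price       = c(217314, 238792, 222330, 206688, 88207, 236936),
  Living_Area = c(2498, 2250, 2712, 2284, 1480, 2300),
  Bathrooms   = c(2.5, 2.5, 3, 2.5, 1.5, 2.5),
  House_Age   = c(14, 10, 1, 17, 14, 16)
)

# SLR: one X variable
slr <- lm(Price ~ Living_Area, data = homes)

# MLR: add a second X variable with +
mlr <- lm(Price ~ Living_Area + Bathrooms, data = homes)

names(coef(slr))   # intercept plus one slope
names(coef(mlr))   # intercept plus two slopes
```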

MLR Model with Two Variables

(house_mod2 <- ols_regress(Price ~ Living_Area + Bathrooms, data=real_estate))

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.815       RMSE                    41412.317 
R-Squared                   0.665       MSE                1714980011.473 
Adj. R-Squared              0.661       Coef. Var                  25.289 
Pred R-Squared              0.640       AIC                      4828.109 
MAE                     30629.922       SBC                      4841.302 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     679572705955.336          2    339786352977.668    195.157    0.0000 
Residual       342996002294.664        197      1741096458.349                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                        Parameter Estimates                                         
---------------------------------------------------------------------------------------------------
      model          Beta    Std. Error    Std. Beta      t        Sig          lower        upper 
---------------------------------------------------------------------------------------------------
(Intercept)    -11553.295      9556.111                 -1.209    0.228    -30398.701     7292.110 
Living_Area        58.047         5.875        0.543     9.881    0.000        46.462       69.633 
  Bathrooms     38141.447      6027.411        0.348     6.328    0.000     26254.916    50027.977 
---------------------------------------------------------------------------------------------------
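
Several Model Summary values can be recovered from the ANOVA table, which helps when comparing models. The sketch below uses the sums of squares printed above and the standard formulas for R-squared, adjusted R-squared, and F. The variable names are illustrative.

```r
# Sums of squares copied from the house_mod2 ANOVA table
ss_reg <- 679572705955.336
ss_res <- 342996002294.664
ss_tot <- 1022568708250.000

n <- 200   # total DF (199) + 1
p <- 2     # number of X variables

r2     <- ss_reg / ss_tot                                   # R-Squared
adj_r2 <- 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))   # Adj. R-Squared
f_stat <- (ss_reg / p) / (ss_res / (n - p - 1))             # F

round(c(R = sqrt(r2), R2 = r2, Adj_R2 = adj_r2, F = f_stat), 3)
```

Adjusted R-squared penalizes each added variable, so it is the fairer number to compare against the one-variable model.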

A closer look at the Parameter Estimates

Interpreting the New Model

Model: \[ \text{Est. Selling Price} = -11553.295 + 58.047 \times \text{Living Area} + 38141.447 \times \text{Bathrooms} \]

Interpretation:

If number of bathrooms remains unchanged, each additional square foot is estimated to raise the selling price by about 58 dollars.
If living area remains unchanged, each additional bathroom will raise the estimated selling price by about 38 THOUSAND dollars.
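
The "remains unchanged" wording can be made concrete by comparing two estimates that differ in only one variable. The coefficients are from the output above; the baseline house (2000 sq. ft., 2 bathrooms) and the function name `est_price()` are hypothetical.

```r
# Coefficients copied from the house_mod2 output
b <- c(-11553.295, 58.047, 38141.447)

# Illustrative helper: estimated price from living area and bathrooms
est_price <- function(living_area, bathrooms) {
  b[1] + b[2] * living_area + b[3] * bathrooms
}

base <- est_price(2000, 2)          # hypothetical baseline house

diff_sqft <- est_price(2001, 2) - base   # one more sq. ft., bathrooms fixed
diff_bath <- est_price(2000, 3) - base   # one more bathroom, living area fixed

round(c(per_sq_ft = diff_sqft, per_bathroom = diff_bath), 3)
```

Each difference equals the corresponding coefficient, whatever baseline house is chosen.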

💥 Preview of Next Lecture 💥

Based on this model, if a house is renovated to increase the square footage by 1000 square feet and two bathrooms are added, what would be the estimated change in price?

Round your answer to a whole dollar amount.

Model: \[ \text{Est. Selling Price} = -11553.295 + 58.047 \times \text{Living Area} + 38141.447 \times \text{Bathrooms} \]

house_mod2$betas |> round(3)

(Intercept) Living_Area   Bathrooms 
 -11553.295      58.047   38141.447

Adding ANOTHER Term to our MLR

Next, we add age of the house to the model:

(house_mod3 <- ols_regress(Price ~ Living_Area + Bathrooms + House_Age, data=real_estate))

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.821       RMSE                    40864.224 
R-Squared                   0.673       MSE                1669884825.573 
Adj. R-Squared              0.668       Coef. Var                  25.018 
Pred R-Squared              0.641       AIC                      4824.780 
MAE                     30119.407       SBC                      4841.271 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     688591743135.442          3    229530581045.147    134.704    0.0000 
Residual       333976965114.558        196      1703964107.727                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta      t        Sig          lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)     5775.299     12087.330                  0.478    0.633    -18062.622    29613.220 
Living_Area       60.614         5.918        0.567    10.243    0.000        48.943       72.285 
  Bathrooms    30089.928      6913.944        0.274     4.352    0.000     16454.654    43725.201 
  House_Age     -235.721       102.458       -0.112    -2.301    0.022      -437.783      -33.658 
--------------------------------------------------------------------------------------------------

Examining the new model

Hopefully, the interpretation will seem redundant at this point…

The New Model

Model: \[ \text{Est. Selling Price} = 5775.299 + 60.614 \times \text{Living Area} + 30089.928 \times \text{Bathrooms} - 235.721 \times \text{House Age} \]

Interpretation:

If number of bathrooms and age of the house remain unchanged, each additional square foot is estimated to raise the selling price by about 61 dollars.
If living area and age of the house remain unchanged, each additional bathroom will raise the estimated selling price by about 30 THOUSAND dollars.
If living area and number of bathrooms remain unchanged, each additional year will LOWER the estimated selling price by about 236 dollars.
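
With three X variables, an estimate is still the intercept plus each coefficient times its variable. A sketch for a hypothetical house (1800 sq. ft., 2 bathrooms, 10 years old), using the coefficients above:

```r
# Coefficients copied from the house_mod3 output
b <- c(intercept = 5775.299, living_area = 60.614,
       bathrooms = 30089.928, house_age = -235.721)

# Hypothetical house (assumption); the leading 1 multiplies the intercept
house <- c(1, 1800, 2, 10)

# Multiply term by term and add up
est_price <- sum(b * house)
round(est_price)
```

Changing any entry of `house` and rerunning shows how each variable moves the estimate while the others stay fixed.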

💥 Preview of Next Lecture 💥

What is the estimated price of a house that is 2500 square feet with 4 bathrooms and is 20 years old?

house_mod3$betas |> round(3)

(Intercept) Living_Area   Bathrooms   House_Age 
   5775.299      60.614   30089.928    -235.721

Key Points from Today

Simple linear regression (SLR) models are similar in format to the function of a line.
The interpretation is different because SLR models are a simplification of the real world.
A model is only valid for the range of data used to create it.
- Outside of that range, we are extrapolating, which is invalid.
Regression model output includes hypothesis tests of each model coefficient.
- For SLR, the hypothesis test of \(\beta_{1}\) is an indication of the validity of the model.
Multiple Linear Regression (MLR) is an extension of SLR where we ADD more variables to the model.
- For MLR, the hypothesis test of each \(\beta\) is an indication of whether or not that variable is useful to the model.

HW 4 is due 2/12/2025

To submit an Engagement Question or Comment about material from Lecture 9: Submit it by midnight today (day of lecture).