Lecture 12 - Multiple Linear Regression Models in R

Penelope Pooler Eisenbies
BUA 345

2024-02-22

Housekeeping

  • Today’s plan 📋

    • Review of SLR and MLR model assumptions

    • Review of Normal Distribution

    • Review of LN Transformation

    • SLR Model Output and Multiple Linear Regression (MLR)

    • Examining Regression Model Output

      • Understanding hypotheses being tested

      • Interpreting regression model output

    • Introduction to Multiple Linear Regression

      • Adding to a model

      • Interpreting model output

    • Working Through HW 5

Regression Model Assumptions

  1. Review: For simple linear regression, there must be a linear relationship between X, the explanatory variable, and Y, the response variable.

  2. The response variable must be approximately normally distributed.

    • Recall that normally distributed means symmetric and bell-shaped.

    • What if it’s not?

    • One common solution is to transform the response variable.

Financial data such as real estate data, prices, etc. are commonly right-skewed.

  • A good transformation for right-skewed data is the Natural Log (LN) Transformation.

  • In HW 5 we work through:

    • How the LN transformation ‘normalizes’ the distribution of the response.

    • How to ‘back-transform’ model results to return to original scale of the data, e.g. US dollars.
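As a minimal sketch of these two steps (this is not the HW 5 code), the base R example below applies the LN transformation to simulated right-skewed values and then back-transforms them. The simulated data and the names `charges`, `ln_charges`, and `back_transformed` are assumptions made for illustration only.

```r
# Simulated right-skewed data (hypothetical; stands in for dollar amounts)
set.seed(345)
charges <- rlnorm(500, meanlog = 9, sdlog = 1)

# LN transformation: log() in R is the natural log by default
ln_charges <- log(charges)

# In right-skewed data the mean sits well above the median;
# after the LN transformation the two are much closer together
c(mean = mean(charges), median = median(charges))
c(mean = mean(ln_charges), median = median(ln_charges))

# Back-transform with exp() to return to the original scale (e.g. US dollars)
back_transformed <- exp(ln_charges)
all.equal(back_transformed, charges)
```

Running `hist(charges)` and `hist(ln_charges)` on the same data is a quick way to see the change in shape.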

Review of Histograms of Different Distributions

Histograms are an effective tool for examining the distribution of the data.

LEFT SKEWED

Tail pulled out to LEFT

Low outliers

e.g. Human Lifespan

NORMAL/SYMMETRIC

Data appear in a symmetric bell-shaped curve

No graphic evidence of outliers

e.g. Test scores

RIGHT SKEWED

Tail pulled out to RIGHT

High outliers

e.g. Real Estate Data

💥 Lecture 12 In-class Exercises - Q1 - Review 💥

Below is the model output for a regression model relating the size of the living area of a house to its selling price.

What is the estimated selling price of a 2300 sq. ft. house, based on this model?

Round your answer to a whole dollar amount.

library(readr)   # read_csv()
library(olsrr)   # ols_regress()
real_estate <- read_csv("data/Real_Estate.csv", show_col_types = FALSE)

(house_mod1 <- ols_regress(Price ~ Living_Area, data=real_estate))
                            Model Summary                              
----------------------------------------------------------------------
R                       0.772       RMSE                    45655.479 
R-Squared               0.596       Coef. Var                  27.670 
Adj. R-Squared          0.594       MSE                2084422772.677 
Pred R-Squared          0.579       MAE                     31692.288 
----------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     609852999259.857          1    609852999259.857    292.576    0.0000 
Residual       412715708990.143        198      2084422772.677                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                       Parameter Estimates                                        
-------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta      t        Sig         lower        upper 
-------------------------------------------------------------------------------------------------
(Intercept)    16505.199      9262.237                  1.782    0.076    -1760.095    34770.493 
Living_Area       82.588         4.828        0.772    17.105    0.000       73.066       92.110 
-------------------------------------------------------------------------------------------------

💥 Lecture 12 In-class Exercises - Q1 cont’d 💥

Focus on the Parameter Estimates table to answer this question:

house_mod1$betas |> round(3)
(Intercept) Living_Area 
  16505.199      82.588 

Regression Output Interpretation

  • \(\text{Est. Selling Price} = 16505.199 + 82.588 \times \text{Living Area}\)

Limitations of Simple Linear Regression

Simple Linear Regression - One X variable

In this case, X is the size of the living area.

This model says that, regardless of other factors,

  • a 2500 sq. ft. house has an estimated selling price of $222,975.

  • The model ignores number of bathrooms, age of house, etc.

  • These factors may also be helpful in explaining selling price.
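The estimate for a 2500 sq. ft. house can be reproduced by plugging the living area into the fitted equation. The sketch below copies the two coefficients from the Parameter Estimates table; the names `b0`, `b1`, `living_area`, and `est_price` are illustrative and not part of the lecture code.

```r
# Coefficients copied from the Parameter Estimates table for house_mod1
b0 <- 16505.199   # (Intercept)
b1 <- 82.588      # Living_Area

# Estimated selling price for a 2500 sq. ft. house
living_area <- 2500
est_price <- b0 + b1 * living_area
round(est_price)
```

Changing `living_area` gives the estimate for a house of any other size, with the same caveat that nothing but living area enters this model.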


  • Correlation between Living Area and Selling price is 0.77

  • This is a strong correlation, but maybe we can explain more of the variability in the data.
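To put a number on "explaining variability", R-Squared can be recomputed from the ANOVA table of the simple model as the regression sum of squares divided by the total sum of squares. The sketch below copies those two values from the output above; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA table for house_mod1
ss_regression <- 609852999259.857
ss_total      <- 1022568708250.000

# R-Squared: proportion of the variability in Price explained by the model
r_squared <- ss_regression / ss_total
round(r_squared, 3)

# In SLR, the correlation is the square root of R-Squared
# (taking the sign of the slope, which is positive here)
round(sqrt(r_squared), 3)
```

Both values should agree with the R and R-Squared rows of the Model Summary, so roughly 40% of the variability in price is left unexplained by living area alone.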

Introduction to Multiple Linear Regression

  • This linear regression model format can also be used if there are multiple explanatory (X) variables.

  • If a model has more than one X variable, it is a MULTIPLE LINEAR REGRESSION model.

  • We will examine one more dataset today to introduce this concept.

  • First let’s import and examine the data:


real_estate <- read_csv("data/Real_Estate.csv", show_col_types = FALSE)
head(real_estate)
# A tibble: 6 × 4
   Price Living_Area Bathrooms House_Age
   <dbl>       <dbl>     <dbl>     <dbl>
1 217314        2498       2.5        14
2 238792        2250       2.5        10
3 222330        2712       3           1
4 206688        2284       2.5        17
5  88207        1480       1.5        14
6 236936        2300       2.5        16

Simple Linear Regression vs. Multiple Linear Regression

Transitioning from SLR to MLR is Straightforward

  • In R and most software, adding a variable to our model is as simple as adding a term to the model formula with +.

  • The challenge is interpretation because we can no longer visualize the model.

  • There are 3-D visualization tools in R, BUT they are not always helpful.

  • Instead, I recommend extending the SLR model output interpretation to the variables in the model.

  • On the next slide we’ll add number of bathrooms.

    • Spoiler: Number of bathrooms is a huge deal when buying a house.
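To illustrate how little the code changes, the sketch below fits an SLR and an MLR with base R's `lm()` (swapped in here for `ols_regress()` so the example needs no extra packages). It uses only the six rows printed by `head(real_estate)`, entered by hand, so the estimates will NOT match the full-data output in this lecture; the names `houses`, `slr_fit`, and `mlr_fit` are illustrative.

```r
# The six rows printed by head(real_estate), entered by hand
houses <- data.frame(
  Price       = c(217314, 238792, 222330, 206688, 88207, 236936),
  Living_Area = c(2498, 2250, 2712, 2284, 1480, 2300),
  Bathrooms   = c(2.5, 2.5, 3, 2.5, 1.5, 2.5)
)

# SLR: one explanatory variable
slr_fit <- lm(Price ~ Living_Area, data = houses)

# MLR: add a second explanatory variable with +
mlr_fit <- lm(Price ~ Living_Area + Bathrooms, data = houses)

# One estimate per term, plus the intercept
coef(slr_fit)
coef(mlr_fit)
```

The only difference between the two calls is `+ Bathrooms` in the formula, which is the same change made between `house_mod1` and `house_mod2`.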

MLR Model with Two Variables

(house_mod2 <- ols_regress(Price ~ Living_Area + Bathrooms, data=real_estate))
                            Model Summary                              
----------------------------------------------------------------------
R                       0.815       RMSE                    41726.448 
R-Squared               0.665       Coef. Var                  25.289 
Adj. R-Squared          0.661       MSE                1741096458.349 
Pred R-Squared          0.640       MAE                     30629.922 
----------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     679572705955.336          2    339786352977.668    195.157    0.0000 
Residual       342996002294.664        197      1741096458.349                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                        Parameter Estimates                                         
---------------------------------------------------------------------------------------------------
      model          Beta    Std. Error    Std. Beta      t        Sig          lower        upper 
---------------------------------------------------------------------------------------------------
(Intercept)    -11553.295      9556.111                 -1.209    0.228    -30398.701     7292.110 
Living_Area        58.047         5.875        0.543     9.881    0.000        46.462       69.633 
  Bathrooms     38141.447      6027.411        0.348     6.328    0.000     26254.916    50027.977 
---------------------------------------------------------------------------------------------------

A closer look at the Parameter Estimates

Interpreting the New Model

Model: \[ \text{Est. Selling Price} = -11553.295 + 58.047 \times \text{Living Area} + 38141.447 \times \text{Bathrooms} \]

Interpretation:

  • If number of bathrooms remains unchanged, each additional square foot is estimated to raise the selling price by about 58 dollars.

  • If living area remains unchanged, each additional bathroom will raise the estimated selling price by about 38 THOUSAND dollars.
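The phrase "remains unchanged" can be checked numerically by comparing two houses that differ in only one variable. The sketch below copies the coefficients from the `house_mod2` output; the helper `est_price2()` and the example houses are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the Parameter Estimates table for house_mod2
b <- c(intercept = -11553.295, living_area = 58.047, bathrooms = 38141.447)

# Hypothetical helper that applies the model equation
est_price2 <- function(living_area, bathrooms) {
  unname(b["intercept"] + b["living_area"] * living_area +
           b["bathrooms"] * bathrooms)
}

# Same living area, one extra bathroom: difference is the Bathrooms coefficient
diff_bath <- est_price2(2000, 3) - est_price2(2000, 2)
diff_bath

# Same number of bathrooms, one extra square foot:
# difference is the Living_Area coefficient
diff_sqft <- est_price2(2001, 2) - est_price2(2000, 2)
diff_sqft
```

The differences do not depend on which starting house is chosen, because the intercept and the unchanged variable cancel out.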

💥 Lecture 12 In-class Exercises - Q2 💥

Based on this model, if a house is renovated to increase the square footage by 1000 square feet and two bathrooms are added, what would be the estimated change in price?

Round your answer to a whole dollar amount.


Model: \[ \text{Est. Selling Price} = -11553.295 + 58.047 \times \text{Living Area} + 38141.447 \times \text{Bathrooms} \]


house_mod2$betas |> round(3)
(Intercept) Living_Area   Bathrooms 
 -11553.295      58.047   38141.447 

Adding ANOTHER Term to our MLR

Next, we add age of the house to the model:

(house_mod3 <- ols_regress(Price ~ Living_Area + Bathrooms + House_Age, data=real_estate))
                            Model Summary                              
----------------------------------------------------------------------
R                       0.821       RMSE                    41279.100 
R-Squared               0.673       Coef. Var                  25.018 
Adj. R-Squared          0.668       MSE                1703964107.727 
Pred R-Squared          0.641       MAE                     30119.407 
----------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     688591743135.442          3    229530581045.147    134.704    0.0000 
Residual       333976965114.558        196      1703964107.727                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta      t        Sig          lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)     5775.299     12087.330                  0.478    0.633    -18062.622    29613.220 
Living_Area       60.614         5.918        0.567    10.243    0.000        48.943       72.285 
  Bathrooms    30089.928      6913.944        0.274     4.352    0.000     16454.654    43725.201 
  House_Age     -235.721       102.458       -0.112    -2.301    0.022      -437.783      -33.658 
--------------------------------------------------------------------------------------------------

Examining the new model

Hopefully, the interpretation will seem redundant at this point…

The New Model

Model: \[ \text{Est. Selling Price} = 5775.299 + 60.614 \times \text{Living Area} + 30089.928 \times \text{Bathrooms} - 235.721 \times \text{House Age} \]

Interpretation:

  • If number of bathrooms and age of the house remain unchanged, each additional square foot is estimated to raise the selling price by about 61 dollars.

  • If living area and age of the house remain unchanged, each additional bathroom will raise the estimated selling price by about 30 THOUSAND dollars.

  • If living area and number of bathrooms remain unchanged, each additional year will LOWER the estimated selling price by about 236 dollars.
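As a sketch of how the three-variable equation is used for prediction, the example below plugs in a hypothetical house (2000 sq. ft., 2 bathrooms, 10 years old). The coefficients are copied from the `house_mod3` output; the house and the variable names are assumptions for illustration.

```r
# Coefficients copied from the Parameter Estimates table for house_mod3
b <- c(intercept = 5775.299, living_area = 60.614,
       bathrooms = 30089.928, house_age = -235.721)

# Hypothetical house; the leading 1 multiplies the intercept
house <- c(intercept = 1, living_area = 2000, bathrooms = 2, house_age = 10)

# Estimated selling price: sum of coefficient * value over all terms
est_price3 <- sum(b * house)
round(est_price3)

# Estimated effect of the same house being 10 years older,
# with living area and bathrooms unchanged
age_effect <- unname(b["house_age"]) * 10
age_effect
```

The negative `age_effect` matches the interpretation above: with the other two variables held fixed, an older house has a lower estimated price.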

💥 Lecture 12 In-class Exercises - Q3 💥


What is the estimated price of a 2500 square foot house with 4 bathrooms that is 20 years old?


house_mod3$betas |> round(3)
(Intercept) Living_Area   Bathrooms   House_Age 
   5775.299      60.614   30089.928    -235.721 

Introduction to HW 5

  • Today we will focus on

    • How to navigate AND edit Quarto (.qmd) files

    • Getting started on HW 5.

    • Demonstration and Explanation of a Natural Log transformation

    • Two more In-class Exercises

💥 Lecture 12 In-class Exercises - Q4 and Q5 💥

In HW 5, you create a new variable, ln_Charges, the natural log of Charges. Charges are the medical insurance charges for people in the dataset.

Based on the summary values and the histograms of these two variables, answer the following questions.

Question 4. The variable Charges is

  1. left-skewed

  2. normally distributed

  3. right-skewed

Question 5. The transformed variable ln_Charges is

  1. left-skewed

  2. normally distributed

  3. right-skewed

Key Points from Today

  • Multiple Linear Regression (MLR) is an extension of SLR where we ADD more variables to the model.

    • For MLR, the hypothesis test of each \(\beta\) is an indication of whether or not that variable is useful to the model.

  • A key assumption of SLR and MLR is that the response, Y, is approximately normally distributed.

  • If the response is right-skewed, which is common in data having to do with money, a good strategy is to use a natural log transformation.

  • This process is illustrated in HW 5.
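To connect the hypothesis test of each \(\beta\) to the Parameter Estimates tables, the sketch below recomputes the t statistic and two-sided p-value for House_Age in `house_mod3`. It assumes the usual null hypothesis \(\beta = 0\) and uses the residual degrees of freedom from the ANOVA table; the variable names are illustrative.

```r
# Estimate and standard error for House_Age, copied from the house_mod3 output
beta_hat  <- -235.721
std_error <- 102.458
df_resid  <- 196        # Residual DF from the house_mod3 ANOVA table

# Test statistic for H0: beta = 0
t_stat <- beta_hat / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

round(c(t = t_stat, p = p_value), 3)
```

The results should be close to the t and Sig columns of the House_Age row; a small p-value is the evidence that the variable is useful to the model.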

To submit an Engagement Question or Comment about material from Lecture 12: Submit by midnight today (day of lecture). Click on the link under Lecture 12.