Lecture 19 - Quiz 2 Review

Penelope Pooler Eisenbies
BUA 345

2024-03-26

Housekeeping

  • HW 8 (Parts 1 and 2) was due on Monday (3/25)

    • Part 1 of HW 8 pertained to Lectures 15 - 17

    • Part 2 of HW 8 pertains to Lecture 18 on Logistic Regression

    • Grace period ends tonight (Tue 3/26) at midnight.

  • Quiz 2 is Thursday, March 28th

    • There will be an asynchronous option.

  • Practice Questions and Demo Videos are available.

  • Quiz 2 is primarily based on material from

    • Lectures 9 - 18

    • HW Assignments 5, 6, 7, 8 Pt. 1, 8 Pt. 2

Lectures 9 - 11 (HW 5)

Correlation, SLR, and MLR

  • Simple Linear Regression and Multiple Linear Regression
  • How to calculate and interpret a correlation matrix in R
  • Review of Scatterplot Matrices
# imported dataset is saved as an object named houses
houses <- read_csv("data/houses.csv", show_col_types=F) |> glimpse(width=60)
Rows: 200
Columns: 4
$ Price     <dbl> 217314, 238792, 222330, 206688, 88207, 2…
$ Area      <dbl> 2498, 2250, 2712, 2284, 1480, 2300, 957,…
$ Bathrooms <dbl> 2.5, 2.5, 3.0, 2.5, 1.5, 2.5, 1.0, 2.0, …
$ Age       <dbl> 14, 10, 1, 17, 14, 16, 49, 18, 88, 49, 3…

💥 Lecture 19 In-class Exercises - Q1 and Q2 💥

Session ID: bua345s24

Question 1:

What is the correlation between Age (house age) and Area (living area) in the houses dataset?

Question 2:

Based on the correlation matrix below, are any of the variables in the houses dataset multicollinear?


houses |> cor() |> round(2) # correlation matrix
          Price  Area Bathrooms   Age
Price      1.00  0.77      0.71 -0.38
Area       0.77  1.00      0.66 -0.22
Bathrooms  0.71  0.66      1.00 -0.52
Age       -0.38 -0.22     -0.52  1.00

Scatterplot matrices graphically display the information in the correlation matrix.
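
For reference, one quick way to draw a scatterplot matrix in base R (the lectures may have used a different plotting command):

pairs(houses)   # scatterplot matrix of every pair of variables in houses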

Specifying an SLR or MLR model in R

Models are specified with ols_regress in the olsrr package OR with lm (the base R command)

  • The model formula format (Y ~ X terms) is the same in both cases

  • Interpretation of \(R^2\) in SLR

(houses_slr <- ols_regress(Price ~ Area, data = houses))
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.772       RMSE                    45426.628 
R-Squared                   0.596       MSE                2084422772.677 
Adj. R-Squared              0.594       Coef. Var                  27.670 
Pred R-Squared              0.579       AIC                      4863.117 
MAE                     31692.288       SBC                      4873.012 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     609852999259.857          1    609852999259.857    292.576    0.0000 
Residual       412715708990.143        198      2084422772.677                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                       Parameter Estimates                                        
-------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta      t        Sig         lower        upper 
-------------------------------------------------------------------------------------------------
(Intercept)    16505.199      9262.237                  1.782    0.076    -1760.095    34770.493 
       Area       82.588         4.828        0.772    17.105    0.000       73.066       92.110 
-------------------------------------------------------------------------------------------------

💥 Lecture 19 In-class Exercises - Q3 💥

Session ID: bua345s24

The correlation between Price (selling price) and Area (living area) is 0.772, and the \(R^2\) for the SLR model is 0.596.

What proportion of the variability in selling price is explained by living area?

💥 Lecture 19 In-class Exercises - Q4 💥

Session ID: bua345s24

Residual = Observed Y - Est. Y = Model Response - Model Estimate

What is the residual for the second house shown in the data below?

houses_mlr <- ols_regress(Price ~ Area + Bathrooms + Age, data = houses) # specify model

houses <- houses |>
  mutate(Est_Selling_Price = lm(houses_mlr$model) |> predict(houses) |> round()) # add regression estimates

head(houses, 4) |> kable()
  Price  Area  Bathrooms  Age  Est_Selling_Price
 217314  2498        2.5   14             229114
 238792  2250        2.5   10             215025
 222330  2712        3.0    1             260195
 206688  2284        2.5   17             215436
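
As a worked example of the residual formula above (using the FIRST house, not the house asked about in Q4):

217314 - 229114   # Observed Price - Est_Selling_Price for house 1
[1] -11800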

Additional Questions about MLR

(Not in PointSolutions)

  • Why is the natural log (LN) transformation of Y sometimes needed?

    • Recall that in R the command to do this is log. In Excel it is LN.

    • How do we back transform estimates from a model when LN(Y) is the response?

      • Can be done in Excel or R using the exp function (see the sketch after this list).
  • How to interpret Multiple Linear Regression output

    • What hypothesis is being tested in each line of output?

    • What do we conclude if the P-value (sometimes labeled Sig) is greater than 0.05?

    • Note that in Backward Elimination we set a P-value cutoff of 0.1 (prem = 0.1), but we can later exclude variables when determining the final model.

    • Also note that Backward Elimination can alternatively be done using AIC or Adjusted \(R^2\).
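
Below is a minimal sketch of fitting and back-transforming an LN(Y) model using the houses data from above (this particular model is illustrative, not from an assignment):

ln_model <- lm(log(Price) ~ Area + Age, data = houses)   # LN(Price) is the response

est_ln_price <- predict(ln_model, houses[1, ])   # estimate on the LN scale
exp(est_ln_price)                                # back-transform to dollars with exp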

Lectures 13 and 14 (HW 6)

Categorical Regression - Parallel Lines Model

  • How do we determine if there are two or more separate intercepts?

  • NOTE that slopes for ALL categories are the same in a parallel lines model (see the specification sketch below).

HW 6 Remodeled Houses Model Equations:

  • Model for un-remodeled Houses:

    • Price = 166419.209 + 118.14*Square_Feet
  • For Remodeled Houses, combine the baseline intercept with the difference due to remodeling (the RemodeledYes coefficient)

  • Model for Remodeled Houses:

    • Price = 166419.209 + 118.14*Square_Feet + 90325.284

    • Price = (166419.209 + 90325.284) + 118.14*Square_Feet

    • Price = 256744.5 + 118.14*Square_Feet
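
A parallel lines model like this one is specified with one quantitative term and one categorical term, as sketched below (the data frame and variable names are hypothetical):

ols_regress(Price ~ Square_Feet + Remodeled, data = remodeled_houses)   # one slope, separate intercepts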

Lectures 13 and 14 (HW 6)

Categorical Regression - Interaction Model (Practice Questions 15 - 21)

  • How do we determine if there are two or more separate intercepts?
  • How is this model different from the Parallel Lines Model?
  • How do we determine if there are two or more different slopes? (See the specification sketch below.)

HW 6 Diamonds Model Equations:

  • Model for Colorless Diamonds:

    • Price = -4446.56 + 10476.13*Weight
  • Model for Faint Yellow Diamonds:

    • Price = -4446.56 + 10476.13*Weight + 3464.41 - 6670.53*Weight
    • Price = (-4446.56 + 3464.41) + (10476.13 - 6670.53)*Weight
    • Price = -982.15 + 3805.60*Weight
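
An interaction model adds a product (interaction) term so that each category gets its own slope, as sketched below (the data frame name is hypothetical):

interaction_model <- lm(Price ~ Weight + Color + Weight:Color, data = diamonds_data)   # separate intercepts AND slopes
summary(interaction_model)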

Lectures 15 - 17 (HW 8 - Part 1)

Model Selection

  • Examining Data using Correlation and Scatterplot Matrices (See above)

  • Definition of Multicollinearity and how to determine if two variables are multicollinear

  • Definitions and R commands for the following methods (R commands are sketched after this list)

    • Backward Elimination, Forward Selection, and Stepwise Selection
    • Best Subsets (AIC, Mallows' C(p), Adjusted \(R^2\), RMSE)
  • Interpreting Measures of Model Fit

    • Adjusted \(R^2\), AIC, Mallows' C(p), RMSE
  • Interpreting Final Model

    • Same as for other MLR models and SLR models
    • Remember to back transform estimate if LN transformation is used
    • Residual = Observed Y - Estimate of Y
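
For reference, a sketch of the olsrr model selection commands applied to the houses MLR from above (exact argument names may differ across olsrr versions):

houses_lm <- lm(Price ~ Area + Bathrooms + Age, data = houses)   # full model

ols_step_backward_p(houses_lm, prem = 0.1)   # backward elimination with P-value cutoff
ols_step_forward_p(houses_lm)                # forward selection
ols_step_both_p(houses_lm)                   # stepwise selection
ols_step_best_subset(houses_lm)              # best subsets (AIC, C(p), Adj. R-squared, ...)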

Lecture 18 (HW 8 - Part 2) - Logistic Regression

  • Definition of Odds: Odds is the ratio of the probability of an event occurring to the probability of it not occurring.

    • Recall that Odds can be calculated from probability

    • Probability is denoted as P or P(Event), e.g. P(Late Payment)

      • \(Odds = \frac{P(Event)}{1-P(Event)} = \frac{P}{1-P}\)
  • Converting Odds to Probability

    • \(P = \frac{Odds}{1+Odds}\)
  • LN Odds are used as the link function in Logistic Regression (see the numeric check below)
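
A quick numeric check of the two formulas above, using P = 0.2 as an example:

p <- 0.2            # example probability, e.g. P(Late Payment)
odds <- p/(1 - p)   # Odds = P/(1-P) = 0.25
odds/(1 + odds)     # converting back to probability
[1] 0.2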

Logistic Regression

  • Logistic Regression is used when Y is binary, a categorical variable with two categories such as:

    • Yes or No
    • Passed or Failed
    • Survived or Not Survived (Titanic Example in Lecture 18)
    • Late Payment or Not (Examples in Lecture 18, HW 8, and Practice Questions)
  • We specify the Logistic Regression Model in almost the same way as an MLR model EXCEPT we use glm (generalized linear model) instead of lm (linear model), as sketched below.

    • GLM relaxes the LM assumption that the response is quantitative and normally distributed.
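
A minimal specification sketch (the data frame and predictor names are hypothetical):

late_model <- glm(Late ~ Income + Credit_Score,       # Y (Late) is binary: 1 = late, 0 = not late
                  family = binomial, data = payments) # family = binomial makes this logistic regression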

Logistic Regression and Back Transforming Estimates

  • Estimated Response, Y’, is the LN Odds of an Event

  • Convert LN Odds, Y’, to Probability as: \(P = \frac{e^{Y'}}{1 + e^{Y'}}\)

    • Recall that in R and Excel, \(e^{x}\) is calculated as exp(x) e.g. \(e^{3}\) is exp(3) in R or =exp(3) in Excel.

    • Estimated LN Odds from Logistic Regression are converted to probability for interpretation (see below)

log_odds <- -1.4067                   # answer from HW 8 - Part 1 - Question 5
exp(log_odds)/(1 + exp(log_odds))     # calculation in R using exp function
plogis(log_odds)                      # calculation in R using plogis function
plogis(-1.4067)                       # calculation in R using plogis function and number
[1] 0.1967551
[1] 0.1967551
[1] 0.1967551

Key Points from Today

Topics covered in Quiz 2

  • Simple Linear Regression (From Quiz 1)
  • Multiple Linear Regression (with all quantitative terms)
  • Categorical Regression
    • Parallel Lines Models and Interaction Models
  • Model Selection: Backward, Forward, Stepwise and Best Subsets
  • Goodness of Model Fit: Adj. \(R^2\), AIC, Mallows' C(p), RMSE
  • Logistic Regression
    • Odds, Log Odds, Converting Odds and Log Odds to Prob.
    • Model Estimates

To submit an Engagement Question or Comment about material from Today’s Lecture: Submit by midnight today (day of lecture) using the link under today’s lecture.