BUA 345 - Lecture 19

Review for Quiz 2

Author

Penelope Pooler Eisenbies

Published

March 24, 2025

Housekeeping

  • HW 8 (Parts 1 and 2) is due tomorrow (3/26)

    • Part 1 of HW 8 pertained to Lectures 15 - 17

    • Part 2 of HW 8 pertains to Lecture 18 on Logistic Regression

    • Grace period ends Thursday (3/27)

  • NO CLASS ON THURSDAY, 3/27

  • NO OFFICE HOURS ON THURSDAY, 3/27

  • Quiz 2 is Tuesday, 4/1 in class

    • There will NOT be an asynchronous option.

    • Practice Questions are available and demo videos will be posted by Thursday.

  • Quiz 2 is primarily based on material from

    • Lectures 9 - 18

    • HW Assignments 5, 6, 7, 8 Pt. 1, 8 Pt. 2

Lectures 9 - 11 (HW 5)

Correlation, SLR, and MLR

  • Simple Linear Regression and Multiple Linear Regression

  • How to calculate and interpret a correlation matrix in R

  • Review of Scatterplot Matrices

Rows: 200
Columns: 4
$ Price     <dbl> 217314, 238792, 222330, 206688, 88207, 2…
$ Area      <dbl> 2498, 2250, 2712, 2284, 1480, 2300, 957,…
$ Bathrooms <dbl> 2.5, 2.5, 3.0, 2.5, 1.5, 2.5, 1.0, 2.0, …
$ Age       <dbl> 14, 10, 1, 17, 14, 16, 49, 18, 88, 49, 3…
 Price Area Bathrooms Age
217314 2498       2.5  14
238792 2250       2.5  10
222330 2712       3.0   1
206688 2284       2.5  17
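As a minimal sketch of how a correlation matrix is calculated in R, the block below applies base R's `cor()` to the four rows previewed above. The data frame name `houses_preview` is an assumption made for illustration. Because it holds only 4 of the 200 rows, its correlations will not match the values for the full dataset.

```r
# Four rows copied from the preview above (the full dataset has 200 rows,
# so these correlations will NOT match the full-data correlation matrix).
houses_preview <- data.frame(
  Price     = c(217314, 238792, 222330, 206688),
  Area      = c(2498, 2250, 2712, 2284),
  Bathrooms = c(2.5, 2.5, 3.0, 2.5),
  Age       = c(14, 10, 1, 17)
)

# cor() returns all pairwise Pearson correlations; round for readability
cor_matrix <- round(cor(houses_preview), 2)
cor_matrix

# A base R scatterplot matrix of the same variables can be drawn with:
# pairs(houses_preview)
```

Each off-diagonal entry is the correlation between the row variable and the column variable, so the matrix is symmetric with 1s on the diagonal.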

Lecture 19 In-class Exercises - Q1-Q2

Session ID: bua345s25

Question 1:

What is the correlation between House_Age and Living_Area in the houses dataset?

Question 2:

Are there any multicollinear variables in the following dataset?


          Price  Area Bathrooms   Age
Price      1.00  0.77      0.71 -0.38
Area       0.77  1.00      0.66 -0.22
Bathrooms  0.71  0.66      1.00 -0.52
Age       -0.38 -0.22     -0.52  1.00

Scatterplot Matrices

  • Shows all pairwise scatterplots

Specifying an SLR or MLR model in R

Model specified with ols_regress in the olsrr package OR with lm (base R command)

  • Model format is always the same

  • Interpretation of \(R^2\) in SLR

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.772       RMSE                    45426.628 
R-Squared                   0.596       MSE                2063578544.951 
Adj. R-Squared              0.594       Coef. Var                  27.670 
Pred R-Squared              0.579       AIC                      4863.117 
MAE                     31692.288       SBC                      4873.012 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                       ANOVA                                         
------------------------------------------------------------------------------------
                         Sum of                                                     
                        Squares         DF         Mean Square       F         Sig. 
------------------------------------------------------------------------------------
Regression     609852999259.857          1    609852999259.857    292.576    0.0000 
Residual       412715708990.143        198      2084422772.677                      
Total         1022568708250.000        199                                          
------------------------------------------------------------------------------------

                                       Parameter Estimates                                        
-------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta      t        Sig         lower        upper 
-------------------------------------------------------------------------------------------------
(Intercept)    16505.199      9262.237                  1.782    0.076    -1760.095    34770.493 
       Area       82.588         4.828        0.772    17.105    0.000       73.066       92.110 
-------------------------------------------------------------------------------------------------
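The model format can be sketched with base R's `lm` as follows. This is an illustration only: `houses_preview` is an assumed name holding the four rows previewed earlier, so its estimates will differ from the 200-row output above.

```r
# Four preview rows only; estimates will differ from the full-data output
houses_preview <- data.frame(
  Price = c(217314, 238792, 222330, 206688),
  Area  = c(2498, 2250, 2712, 2284),
  Age   = c(14, 10, 1, 17)
)

# SLR: response ~ one predictor
slr_model <- lm(Price ~ Area, data = houses_preview)

# MLR: same format, with predictors joined by +
mlr_model <- lm(Price ~ Area + Age, data = houses_preview)

coef(slr_model)   # intercept and slope
coef(mlr_model)   # intercept and one coefficient per predictor

# In SLR, R-squared equals the squared correlation between X and Y
r_squared <- summary(slr_model)$r.squared
r <- cor(houses_preview$Price, houses_preview$Area)
all.equal(r_squared, r^2)

# Same relationship using the rounded values in the output above
round(0.772^2, 3)   # 0.596
```

The last line ties the R and R-Squared rows of the Model Summary together: squaring 0.772 gives the reported 0.596.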

Lecture 19 In-class Exercises - Q3

The correlation between Selling_Price and Living_Area is 0.772 and the \(R^2\) for the SLR model is 0.596.

What proportion of the variability in selling price is explained by living area?

Lecture 19 In-class Exercises - Q4

Residual = Observed Y - Est. Y = Model Response - Model Estimate

What is the residual for the second house shown in the data below?

 Price Area Bathrooms Age Est_Selling_Price
217314 2498       2.5  14            229114
238792 2250       2.5  10            215025
222330 2712       3.0   1            260195
206688 2284       2.5  17            215436
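The residual formula can be applied to every row at once with vector subtraction. This sketch copies the Price and Est_Selling_Price columns from the table above; the vector names are assumptions chosen for illustration.

```r
# Observed and estimated selling prices, copied from the table above
price     <- c(217314, 238792, 222330, 206688)
est_price <- c(229114, 215025, 260195, 215436)

# Residual = Observed Y - Estimated Y, one value per house
residuals_by_house <- price - est_price
residuals_by_house
# [1] -11800  23767 -37865  -8748
```

A positive residual means the house sold for more than the model estimated; a negative residual means it sold for less.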

Additional Questions about MLR

(Not in PointSolutions)

  • Why is the natural log (LN) transformation of Y sometimes needed?

    • Recall that in R the command for this is log; in Excel it is LN.

    • How do we back transform estimates from a model when LN(Y) is the response?

      • Can be done in Excel or R using exp function
  • How to interpret Multiple Linear Regression output

    • What hypothesis is being tested in each line of output?

    • What do we conclude if the P-value (sometimes labeled Sig) is greater than 0.05?

    • Note that in Backward Elimination we set a P-value cutoff of 0.1 (prem = 0.1), but we can later exclude variables when determining the final model.

    • Also note that Backward Elimination can alternatively be done using AIC or Adjusted \(R^2\).
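The LN transformation and its back transformation can be sketched in R as below. The price comes from the first house in the data shown earlier; the value `est_ln_price` is a hypothetical model estimate invented for illustration, not a result from any lecture model.

```r
# Natural log transformation of a response value
price    <- 217314
ln_price <- log(price)      # log() is the natural log in R; Excel: =LN(217314)

# Back transformation: exp() undoes log()
exp(ln_price)               # returns the original price; Excel: =EXP(...)

# A model fit to LN(Y) gives estimates on the LN scale.
# est_ln_price is a HYPOTHETICAL estimate, used only to show the step.
est_ln_price   <- 12.3
est_price_back <- exp(est_ln_price)   # estimate in original units (dollars)
est_price_back
```

The point of the last step is that an estimate from an LN(Y) model is not in dollars until `exp` is applied.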

Lectures 13 and 14 (HW 6)

Categorical Regression - Parallel Lines Model

  • How do we determine if there are two or more separate intercepts?

  • NOTE that slopes for ALL categories are the same in a parallel lines model.

HW 6 Remodeled Houses Model Equations:

  • Model for un-remodeled Houses:

    • Price = 166419.209 + 118.14*Square_Feet
  • For Remodeled Houses combine baseline intercept with difference due to remodeling (RemodeledYes)

  • Model for Remodeled Houses:

    • Price = 166419.209 + 118.14*Square_Feet + 90325.284

    • Price = (166419.209 + 90325.284) + 118.14*Square_Feet

    • Price = 256744.5 + 118.14*Square_Feet
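The two equations above can be written as one function with a 0/1 indicator for remodeling. This is a sketch: the function name `est_price_parallel` and the 2000 square foot example are assumptions for illustration, while the coefficients are the ones shown above.

```r
# Parallel lines model: one shared slope, intercept shifts when remodeled
est_price_parallel <- function(square_feet, remodeled) {
  indicator <- as.numeric(remodeled)   # 1 if remodeled, 0 if not
  166419.209 + 118.14 * square_feet + 90325.284 * indicator
}

# Hypothetical 2000 square foot house, un-remodeled vs. remodeled
price_no  <- est_price_parallel(2000, remodeled = FALSE)
price_yes <- est_price_parallel(2000, remodeled = TRUE)

# The gap equals the RemodeledYes coefficient at ANY square footage
price_yes - price_no
```

Because the slope is shared, the difference between the two estimates is the same 90325.284 regardless of the square footage chosen.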

Lectures 13 and 14 (HW 6)

Categorical Regression - Interaction Model (Practice Questions 15 - 21)

  • How do we determine if there are two or more separate intercepts?

  • How is this model different from the Parallel Lines Model?

  • How do we determine if there are two or more different slopes?

HW 6 Diamonds Model Equations:

  • Model for Colorless Diamonds:

    • Price = -4446.56 + 10476.13*Weight
  • Model for Faint Yellow Diamonds:

    • Price = -4446.56 + 10476.13*Weight + 3464.41 - 6670.53*Weight
    • Price = (-4446.56 + 3464.41) + (10476.13 - 6670.53)*Weight
    • Price = -982.15 + 3805.6*Weight
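The interaction model can be sketched the same way, with the indicator changing both the intercept and the slope. The function name `est_price_diamond` is an assumption for illustration; the coefficients are those shown above.

```r
# Interaction model: the category shifts the intercept AND the slope
est_price_diamond <- function(weight, faint_yellow) {
  indicator <- as.numeric(faint_yellow)   # 1 if Faint Yellow, 0 if Colorless
  -4446.56 + 10476.13 * weight +
    3464.41 * indicator - 6670.53 * weight * indicator
}

# Recover the Faint Yellow intercept and slope from the function
fy_intercept <- est_price_diamond(0, faint_yellow = TRUE)
fy_slope     <- est_price_diamond(1, faint_yellow = TRUE) - fy_intercept

round(fy_intercept, 2)   # -982.15
round(fy_slope, 2)       # 3805.6
```

Unlike the parallel lines model, the gap between the two categories here changes with Weight because the slopes differ.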

Lectures 15 - 17 (HW 8 - Part 1)

Model Selection

  • Examining Data using Correlation and Scatterplot Matrices (See above)

  • Definition of Multicollinearity and how to determine if two variables are multicollinear

  • Definitions and R commands for the following methods

    • Backward Elimination, Forward Selection, and Stepwise Selection

    • Best Subsets (AIC, Mallows' C(p), Adjusted \(R^2\), RMSE)

  • Interpreting Measures of Model Fit

    • Adjusted \(R^2\), AIC, Mallows' C(p), RMSE
  • Interpreting Final Model

    • Same as for other MLR models and SLR models
    • Remember to back transform estimate if LN transformation is used
    • Residual = Observed Y - Estimate of Y
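As a rough sketch of backward elimination and the fit measures listed above, the block below uses base R's `step()`, which selects by AIC. This is a swapped-in technique and not necessarily the course's command, and the built-in `mtcars` data and the chosen predictors are stand-ins for illustration only.

```r
# Stand-in data: built-in mtcars (NOT a course dataset)
full_model <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

# Backward elimination by AIC: drop terms while AIC improves
backward_model <- step(full_model, direction = "backward", trace = 0)
formula(backward_model)   # terms kept in the selected model

# Measures of fit for the selected model
fit_measures <- c(
  adj_r_squared = summary(backward_model)$adj.r.squared,
  AIC           = AIC(backward_model),
  RMSE          = sqrt(mean(residuals(backward_model)^2))
)
fit_measures
```

Higher Adjusted \(R^2\) and lower AIC and RMSE indicate better fit. RMSE is computed here as the square root of the mean squared residual, which is an assumption about the definition intended in the course.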

Lecture 18 (HW 8 - Part 2) - Logistic Regression

  • Definition of Odds: Odds is the ratio of the probability of an event occurring to the probability of it not occurring.

  • Converting Probability to Odds

    • Probability is denoted as P or P(Event), e.g. P(Late Payment)

    • \(Odds = \frac{P(Event)}{1-P(Event)} = \frac{P}{1-P}\)

  • Converting Odds to Probability (P)

    • \(P = \frac{Odds}{1+Odds}\)
  • LN Odds are used as link function in Logistic Regression
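The conversions above translate directly into two small R functions. The function names and the example probability of 0.2 are assumptions for illustration.

```r
# Probability -> Odds and Odds -> Probability, as defined above
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

p    <- 0.2                  # hypothetical P(Event)
odds <- prob_to_odds(p)      # 0.2 / 0.8 = 0.25
odds

ln_odds <- log(odds)         # LN Odds, the scale used by logistic regression
ln_odds

odds_to_prob(odds)           # back to 0.2
odds_to_prob(exp(ln_odds))   # LN Odds -> Odds -> Probability, also 0.2
```

A probability of 0.5 corresponds to odds of 1 and LN Odds of 0, which is a useful reference point when reading logistic regression estimates.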

Logistic Regression

  • Logistic Regression is used when Y is binary, a categorical variable with two categories such as:

    • Yes or No
    • Passed or Failed
    • Survived or Not Survived (Titanic Example in Lecture 18)
    • Late Payment or Not (Examples in Lecture 18, HW 8, and Practice Questions)
  • We specify the Logistic Regression Model in almost the same way as an MLR model EXCEPT we use glm (generalized linear model) instead of lm (linear model).

    • GLM relaxes the LM assumption that the response is quantitative and normal.
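A minimal sketch of the `glm` specification follows. The built-in `mtcars` data, with the binary variable `am` as the response, is a stand-in chosen only because it ships with R; it is not one of the lecture datasets.

```r
# Stand-in example: am (0/1) is a binary response in built-in mtcars
# family = binomial tells glm to fit a logistic regression
logit_model <- glm(am ~ wt, data = mtcars, family = binomial)
coef(logit_model)   # coefficients are on the LN Odds scale

# Estimate for a hypothetical car with wt = 3
new_car  <- data.frame(wt = 3)
log_odds <- unname(predict(logit_model, newdata = new_car, type = "link"))
prob     <- unname(predict(logit_model, newdata = new_car, type = "response"))

# type = "link" gives LN Odds; type = "response" gives probability
all.equal(plogis(log_odds), prob)
```

The formula has the same `response ~ predictors` format as `lm`; only the function and the `family` argument change.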

Back Transforming Logistic Regression Estimates

  • Estimated Response, Y’, is the LN Odds of an Event

  • Convert LN Odds, Y’, to Probability as: \(P = \frac{e^{Y'}}{1 + e^{Y'}}\)

    • Recall that in R and Excel:

      • \(e^{x}\) is calculated as exp(x)

      • \(e^{3}\) is exp(3) in R or =exp(3) in Excel.

    • Estimated LN Odds from Logistic Regression are converted to probability for interpretation (next slide).

Lecture 19 In-class Exercises - Q5

The log odds for survival of a female child in second class was 2.0873 (see worksheet from Lecture 18).

What was the probability of survival for a female child in second class?

Examples of Back Transformation Calculations in R

These calculations can be done in the console or a .qmd file

```{r log odds to probability example, echo=T}
log_odds <- -1.4067                   # answer from HW 8 - Part 1 - Question 5
exp(log_odds)/(1 + exp(log_odds))     # calculation in R using exp function
exp(-1.4067)/(1+exp(-1.4067))
plogis(log_odds)                      # calculation in R using plogis function
plogis(-1.4067)                       # calculation in R using plogis function and number
```
[1] 0.1967551
[1] 0.1967551
[1] 0.1967551
[1] 0.1967551

Key Points

Topics covered in Quiz 2

  • Simple Linear Regression (From Quiz 1)

  • Multiple Linear Regression (with all quantitative terms)

  • Categorical Regression

    • Parallel Lines Models and Interaction Models
  • Model Selection: Backward, Forward, Stepwise and Best Subsets

  • Goodness of Model Fit: Adj. \(R^2\), AIC, Mallows' C(p), RMSE

  • Logistic Regression

    • Odds, Log Odds, Converting Odds and Log Odds to Prob.

    • Model Estimates

To submit an Engagement Question or Comment about material from Lecture 19: Submit it by midnight today (day of lecture).