BUA 345 - Lecture 19
Review for Quiz 2
Housekeeping
- HW 8 (Parts 1 and 2) is due tomorrow (3/26)
  - Part 1 of HW 8 pertains to Lectures 15 - 17
  - Part 2 of HW 8 pertains to Lecture 18 on Logistic Regression
  - Grace period ends Thursday (3/27)
- NO CLASS ON THURSDAY, 3/27
- NO OFFICE HOURS ON THURSDAY, 3/27
- Quiz 2 is Tuesday, 4/1 in class
  - There will NOT be an asynchronous option.
  - Practice Questions are available and demo videos will be posted by Thursday.
- Quiz 2 is primarily based on material from:
  - Lectures 9 - 18
  - HW Assignments 5, 6, 7, 8 Pt. 1, and 8 Pt. 2
Lectures 9 - 11 (HW 5)
- Correlation, Simple Linear Regression (SLR), and Multiple Linear Regression (MLR)
- How to calculate and interpret a correlation matrix in R (see the sketch below)
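As a quick reference, here is a minimal sketch using base R's `cor()`. The data frame is a small stand-in for the 200-row houses dataset: the first rows match the table below, and the remaining values are made up for illustration.

```{r correlation matrix example, echo=T}
# small stand-in for the 200-row houses data
# (first rows match the table below; remaining values are made up)
houses <- data.frame(
  Price     = c(217314, 238792, 222330, 206688, 88207, 195000, 104500, 176300),
  Area      = c(2498, 2250, 2712, 2284, 1480, 2300, 957, 1860),
  Bathrooms = c(2.5, 2.5, 3.0, 2.5, 1.5, 2.5, 1.0, 2.0),
  Age       = c(14, 10, 1, 17, 14, 16, 49, 18)
)

# pairwise Pearson correlations, rounded to 2 decimals
round(cor(houses), 2)
```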
Review of Scatterplot Matrices
| Price  | Area | Bathrooms | Age |
|--------|------|-----------|-----|
| 217314 | 2498 | 2.5       | 14  |
| 238792 | 2250 | 2.5       | 10  |
| 222330 | 2712 | 3.0       | 1   |
| 206688 | 2284 | 2.5       | 17  |
Lecture 19 In-class Exercises - Q1-Q2

Session ID: bua345s25
Question 1: What is the correlation between House_Age and Living_Area in the houses dataset?
Question 2: Are there any multicollinear variables in the following correlation matrix?

```
          Price  Area  Bathrooms   Age
Price      1.00  0.77       0.71 -0.38
Area       0.77  1.00       0.66 -0.22
Bathrooms  0.71  0.66       1.00 -0.52
Age       -0.38 -0.22      -0.52  1.00
```
Scatterplot Matrices
- Shows all pairwise scatterplots
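A one-line way to produce one in base R, reusing the `houses` data frame from the correlation sketch above:

```{r scatterplot matrix example, echo=T}
# all pairwise scatterplots of the numeric columns
pairs(houses)
```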
Specifying an SLR or MLR model in R

Model is specified with `ols_regress` in the `olsrr` package OR with `lm` (base R command).

Model format is always the same.
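A minimal sketch of both options, again assuming the `houses` data frame from above; the formula is identical either way:

```{r slr specification example, echo=T}
library(olsrr)

# base R
slr_lm <- lm(Price ~ Area, data = houses)
summary(slr_lm)

# olsrr: same model, with the more detailed output shown below
ols_regress(Price ~ Area, data = houses)
```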
Interpretation of \(R^2\) in SLR
```
                              Model Summary
--------------------------------------------------------------------------
R                       0.772       RMSE                      45426.628
R-Squared               0.596       MSE                  2063578544.951
Adj. R-Squared          0.594       Coef. Var                    27.670
Pred R-Squared          0.579       AIC                        4863.117
MAE                 31692.288       SBC                        4873.012
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error
  MSE: Mean Square Error
  MAE: Mean Absolute Error
  AIC: Akaike Information Criteria
  SBC: Schwarz Bayesian Criteria

                                      ANOVA
------------------------------------------------------------------------------------
                        Sum of
                       Squares      DF        Mean Square        F          Sig.
------------------------------------------------------------------------------------
Regression    609852999259.857      1    609852999259.857    292.576      0.0000
Residual      412715708990.143    198      2084422772.677
Total        1022568708250.000    199
------------------------------------------------------------------------------------

                                Parameter Estimates
-------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta        t      Sig        lower        upper
-------------------------------------------------------------------------------------------------
(Intercept)    16505.199      9262.237                 1.782    0.076    -1760.095    34770.493
       Area       82.588         4.828        0.772   17.105    0.000       73.066       92.110
-------------------------------------------------------------------------------------------------
```
Lecture 19 In-class Exercises - Q3

Session ID: bua345s25
The correlation between Selling_Price and Living_Area is 0.772, and the \(R^2\) for the SLR model is 0.596.

What proportion of the variability in selling price is explained by living area?
Lecture 19 In-class Exercises - Q4

Session ID: bua345s25
Residual = Observed Y - Est. Y = Model Response - Model Estimate
What is the residual for the second house shown in the data below?
| Price  | Area | Bathrooms | Age | Est_Selling_Price |
|--------|------|-----------|-----|-------------------|
| 217314 | 2498 | 2.5       | 14  | 229114            |
| 238792 | 2250 | 2.5       | 10  | 215025            |
| 222330 | 2712 | 3.0       | 1   | 260195            |
| 206688 | 2284 | 2.5       | 17  | 215436            |
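As a worked illustration of the residual arithmetic (deliberately using the first house rather than the one the question asks about):

```{r residual example, echo=T}
# residual = observed selling price - estimated selling price
observed  <- 217314   # Price of the first house above
estimated <- 229114   # Est_Selling_Price of the first house above
observed - estimated  # [1] -11800; the model over-estimates this house
```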
Additional Questions about MLR
(Not in PointSolutions)
Why is the natural log (LN) transformation of Y sometimes needed?

Recall that in R the command to do this is `log`. In Excel it is `ln`.
How do we back transform estimates from a model when LN(Y) is the response?

- Can be done in Excel or R using the `exp` function
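A minimal sketch of the back-transform; the estimate 12.35 is a made-up value, not from any HW model:

```{r back transform example, echo=T}
# suppose a model for LN(Price) gives an estimated response of 12.35
ln_price_hat <- 12.35

# back-transform to the original price scale
exp(ln_price_hat)  # about 230960
```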
How to interpret Multiple Linear Regression output

- What hypothesis is being tested in each line of output?
- What do we conclude if the P-value (sometimes labeled `Sig`) is greater than 0.05?
- Note that in Backward Elimination we set a P-value cutoff of 0.1 (`prem = 0.1`), but we can later exclude variables when determining the final model.
- Also note that Backward Elimination can alternatively be done using AIC or Adjusted \(R^2\).
Lectures 13 and 14 (HW 6)
Categorical Regression - Parallel Lines Model
How do we determine if there are two or more separate intercepts?
NOTE that slopes for ALL categories are the same in a parallel lines model.
HW 6 Remodeled Houses Model Equations:
Model for un-remodeled Houses:
Price = 166419.209 + 118.14*Square_Feet
For Remodeled Houses, combine the baseline intercept with the difference due to remodeling (`RemodeledYes`).

Model for Remodeled Houses:
Price = 166419.209 + 118.14*Square_Feet + 90325.284
Price = (166419.209 + 90325.284) + 118.14*Square_Feet
Price = 256744.5 + 118.14*Square_Feet
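A hedged sketch of how a parallel lines model is specified in R; `houses_hw6` and its values are made up to stand in for the HW 6 data:

```{r parallel lines example, echo=T}
# toy stand-in for the HW 6 remodeled-houses data (values made up)
houses_hw6 <- data.frame(
  Price       = c(250000, 310000, 420000, 365000, 280000, 455000),
  Square_Feet = c(1400, 1800, 2600, 2100, 1600, 2800),
  Remodeled   = factor(c("No", "Yes", "No", "Yes", "No", "Yes"))
)

# adding the factor gives separate intercepts and ONE common slope
pl_model <- lm(Price ~ Square_Feet + Remodeled, data = houses_hw6)
summary(pl_model)  # the RemodeledYes coefficient is the intercept shift
```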
Lectures 13 and 14 (HW 6)
Categorical Regression - Interaction Model (Practice Questions 15 - 21)
How do we determine if there are two or more separate intercepts?

How is this model different from the Parallel Lines Model?

How do we determine if there are two or more different slopes?
HW 6 Diamonds Model Equations:
Model for Colorless Diamonds:
Price = -4446.56 + 10476.13*Weight
Model for Faint Yellow Diamonds:
Price = -4446.56 + 10476.13*Weight + 3464.41 - 6670.53*Weight
Price = (-4446.56 + 3464.41) + (10476.13 - 6670.53)*Weight
Price = -982.15 + 3805.6*Weight
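A hedged sketch of the interaction model specification; `diamonds_hw6` and its values are made up to stand in for the HW 6 data:

```{r interaction model example, echo=T}
# toy stand-in for the HW 6 diamonds data (values made up)
diamonds_hw6 <- data.frame(
  Price  = c(5200, 9800, 2300, 3100, 1500, 2600),
  Weight = c(0.9, 1.4, 0.6, 1.1, 0.7, 0.95),
  Color  = factor(c("Colorless", "Colorless", "Colorless",
                    "Faint Yellow", "Faint Yellow", "Faint Yellow"))
)

# Weight * Color expands to Weight + Color + Weight:Color,
# giving each color its own intercept AND its own slope
int_model <- lm(Price ~ Weight * Color, data = diamonds_hw6)
summary(int_model)
```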
Lectures 15 - 17 (HW 8 - Part 1)
Model Selection
Examining Data using Correlation and Scatterplot Matrices (See above)
Definition of Multicollinearity and how to determine if two variables are multicollinear
Definitions and R commands for the following methods (see the sketch at the end of this section):
- Backward Elimination, Forward Selection, and Stepwise Selection
- Best Subsets (AIC, Mallows' C(p), Adjusted \(R^2\), RMSE)
Interpreting Measures of Model Fit
- Adjusted \(R^2\), AIC, Mallows' C(p), RMSE
Interpreting Final Model
- Same as for other MLR models and SLR models
- Remember to back-transform estimates if the LN transformation is used
- Residual = Observed Y - Estimate of Y
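A hedged sketch of the remaining `olsrr` selection commands (Backward Elimination is sketched earlier); `full_model` is the model from that sketch, the cutoffs are illustrative, and argument names can differ across `olsrr` versions:

```{r model selection example, echo=T}
# forward selection: add predictors while the P-value is below penter
ols_step_forward_p(full_model, penter = 0.1)

# stepwise selection: combines forward entry and backward removal
ols_step_both_p(full_model, pent = 0.1, prem = 0.1)

# best subsets: fits all predictor subsets and reports Adjusted
# R-squared, AIC, Mallows' C(p), etc. for each
ols_step_best_subset(full_model)
```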
Lecture 18 (HW 8 - Part 2) - Logistic Regression
Definition of Odds: Odds is the ratio of the probability of an event occurring to the probability of it not occurring.
Converting Probability to Odds

- Probability is denoted as P or P(Event), e.g. P(Late Payment)
- \(Odds = \frac{P(Event)}{1-P(Event)} = \frac{P}{1-P}\)

Converting Odds to Probability (P)

- \(P = \frac{Odds}{1+Odds}\)

LN Odds (the logit) are used as the link function in Logistic Regression.
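A quick worked check of the two conversions (the probability 0.2 is illustrative): if \(P = 0.2\), then \(Odds = \frac{0.2}{1-0.2} = 0.25\); converting back, \(P = \frac{0.25}{1+0.25} = 0.2\).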
Logistic Regression
Logistic Regression is used when Y is binary, a categorical variable with two categories such as:
- Yes or No
- Passed or Failed
- Survived or Not Survived (Titanic Example in Lecture 18)
- Late Payment or Not (Examples in Lecture 18, HW 8, and Practice Questions)
We specify the Logistic Regression Model in almost the same way as an MLR model EXCEPT we use `glm` (generalized linear model) instead of `lm` (linear model).

- GLM relaxes the LM assumption that the response is quantitative and normal.
Back Transforming Logistic Regression Estimates
Estimated Response, Y’, is the LN Odds of an Event
Convert LN Odds, Y', to Probability as: \(P = \frac{e^{Y'}}{1 + e^{Y'}}\)

Recall that in R and Excel, \(e^{x}\) is calculated with the `exp` function: \(e^{3}\) is `exp(3)` in R or `=exp(3)` in Excel.
Estimated LN Odds from Logistic Regression are converted to probability for interpretation (next slide).
Lecture 19 In-class Exercises - Q5

Session ID: bua345s25
The log odds for survival of a female child in second class was 2.0873 (see worksheet from Lecture 18).
What was the probability of survival for a female child in second class?
Examples of Back Transformation Calculations in R
These calculations can be done in the console or a .qmd file
```{r log odds to probability example, echo=T}
log_odds <- -1.4067 # answer from HW 8 - Part 2 - Question 5
exp(log_odds)/(1 + exp(log_odds)) # calculation in R using exp function
exp(-1.4067)/(1+exp(-1.4067))
plogis(log_odds) # calculation in R using plogis function
plogis(-1.4067) # calculation in R using plogis function and number
```
[1] 0.1967551
[1] 0.1967551
[1] 0.1967551
[1] 0.1967551
Key Points
Topics covered in Quiz 2
Simple Linear Regression (From Quiz 1)
Multiple Linear Regression (with all quantitative terms)
Categorical Regression
- Parallel Lines Models and Interaction Models
Model Selection: Backward, Forward, Stepwise and Best Subsets
Goodness of Model Fit: Adj. \(R^2\), AIC, Mallows' C(p), RMSE
Logistic Regression
Odds, Log Odds, Converting Odds and Log Odds to Probability
Model Estimates
To submit an Engagement Question or Comment about material from Lecture 19, submit it by midnight today (the day of lecture).