BUA 345 - Lecture 19
Review for Quiz 2
Housekeeping
- HW 8 (Parts 1 and 2) is due tomorrow (3/26)
  - Part 1 of HW 8 pertains to Lectures 15 - 17
  - Part 2 of HW 8 pertains to Lecture 18 on Logistic Regression
  - Grace period ends Thursday (3/27)
- NO CLASS ON THURSDAY, 3/27
- NO OFFICE HOURS ON THURSDAY, 3/27
- Quiz 2 is Tuesday, 4/1 in class
  - There will NOT be an asynchronous option.
  - Practice Questions are available and demo videos will be posted by Thursday.
- Quiz 2 is primarily based on material from:
  - Lectures 9 - 18
  - HW Assignments 5, 6, 7, 8 Pt. 1, and 8 Pt. 2
Lectures 9 - 11 (HW 5)
- Correlation, Simple Linear Regression (SLR), and Multiple Linear Regression (MLR)
- How to calculate and interpret a correlation matrix in R (see the sketch below)
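As a quick reference, here is a minimal sketch using base R's `cor()`. The data frame is a small stand-in for the 200-row houses dataset: the first rows match the table below, and the remaining values are made up for illustration.

```{r correlation matrix example, echo=T}
# small stand-in for the 200-row houses data
# (first rows match the table below; remaining values are made up)
houses <- data.frame(
  Price     = c(217314, 238792, 222330, 206688, 88207, 195000, 104500, 176300),
  Area      = c(2498, 2250, 2712, 2284, 1480, 2300, 957, 1860),
  Bathrooms = c(2.5, 2.5, 3.0, 2.5, 1.5, 2.5, 1.0, 2.0),
  Age       = c(14, 10, 1, 17, 14, 16, 49, 18)
)

# pairwise Pearson correlations, rounded to 2 decimals
round(cor(houses), 2)
```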
Review of Scatterplot Matrices
| Price  | Area | Bathrooms | Age |
|--------|------|-----------|-----|
| 217314 | 2498 | 2.5       | 14  |
| 238792 | 2250 | 2.5       | 10  |
| 222330 | 2712 | 3.0       | 1   |
| 206688 | 2284 | 2.5       | 17  |
Lecture 19 In-class Exercises - Q1-Q2

Session ID: bua345s25
Question 1: What is the correlation between House_Age and Living_Area in the houses dataset?
Question 2: Are there any multicollinear variables in the following correlation matrix?

```
          Price  Area  Bathrooms   Age
Price      1.00  0.77       0.71 -0.38
Area       0.77  1.00       0.66 -0.22
Bathrooms  0.71  0.66       1.00 -0.52
Age       -0.38 -0.22      -0.52  1.00
```
Scatterplot Matrices
- Shows all pairwise scatterplots
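A one-line way to produce one in base R, reusing the `houses` data frame from the correlation sketch above:

```{r scatterplot matrix example, echo=T}
# all pairwise scatterplots of the numeric columns
pairs(houses)
```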
Specifying an SLR or MLR model in R

Model is specified with `ols_regress` in the `olsrr` package OR with `lm` (base R command).

Model format is always the same.
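A minimal sketch of both options, again assuming the `houses` data frame from above; the formula is identical either way:

```{r slr specification example, echo=T}
library(olsrr)

# base R
slr_lm <- lm(Price ~ Area, data = houses)
summary(slr_lm)

# olsrr: same model, with the more detailed output shown below
ols_regress(Price ~ Area, data = houses)
```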
Interpretation of \(R^2\) in SLR
```
                              Model Summary
--------------------------------------------------------------------------
R                       0.772       RMSE                      45426.628
R-Squared               0.596       MSE                  2063578544.951
Adj. R-Squared          0.594       Coef. Var                    27.670
Pred R-Squared          0.579       AIC                        4863.117
MAE                 31692.288       SBC                        4873.012
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error
  MSE: Mean Square Error
  MAE: Mean Absolute Error
  AIC: Akaike Information Criteria
  SBC: Schwarz Bayesian Criteria

                                      ANOVA
------------------------------------------------------------------------------------
                        Sum of
                       Squares      DF        Mean Square        F          Sig.
------------------------------------------------------------------------------------
Regression    609852999259.857      1    609852999259.857    292.576      0.0000
Residual      412715708990.143    198      2084422772.677
Total        1022568708250.000    199
------------------------------------------------------------------------------------

                                Parameter Estimates
-------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta        t      Sig        lower        upper
-------------------------------------------------------------------------------------------------
(Intercept)    16505.199      9262.237                 1.782    0.076    -1760.095    34770.493
       Area       82.588         4.828        0.772   17.105    0.000       73.066       92.110
-------------------------------------------------------------------------------------------------
```
Lecture 19 In-class Exercises - Q3

Session ID: bua345s25
The correlation between Selling_Price and Living_Area is 0.772, and the \(R^2\) for the SLR model is 0.596.

What proportion of the variability in selling price is explained by living area?
Lecture 19 In-class Exercises - Q4

Session ID: bua345s25
Residual = Observed Y - Est. Y = Model Response - Model Estimate
What is the residual for the second house shown in the data below?
| Price  | Area | Bathrooms | Age | Est_Selling_Price |
|--------|------|-----------|-----|-------------------|
| 217314 | 2498 | 2.5       | 14  | 229114            |
| 238792 | 2250 | 2.5       | 10  | 215025            |
| 222330 | 2712 | 3.0       | 1   | 260195            |
| 206688 | 2284 | 2.5       | 17  | 215436            |
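As a worked illustration of the residual arithmetic (deliberately using the first house rather than the one the question asks about):

```{r residual example, echo=T}
# residual = observed selling price - estimated selling price
observed  <- 217314   # Price of the first house above
estimated <- 229114   # Est_Selling_Price of the first house above
observed - estimated  # [1] -11800; the model over-estimates this house
```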
Additional Questions about MLR
(Not in PointSolutions)
Why is the natural log (LN) transformation of Y sometimes needed?

Recall that in R the command to do this is `log`. In Excel it is `ln`.
How do we back transform estimates from a model when LN(Y) is the response?

- Can be done in Excel or R using the `exp` function
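A minimal sketch of the back-transform; the estimate 12.35 is a made-up value, not from any HW model:

```{r back transform example, echo=T}
# suppose a model for LN(Price) gives an estimated response of 12.35
ln_price_hat <- 12.35

# back-transform to the original price scale
exp(ln_price_hat)  # about 230960
```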
How to interpret Multiple Linear Regression output

- What hypothesis is being tested in each line of output?
- What do we conclude if the P-value (sometimes labeled `Sig`) is greater than 0.05?
- Note that in Backward Elimination we set a P-value cutoff of 0.1 (`prem = 0.1`), but we can later exclude variables when determining the final model.
- Also note that Backward Elimination can alternatively be done using AIC or Adjusted \(R^2\).
Lectures 13 and 14 (HW 6)
Categorical Regression - Parallel Lines Model
How do we determine if there are two or more separate intercepts?
NOTE that slopes for ALL categories are the same in a parallel lines model.
HW 6 Remodeled Houses Model Equations:
Model for un-remodeled Houses:
Price = 166419.209 + 118.14*Square_Feet
For Remodeled Houses, combine the baseline intercept with the difference due to remodeling (`RemodeledYes`).

Model for Remodeled Houses:
Price = 166419.209 + 118.14*Square_Feet + 90325.284
Price = (166419.209 + 90325.284) + 118.14*Square_Feet
Price = 256744.5 + 118.14*Square_Feet
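A hedged sketch of how a parallel lines model is specified in R; `houses_hw6` and its values are made up to stand in for the HW 6 data:

```{r parallel lines example, echo=T}
# toy stand-in for the HW 6 remodeled-houses data (values made up)
houses_hw6 <- data.frame(
  Price       = c(250000, 310000, 420000, 365000, 280000, 455000),
  Square_Feet = c(1400, 1800, 2600, 2100, 1600, 2800),
  Remodeled   = factor(c("No", "Yes", "No", "Yes", "No", "Yes"))
)

# adding the factor gives separate intercepts and ONE common slope
pl_model <- lm(Price ~ Square_Feet + Remodeled, data = houses_hw6)
summary(pl_model)  # the RemodeledYes coefficient is the intercept shift
```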
Lectures 13 and 14 (HW 6)
Categorical Regression - Interaction Model (Practice Questions 15 - 21)
How do we determine if there are two or more separate intercepts?

How is this model different from the Parallel Lines Model?

How do we determine if there are two or more different slopes?
HW 6 Diamonds Model Equations:
Model for Colorless Diamonds:
Price = -4446.56 + 10476.13*Weight
Model for Faint Yellow Diamonds:
Price = -4446.56 + 10476.13*Weight + 3464.41 - 6670.53*Weight
Price = (-4446.56 + 3464.41) + (10476.13 - 6670.53)*Weight
Price = -982.15 + 3805.6*Weight
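A hedged sketch of the interaction model specification; `diamonds_hw6` and its values are made up to stand in for the HW 6 data:

```{r interaction model example, echo=T}
# toy stand-in for the HW 6 diamonds data (values made up)
diamonds_hw6 <- data.frame(
  Price  = c(5200, 9800, 2300, 3100, 1500, 2600),
  Weight = c(0.9, 1.4, 0.6, 1.1, 0.7, 0.95),
  Color  = factor(c("Colorless", "Colorless", "Colorless",
                    "Faint Yellow", "Faint Yellow", "Faint Yellow"))
)

# Weight * Color expands to Weight + Color + Weight:Color,
# giving each color its own intercept AND its own slope
int_model <- lm(Price ~ Weight * Color, data = diamonds_hw6)
summary(int_model)
```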
Lectures 15 - 17 (HW 8 - Part 1)
Model Selection
Examining Data using Correlation and Scatterplot Matrices (See above)
Definition of Multicollinearity and how to determine if two variables are multicollinear
Definitions and R commands for the following methods (see the sketch at the end of this section):
- Backward Elimination, Forward Selection, and Stepwise Selection
- Best Subsets (AIC, Mallows' C(p), Adjusted \(R^2\), RMSE)
Interpreting Measures of Model Fit
- Adjusted \(R^2\), AIC, Mallows' C(p), RMSE
Interpreting Final Model
- Same as for other MLR models and SLR models
- Remember to back-transform estimates if the LN transformation is used
- Residual = Observed Y - Estimate of Y
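A hedged sketch of the remaining `olsrr` selection commands (Backward Elimination is sketched earlier); `full_model` is the model from that sketch, the cutoffs are illustrative, and argument names can differ across `olsrr` versions:

```{r model selection example, echo=T}
# forward selection: add predictors while the P-value is below penter
ols_step_forward_p(full_model, penter = 0.1)

# stepwise selection: combines forward entry and backward removal
ols_step_both_p(full_model, pent = 0.1, prem = 0.1)

# best subsets: fits all predictor subsets and reports Adjusted
# R-squared, AIC, Mallows' C(p), etc. for each
ols_step_best_subset(full_model)
```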
Lecture 18 (HW 8 - Part 2) - Logistic Regression
Definition of Odds: Odds is the ratio of the probability of an event occurring to the probability of it not occurring.
Converting Probability to Odds

- Probability is denoted as P or P(Event), e.g. P(Late Payment)
- \(Odds = \frac{P(Event)}{1-P(Event)} = \frac{P}{1-P}\)

Converting Odds to Probability (P)

- \(P = \frac{Odds}{1+Odds}\)

LN Odds (the logit) are used as the link function in Logistic Regression.
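A quick worked check of the two conversions (the probability 0.2 is illustrative): if \(P = 0.2\), then \(Odds = \frac{0.2}{1-0.2} = 0.25\); converting back, \(P = \frac{0.25}{1+0.25} = 0.2\).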
Logistic Regression
Logistic Regression is used when Y is binary, a categorical variable with two categories such as:
- Yes or No
- Passed or Failed
- Survived or Not Survived (Titanic Example in Lecture 18)
- Late Payment or Not (Examples in Lecture 18, HW 8, and Practice Questions)
We specify the Logistic Regression Model in almost the same way as an MLR model EXCEPT we use `glm` (generalized linear model) instead of `lm` (linear model).

- GLM relaxes the LM assumption that the response is quantitative and normal.
Back Transforming Logistic Regression Estimates
Estimated Response, Y’, is the LN Odds of an Event
Convert LN Odds, Y', to Probability as: \(P = \frac{e^{Y'}}{1 + e^{Y'}}\)

Recall that in R and Excel, \(e^{x}\) is calculated with the `exp` function: \(e^{3}\) is `exp(3)` in R or `=exp(3)` in Excel.
Estimated LN Odds from Logistic Regression are converted to probability for interpretation (next slide).
Lecture 19 In-class Exercises - Q5

Session ID: bua345s25
The log odds for survival of a female child in second class was 2.0873 (see worksheet from Lecture 18).
What was the probability of survival for a female child in second class?
Examples of Back Transformation Calculations in R
These calculations can be done in the console or a .qmd file
```{r log odds to probability example, echo=T}
log_odds <- -1.4067 # answer from HW 8 - Part 2 - Question 5
exp(log_odds)/(1 + exp(log_odds)) # calculation in R using exp function
exp(-1.4067)/(1+exp(-1.4067))
plogis(log_odds) # calculation in R using plogis function
plogis(-1.4067) # calculation in R using plogis function and number
```
[1] 0.1967551
[1] 0.1967551
[1] 0.1967551
[1] 0.1967551
Key Points
Topics covered in Quiz 2
Simple Linear Regression (From Quiz 1)
Multiple Linear Regression (with all quantitative terms)
Categorical Regression
- Parallel Lines Models and Interaction Models
Model Selection: Backward, Forward, Stepwise and Best Subsets
Goodness of Model Fit: Adj. \(R^2\), AIC, Mallows' C(p), RMSE
Logistic Regression
Odds, Log Odds, Converting Odds and Log Odds to Probability
Model Estimates
To submit an Engagement Question or Comment about material from Lecture 19, submit it by midnight today (the day of lecture).