BUA 345 - Lecture 18
Logistic Regression
Housekeeping
Upcoming Dates
HW 7 was due Yesterday (3/19)
HW 8 (Parts 1 and 2) are posted and due on Wednesday (3/26)
Part 1 of HW 8 pertained to Lectures 15 - 17
Part 2 of HW 8 pertains to today’s lecture on Logistic Regression
NO CLASS ON 3/27
Quiz 2 is Tuesday, April 1st
There will NOT be an asynchronous option.
Practice Questions will be posted this weekend.
NEW PACKAGE FOR LOGISTIC REGRESSION: blorr
- If you are having trouble installing or loading any packages or components of R or RStudio, please come to office hours or make an appointment with me.
Lecture 18 In-class Exercises - Q1
Session ID: bua345s25
Review Question from Lecture 17 and HW 8.
In Lecture 17, we discussed three fit statistics, measures of how well the model ‘fits’ the data.
Fill in the blank(s) using the correct choice. Ideally our chosen model has the ____ Adjusted \(R^2\), the ____ Mallows' \(C_p\), and the ____ AIC.
Models we’ve covered so far
Up until now in MAS 261 and BUA 345:
All of our regression models (SLR and MLR) have had a response, Y, that was QUANTITATIVE and ideally normal (or transformed to be normal):
amount of sleep
selling price
natural log of insurance charges
etc.
How Logistic Regression is different
Today we’ll model responses (y variables) that are CATEGORICAL and BINARY, such as
yes or no
survive or not
disease or no disease
LOGISTIC REGRESSION is used when your response is BINARY (has two categories)
This is one type of generalization of the linear model format you’ve already learned.
There are MANY other model generalizations for other data types (not covered in BUA 345).
These models are called GENERALIZED LINEAR MODELS (GLM):
- We are relaxing the assumption that the response, Y, is quantitative and normal.
Logistic Regression Models
Probability and Odds
In order to understand logistic regression models, it is helpful to know:
The differences between Probability and Odds
How to convert Probability to Odds and vice versa
How to convert Log Odds to probability using the exp OR plogis function
We cover these concepts FIRST, because:
Log Odds, LN(ODDS), is the link (link function)
This function LINKS our two category response, Y, to our predictor (X) variables:
If we understand this link, then we can understand and interpret Logistic Regression analyses.
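In symbols, the link can be written as shown here. \(P\) is the probability of the 'yes' category, and \(\beta_0, \dots, \beta_k\) are generic coefficient labels assumed for illustration (they do not appear in the lecture output):

\[ LN(Odds) = LN\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \]

The right-hand side has the same linear form as the SLR and MLR models from earlier lectures; only the left-hand side has changed.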
Probability and Odds are NOT the same!
Dice Example
The probability of rolling any given side of a fair die is 1/6:
\[P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 \]
The odds of rolling any given side, e.g., a 2, are as follows:
\[ Odds(2) = \frac{P(2)}{1-P(2)} = \frac{\frac{1}{6}}{1-\frac{1}{6}} = \frac{\frac{1}{6}}{\frac{5}{6}} = \frac{1}{5} \]
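The same calculation can be sketched in R. The helper name prob_to_odds is hypothetical (it is not part of base R or the course code); it simply applies Odds = P / (1 - P).

```r
# Hypothetical helper: convert a probability to odds, Odds = P / (1 - P)
prob_to_odds <- function(p) {
  p / (1 - p)
}

p_two <- 1 / 6                    # probability of rolling a 2 with a fair die
odds_two <- prob_to_odds(p_two)   # (1/6) / (5/6)
print(odds_two)                   # 0.2, i.e. 1/5 as in the formula above
```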
Lecture 18 In-class Exercises - Q2-Q3
Question 2
The probability (p) of rolling a 3 or higher with a single die is 4/6 or 2/3, i.e., \(P(3,4,5,6) = \frac{2}{3}\).
What are the ODDS of rolling a 3 or higher?
Hint: It is possible for odds to be greater than 1.
Question 3
If you flip a fair coin, the probability of heads = probability of tails = 1/2, i.e., \(P(Heads) = P(Tails) = \frac{1}{2}\)
What are the ODDS of a coin landing on heads?
Odds and Betting
Horse-racing and betting are where people commonly hear the term odds.
For example, a horse in a race has 9 to 1 odds of winning.
What is the probability this horse will win?
How do we calculate probability from odds?
\[ Probability = \frac{Odds}{1 + Odds} \]
Betting odds of 9 to 1 are quoted AGAINST winning, so the odds of winning are Odds = \(\frac{1}{9}\)
\[ P(Win) = \frac{\frac{1}{9}}{1 + \frac{1}{9}} = \frac{\frac{1}{9}}{\frac{10}{9}} = \frac{1}{10} = 0.1 \]
Conclusion: A horse with 9 to 1 odds of winning has a probability of 0.1 or a 10% chance of winning the horse race.
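As a quick check of the horse-racing arithmetic, here is a minimal R sketch. The helper name odds_to_prob is hypothetical; it applies Probability = Odds / (1 + Odds).

```r
# Hypothetical helper: convert odds to a probability, P = Odds / (1 + Odds)
odds_to_prob <- function(odds) {
  odds / (1 + odds)
}

odds_win <- 1 / 9                 # odds of winning for a horse listed at 9 to 1
p_win <- odds_to_prob(odds_win)   # (1/9) / (10/9)
print(p_win)                      # 0.1, i.e. a 10% chance of winning
```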
Lecture 18 In-class Exercises - Q4
In Texas Hold’em, each player gets two face-down cards in their ‘pocket’. There are also face-up community cards that everyone can use to make their best hand.
The odds of getting a ‘Pocket Pair’, two face-down cards of the same value, are 16 to 1 against.
\[Odds(Pocket Pair) = \frac{1}{16}\]
What is the probability of a pocket pair? Round your answer to 3 decimal places
Recall:
\[Probability = \frac{Odds}{1 + Odds}\]
Why Odds are useful to Logistic Regression
Probability is more intuitive
Odds, and more specifically, LN(Odds) are the magic LINK between a binary response and our predictor variables.
When we do Logistic Regression, the estimated response will be LN(Odds)
Just like we back transform LN(Y) to get Y, we can convert LN(Odds) to probabilities.
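The back-transformation takes two steps. First undo the natural log to recover the odds, then apply the odds-to-probability formula from the betting example:

\[ Odds = e^{LN(Odds)} \]

\[ Probability = \frac{Odds}{1 + Odds} = \frac{e^{LN(Odds)}}{1 + e^{LN(Odds)}} \]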
Why back-transform Log-odds
We DON’T want to know:
- the estimated log odds of winning a game
- the estimated log odds that it will rain tomorrow
We DO want to know:
- the estimated PROBABILITY of winning a game
- the estimated PROBABILITY that it will rain tomorrow
Back-Transforming Log-Odds in Excel and R
We can convert the estimated log odds to a probability in Excel or R:
Y’ = Estimated Log Odds = Estimate of \(LN(\frac{P}{1-P})\) from model
Y’ is the model estimate that we want to convert to a probability
Est. Probability, \(P = \frac{e^{Y'}}{1+e^{Y'}}\)
This can be calculated in Excel or R using the exp function.
This conversion can be done more simply in R with the plogis function.
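A short R sketch of both approaches, reusing the horse-racing example (odds of 1/9) so the answer can be checked by hand. The variable names are illustrative only.

```r
# Log odds for the horse-racing example: LN(1/9)
log_odds <- log(1 / 9)

# Method 1: apply P = e^Y' / (1 + e^Y') directly with exp()
prob_exp <- exp(log_odds) / (1 + exp(log_odds))

# Method 2: plogis() performs the same conversion in one step
prob_plogis <- plogis(log_odds)

print(c(exp_formula = prob_exp, plogis = prob_plogis))  # both are 0.1
```

In Excel, the first method corresponds to a formula of the form =EXP(x)/(1+EXP(x)), where x is the cell holding the estimated log odds.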
Lecture 18 In-class Exercises - Q5
Converting Log Odds:
A logistic regression model predicts the log odds that a small business owner will be audited by the IRS based on relevant predictor variables.
Based on this model, a cafe in Syracuse determines that the estimated log odds of being audited in 2024 is -1.946, Y’ = -1.946
If Estimated Log Odds = Y’ = -1.946, what is the probability that this cafe will be audited?
Recall: Est. Probability, \(P = \frac{e^{Y'}}{1+e^{Y'}}\)
Logistic Regression in R
The following examples will show how logistic regression uses LN(Odds) as a link function to model a two category (binary) response.
The model estimates, LN(Odds), can be converted to probabilities.
For HWs, quizzes, and the final exam you will be expected to:
calculate odds from probabilities.
calculate probabilities from odds.
estimate probabilities from log odds estimates from a logistic regression model.
Titanic data
More complete version of Titanic passenger data than we worked with before.
All data are CATEGORICAL:
Response, Y, is Survived: Yes or No
Predictors, X variables, are:
Class (of Passenger ticket): First, Second, Third, Crew
Age (Category): Adult, Child
Gender: Male, Female
What is the estimated probability of survival based on gender, age category, and passenger class?
Verifying Data for Logistic Regression
In order to use Logistic Regression the following conditions must be met:
Each category of each predictor variable has more than ONE observation in each response category
MOST categories have more than FIVE observations in each response category
Class | Age | Gender | Survived |
---|---|---|---|
Crew | Adult | Male | Yes |
First | Adult | Female | Yes |
Crew | Adult | Male | No |
Third | Adult | Male | No |
First | Adult | Male | No |
Crew | Adult | Male | No |
Third | Adult | Male | No |
First | Adult | Male | No |
Verifying Data for Logistic Regression
- These tables show that all categories of each predictor variable have some ‘Yes’ and some ‘No’ observations.
- All categories are represented in both categories of response, so all predictors can be used in model.
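Class
Survived | Crew | First | Second | Third |
---|---|---|---|---|
No | 673 | 122 | 167 | 528 |
Yes | 212 | 203 | 118 | 178 |
Age
Survived | Adult | Child |
---|---|---|
No | 1438 | 52 |
Yes | 654 | 57 |
Gender
Survived | Female | Male |
---|---|---|
No | 126 | 1364 |
Yes | 344 | 367 |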
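Counts like these can be produced with a few lines of R. This sketch assumes the Titanic table built into R (datasets package) holds the same passengers as the course data. Its counts agree with the tables above, but its labels differ slightly (1st, 2nd, 3rd instead of First, Second, Third, and Sex instead of Gender). The variable names are illustrative.

```r
# Titanic is a 4-way table built into R with dimensions
# Class (1), Sex (2), Age (3), Survived (4)
survived_by_class  <- margin.table(Titanic, c(4, 1))  # Survived x Class
survived_by_age    <- margin.table(Titanic, c(4, 3))  # Survived x Age
survived_by_gender <- margin.table(Titanic, c(4, 2))  # Survived x Sex

print(survived_by_class)
print(survived_by_age)
print(survived_by_gender)

# First condition: every cell has more than ONE observation
all_cells_ok <- all(survived_by_class > 1,
                    survived_by_age > 1,
                    survived_by_gender > 1)
print(all_cells_ok)
```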
Specifying the Logistic Regression Model
The command we use in R, glm, is used for Generalized Linear Models.
For BUA 345, you are expected to interpret the coefficients from a logistic regression model.
Use provided code to get model estimates, LN(Odds)
Convert Estimated LN(Odds) to probabilities
In more advanced analytics courses, we will talk about model fit and validation.
Steps for Specifying and Displaying the Model
Create a glm model named titanic_logistic
family=binomial(link = 'logit') specifies that the model has a binary (two-category) response
Output model results using the blr_regress command from the blorr package.
- If the blorr package doesn’t work, an alternative command is summary
R code to Specify and Summarize Titanic Model
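The course code is not reproduced here, so what follows is only a sketch of how such a model could be specified. It assumes the Titanic table built into R matches the course data, and it uses summary rather than blr_regress so that it runs without extra packages. The data frame name titanic_df is hypothetical, and the coefficient labels differ slightly from the lecture output (Class1st rather than ClassFirst).

```r
# Expand the built-in Titanic table: one row per category combination,
# with the number of passengers in the Freq column
titanic_df <- as.data.frame(Titanic)
names(titanic_df)[names(titanic_df) == "Sex"] <- "Gender"

# Set the baseline categories used in the lecture: Crew, Adult, Female
titanic_df$Class  <- relevel(titanic_df$Class,  ref = "Crew")
titanic_df$Age    <- relevel(titanic_df$Age,    ref = "Adult")
titanic_df$Gender <- relevel(titanic_df$Gender, ref = "Female")

# Logistic regression: binary response with the logit (log odds) link.
# weights = Freq tells glm how many passengers each row represents.
titanic_logistic <- glm(Survived ~ Class + Age + Gender,
                        family = binomial(link = "logit"),
                        weights = Freq,
                        data = titanic_df)

# Estimated coefficients (log odds scale); summary() adds standard
# errors, z values, and P-values
print(round(coef(titanic_logistic), 4))
print(summary(titanic_logistic))
```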
Model Overview
Data Set | Resp Var | Obs. | Df. Model | Df. Residual | Convergence |
---|---|---|---|---|---|
data | SurvivedF | 2201 | 2200 | 2195 | TRUE |
Response Summary
Outcome | Frequency |
---|---|
0 | 1490 |
1 | 711 |
Maximum Likelihood Estimates
Parameter | DF | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|---|
(Intercept) | 1 | 1.1862 | 0.1586 | 7.4805 | 0.0000 |
ClassFirst | 1 | 0.8577 | 0.1573 | 5.4511 | 0.0000 |
ClassSecond | 1 | -0.1604 | 0.1738 | -0.9231 | 0.3560 |
ClassThird | 1 | -0.9201 | 0.1486 | -6.1923 | 0.0000 |
AgeChild | 1 | 1.0615 | 0.2440 | 4.3501 | 0.0000 |
GenderMale | 1 | -2.4201 | 0.1404 | -17.2358 | 0.0000 |
Association of Predicted Probabilities and Observed Responses
Statistic | Value | Statistic | Value |
---|---|---|---|
% Concordant | 0.6768 | Somers' D | 0.6227 |
% Discordant | 0.1574 | Gamma | 0.5195 |
% Tied | 0.1658 | Tau-a | 0.2273 |
Pairs | 1059390 | c | 0.7597 |
Logistic Regression Model Estimates and Fit
Part 2 of Output - Only Use Max. Likelihood Estimates in BUA 345
Add Log Odds and Probabilities to Titanic Data
Class | Age | Gender | Survived | Log_Odds_Survival | Prob_Survival |
---|---|---|---|---|---|
Crew | Adult | Male | Yes | -1.2339 | 0.2255 |
First | Adult | Female | Yes | 2.0438 | 0.8853 |
Crew | Adult | Male | No | -1.2339 | 0.2255 |
Third | Adult | Male | No | -2.1540 | 0.1040 |
First | Adult | Male | No | -0.3762 | 0.4070 |
Crew | Adult | Male | No | -1.2339 | 0.2255 |
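Columns like Log_Odds_Survival and Prob_Survival can be obtained from a fitted glm with predict. The sketch below is not the course code: it assumes the Titanic table built into R matches the course data (so the labels are 1st, 3rd, and Sex), and the names fit and lecture_rows are hypothetical.

```r
# Fit the model on the built-in Titanic counts (one row per category
# combination, weighted by the number of passengers)
titanic_df <- as.data.frame(Titanic)
fit <- glm(Survived ~ Class + Age + Sex,
           family = binomial(link = "logit"),
           weights = Freq, data = titanic_df)

# The four distinct category combinations shown in the table above
lecture_rows <- data.frame(
  Class = factor(c("Crew", "1st", "3rd", "1st"), levels = levels(titanic_df$Class)),
  Age   = factor("Adult", levels = levels(titanic_df$Age)),
  Sex   = factor(c("Male", "Female", "Male", "Male"), levels = levels(titanic_df$Sex))
)

# type = "link" returns estimated log odds; type = "response" returns
# the corresponding probabilities
lecture_rows$Log_Odds_Survival <- unname(predict(fit, newdata = lecture_rows, type = "link"))
lecture_rows$Prob_Survival     <- unname(predict(fit, newdata = lecture_rows, type = "response"))
print(lecture_rows)
```

The probabilities from type = "response" are the same values plogis gives when applied to the log odds column.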
Titanic Model Interpretation
We can use an Excel spreadsheet (like we did for MLR) to estimate log odds and probabilities for each category
For example:
What is the probability that a Female Adult Crew member survived?
What is the probability that a Male Adult in Third Class survived?
What is the probability that a Male Child in First Class survived?
Recall:
Baseline categories (first alphabetically) are not shown in the output.
Baseline Class: Crew
Baseline Age category: Adult
Baseline Gender: Female
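The spreadsheet calculation can also be sketched in R. The coefficient values are copied from the Maximum Likelihood Estimates output. The vector b and the helper titanic_log_odds are hypothetical names introduced only for this illustration.

```r
# Coefficient estimates copied from the Maximum Likelihood Estimates output
b <- c(Intercept = 1.1862, First = 0.8577, Second = -0.1604,
       Third = -0.9201, Child = 1.0615, Male = -2.4201)

# Hypothetical helper: start at the intercept and add the coefficient of
# every non-baseline category. Baseline categories (Crew, Adult, Female)
# have no coefficient, so they add nothing.
titanic_log_odds <- function(class, age, gender) {
  categories <- c(class, age, gender)
  non_baseline <- categories[categories %in% names(b)]
  unname(b["Intercept"] + sum(b[non_baseline]))
}

# Two rows from the table of estimated log odds and probabilities
lo_crew_male    <- titanic_log_odds("Crew", "Adult", "Male")     # 1.1862 - 2.4201
lo_first_female <- titanic_log_odds("First", "Adult", "Female")  # 1.1862 + 0.8577

print(round(c(log_odds = lo_crew_male, prob = plogis(lo_crew_male)), 4))
print(round(c(log_odds = lo_first_female, prob = plogis(lo_first_female)), 4))
```

Because the coefficients are rounded to four decimals, results can differ from the table in the last digit.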
Lecture 18 In-class Exercises - Q6-Q7
Question 6
Based on the Titanic Logistic Regression Model, what is the probability that a Male Adult in Third Class survived?
Use provided worksheet to do calculation. Round answer to 3 decimal places.
Question 7
Based on the Titanic Logistic Regression Model, what is the probability that a Male Child in First Class survived?
Use provided worksheet to do calculation. Round answer to 3 decimal places.
Two Plots for Context
- The Titanic data are commonly used because the differences in survival between classes and genders are so clear.
- When looking at probabilities by category, it is also important to look at numbers of observations in each category.
Late Payment Data
Logistic Regression w/ Quantitative Predictors
- All three predictor (X) variables are QUANTITATIVE
Age | Dependents | Debt_Ratio | Late_Payment |
---|---|---|---|
57 | 0 | 0.39 | No |
34 | 0 | 0.51 | Yes |
42 | 0 | 0.48 | No |
34 | 0 | 0.24 | No |
66 | 0 | 0.15 | No |
53 | 0 | 0.11 | No |
50 | 2 | 0.31 | No |
46 | 1 | 0.35 | No |
29 | 0 | 0.25 | No |
57 | 0 | 0.08 | No |
Correlation Matrix - No Multicollinearity
All correlations between X variables are \(\lt 0.8\)
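A correlation check of this kind can be sketched with cor. The full late payment data set is not reproduced in these notes, so this example uses only the 10 preview rows shown above. The resulting correlations are illustrative and will not equal those from the full data; the name late_preview is hypothetical.

```r
# The 10 preview rows of the late payment predictors shown above
late_preview <- data.frame(
  Age        = c(57, 34, 42, 34, 66, 53, 50, 46, 29, 57),
  Dependents = c(0, 0, 0, 0, 0, 0, 2, 1, 0, 0),
  Debt_Ratio = c(0.39, 0.51, 0.48, 0.24, 0.15, 0.11, 0.31, 0.35, 0.25, 0.08)
)

# Pairwise correlations between the X variables
cor_matrix <- cor(late_preview)
print(round(cor_matrix, 2))

# Multicollinearity check: every off-diagonal correlation below 0.8
# in absolute value
off_diag <- abs(cor_matrix[upper.tri(cor_matrix)])
print(all(off_diag < 0.8))
```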
Steps to Specify and Summarize the Late Payment Model
Create a glm model named latepmt_logistic
Output model results using blr_regress (blorr package) or summary
Part 2 of Output (Only Use Max. Likelihood Estimates in BUA 345)
The Estimate column shows the estimated beta coefficients for our model.
We will use these beta coefficients in Excel to show how regression estimates of log odds and probability are calculated for each observation (consumer) in the data.
The Pr(>|z|) column shows the P-value for each beta coefficient.
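Written out, the calculation for one consumer has the form shown here. The labels \(b_0, b_1, b_2, b_3\) are placeholders, assumed for illustration, for the values in the Estimate column (the actual estimates are not reproduced in these notes):

\[ Y' = b_0 + b_1 \cdot Age + b_2 \cdot Dependents + b_3 \cdot Debt\_Ratio \]

\[ P(Late\ Payment) = \frac{e^{Y'}}{1 + e^{Y'}} \]

The first line gives the estimated log odds (the Log_Odds_LP column) and the second converts it to a probability (the Prob_LP column).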
Add Log Odds and Probabilities to Late Payment Data
Age | Dependents | Debt_Ratio | Late_Payment | Log_Odds_LP | Prob_LP |
---|---|---|---|---|---|
57 | 0 | 0.39 | No | -1.6849 | 0.1564 |
34 | 0 | 0.51 | Yes | -1.3017 | 0.2139 |
42 | 0 | 0.48 | No | -1.4215 | 0.1944 |
34 | 0 | 0.24 | No | -1.6111 | 0.1664 |
66 | 0 | 0.15 | No | -2.0561 | 0.1134 |
53 | 0 | 0.11 | No | -1.9631 | 0.1231 |
Lecture 18 In-class Exercises - Q8
Late Payment Model Interpretation
We can use an Excel spreadsheet (like we did for MLR) to estimate log odds and probabilities for any combination of predictor values
For example:
What is the probability of a late payment for a 30-year-old with 1 dependent and a debt ratio of 0.3?
What would be the Percent Change in that probability if another 30-year-old had 4 dependents and a debt ratio of 0.5? (HW 8)
Use provided worksheet to do calculation. Round answer to 3 decimal places.
HW 8 - Part 2 and Upcoming Dates
HW 7 was due Yesterday (3/19)
HW 8 (Parts 1 and 2) are posted and due on Wednesday (3/26)
Part 1 of HW 8 pertained to Lectures 15 - 17
Part 2 of HW 8 pertains to today’s lecture on Logistic Regression
NO CLASS ON 3/27
Quiz 2 is Tuesday, April 1st
- Practice Questions are posted and videos will be posted next week.
Key Points from this Week
Logistic Regression is useful for predicting outcomes
Helpful in decision making - provides probabilities
Underlying math of GLM is more complex
Software (such as R) allows students to bypass math and focus on interpretation
Important to understand
Difference between Odds and Probabilities
How to convert odds to probabilities and vice versa
How to convert log odds to probabilities (you can use the plogis function in R)
To submit an Engagement Question or Comment about material from Lecture 18: Submit it by midnight today (day of lecture).