Logistic Regression
2025-03-20
HW 7 was due Yesterday (3/19)
HW 8 (Parts 1 and 2) are posted and due on Wednesday (3/26)
Part 1 of HW 8 pertained to Lectures 15 - 17
Part 2 of HW 8 pertains to today’s lecture on Logistic Regression
NO CLASS ON 3/27
Quiz 2 is Tuesday, April 1st
There will NOT be an asynchronous option.
Practice Questions will be posted this weekend.
NEW PACKAGE FOR LOGISTIC REGRESSION: blorr
Session ID: bua345s25
Review Question from Lecture 17 and HW 8.
In Lecture 17, we discussed three fit statistics, measures of how well the model ‘fits’ the data.
Fill in the blank(s) using the correct choice. Ideally our chosen model has the
____ Adjusted \(R^2\)
____ Mallows' C(p)
____ AIC
Up until now in MAS 261 and BUA 345:
All of our regression models (SLR and MLR) have had a response, Y, that was QUANTITATIVE and ideally normal (or transformed to be normal):
amount of sleep
selling price
natural log of insurance charges
etc.
Today we’ll model responses (y variables) that are CATEGORICAL and BINARY, such as
yes or no
survive or not
disease or no disease
LOGISTIC REGRESSION is used when your response is BINARY (has two categories)
This is one type of generalization of the linear model format you’ve already learned.
There are MANY other model generalizations for other data types (not covered in BUA 345).
These models are called GENERALIZED LINEAR MODELS (GLM).
In order to understand logistic regression models, it is helpful to know:
The differences between Probability and Odds
How to convert Probability to Odds and vice versa
How to convert Log Odds to probability using the exp or plogis function
We cover these concepts FIRST, because:
Log Odds, LN(ODDS), is the link (link function)
This function LINKS our two category response, Y, to our predictor (X) variables.
If we understand this link, then we can understand and interpret Logistic Regression analyses.
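Written out, the link takes the following form. The coefficient symbols \(\beta_0, \dots, \beta_k\) and predictor labels \(X_1, \dots, X_k\) are notation assumed here for illustration; P is the probability of the response category being modeled.

\[ LN(Odds) = LN\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k \]

The right-hand side has the same form as an MLR model; only the left-hand side changes.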
Question 2
The probability(p) of rolling a 3 or higher with a single die is 4/6 or 2/3, i.e., \(P(3,4,5,6) = \frac{2}{3}\).
What are the ODDS of rolling a 3 or higher?
Hint: It is possible for odds to be greater than 1.
Question 3
If you flip a fair coin, the probability of heads = probability of tails = 1/2, i.e., \(P(Heads) = P(Tails) = \frac{1}{2}\)
What are the ODDS of a coin landing on heads?
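A minimal R sketch of the probability-to-odds conversion used in these two questions, assuming the usual definition Odds = P / (1 - P). The function name prob_to_odds is hypothetical and only for illustration.

```r
# Convert a probability p (0 <= p < 1) to odds: odds = p / (1 - p)
prob_to_odds <- function(p) {
  stopifnot(all(p >= 0 & p < 1))
  p / (1 - p)
}

odds_die  <- prob_to_odds(2 / 3)  # rolling a 3 or higher with one die
odds_coin <- prob_to_odds(1 / 2)  # fair coin landing on heads

print(c(die = odds_die, coin = odds_coin))
```

Odds above 1 mean the event is more likely to happen than not; odds of exactly 1 mean both outcomes are equally likely.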
Horse-racing and betting are where people commonly hear the term odds.
For example, a horse in a race has 9 to 1 odds of winning.
What is the probability this horse will win?
How do we calculate probability from odds?
\[ Probability = \frac{Odds}{1 + Odds} \]
In betting, 9 to 1 odds are quoted against winning (9 losses for every 1 win), so Odds of winning = \(\frac{1}{9}\)
\[ P(Win) = \frac{\frac{1}{9}}{1 + \frac{1}{9}} = \frac{\frac{1}{9}}{\frac{10}{9}} = \frac{1}{10} = 0.1 \]
Conclusion: A horse with 9 to 1 odds of winning has a probability of 0.1 or a 10% chance of winning the horse race.
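The same calculation can be sketched in R. The function name odds_to_prob is hypothetical; it simply applies Probability = Odds / (1 + Odds) from above.

```r
# Convert odds (odds >= 0) to a probability: p = odds / (1 + odds)
odds_to_prob <- function(odds) {
  stopifnot(all(odds >= 0))
  odds / (1 + odds)
}

# Horse quoted at 9 to 1: odds of winning are 1/9
p_win <- odds_to_prob(1 / 9)
print(p_win)
```

Any other odds value can be passed the same way, for example odds_to_prob(1 / 16).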
In Texas Hold’em, each player gets two face-down cards in their ‘pocket’. There are also face-up cards that everyone can use to make their best hand.
The odds of getting a ‘Pocket Pair’ (two face-down cards of the same value) are 16 to 1 against.
\[Odds(Pocket Pair) = \frac{1}{16}\]
What is the probability of a pocket pair? Round your answer to 3 decimal places
Recall:
\[Probability = \frac{Odds}{1 + Odds}\]
Probability is more intuitive
Odds, and more specifically, LN(Odds) are the magic LINK between a binary response and our predictor variables.
When we do Logistic Regression, the estimated response will be LN(Odds)
Just like we back transform LN(Y) to get Y, we can convert LN(Odds) to probabilities.
We DON’T want to know: the estimated LN(Odds) of an outcome.
We DO want to know: the estimated probability of that outcome.
We can convert the estimated log odds to a probability in Excel or R:
Y’ = Estimated Log Odds = Estimate of \(LN(\frac{P}{1-P})\) from model
Y’ is the model estimate that we want to convert to a probability
Est. Probability, \(P = \frac{e^{Y'}}{1+e^{Y'}}\)
This can be calculated in Excel or R using the exp function.
This conversion can be done more simply in R with the plogis function.
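A short R sketch of both conversions. The log odds values below are hypothetical, chosen only to illustrate the calculation. In Excel, the first version would be written with EXP, for example =EXP(A1)/(1+EXP(A1)) if the log odds were in cell A1 (an assumed cell reference).

```r
# Hypothetical estimated log odds (Y') chosen only for illustration
log_odds <- c(-2, 0, 1.5)

# Conversion with exp(): P = e^Y' / (1 + e^Y')
prob_exp <- exp(log_odds) / (1 + exp(log_odds))

# Same conversion with plogis(), the logistic distribution function in base R
prob_plogis <- plogis(log_odds)

print(round(cbind(log_odds, prob_exp, prob_plogis), 4))
```

Both columns should agree; log odds of 0 correspond to a probability of 0.5, negative log odds to probabilities below 0.5, and positive log odds to probabilities above 0.5.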
Converting Log Odds:
A logistic regression model predicts the log odds that a small business owner will be audited by the IRS based on relevant predictor variables.
Based on this model, a cafe in Syracuse determines that its estimated log odds of being audited in 2024 are -1.946.
If Estimated Log Odds = Y’ = -1.946, what is the probability that this cafe will be audited?
Recall: Est. Probability, \(P = \frac{e^{Y'}}{1+e^{Y'}}\)
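One way to carry out this conversion in R, shown as a sketch. The variable names are hypothetical; the intermediate odds step is included only to make the link between log odds, odds, and probability visible.

```r
y_hat <- -1.946                 # estimated log odds of an audit, from the example

odds_audit <- exp(y_hat)        # estimated odds = e^Y'
prob_audit <- plogis(y_hat)     # estimated probability = e^Y' / (1 + e^Y')

print(round(c(odds = odds_audit, probability = prob_audit), 3))
```

The probability can also be recovered from the odds as odds_audit / (1 + odds_audit), which gives the same value as plogis.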
The following examples will show how logistic regression uses LN(Odds) as a link function to model a two category (binary) response.
The model estimates, LN(Odds), can be converted to probabilities.
For HWs and quizzes and the final exam you will be expected to:
calculate odds from probabilities.
calculate probabilities from odds.
estimate probabilities from log odds estimates from a logistic regression model.
More complete version of Titanic passenger data than we worked with before.
All data are CATEGORICAL:
Response, Y, is Survived: Yes or No
Predictors, X variables, are:
Class (of Passenger ticket): First, Second, Third, Crew
Age (Category): Adult, Child
Gender: Male, Female
What is the estimated probability of survival based on gender, age category, and passenger class?
In order to use Logistic Regression the following conditions must be met:
Each category of each predictor variable has more than ONE observation in each response category
MOST categories have more than FIVE observations in each response category
Class | Age | Gender | Survived |
---|---|---|---|
Crew | Adult | Male | Yes |
First | Adult | Female | Yes |
Crew | Adult | Male | No |
Third | Adult | Male | No |
First | Adult | Male | No |
Crew | Adult | Male | No |
Third | Adult | Male | No |
First | Adult | Male | No |
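Survival counts by Class:
Survived | Crew | First | Second | Third |
---|---|---|---|---|
No | 673 | 122 | 167 | 528 |
Yes | 212 | 203 | 118 | 178 |

Survival counts by Age:
Survived | Adult | Child |
---|---|---|
No | 1438 | 52 |
Yes | 654 | 57 |

Survival counts by Gender:
Survived | Female | Male |
---|---|---|
No | 126 | 1364 |
Yes | 344 | 367 |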
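A sketch of how these counts and the two conditions could be checked in R. It assumes that R's built-in Titanic table can stand in for the course data file; that table labels the variables Class (1st, 2nd, 3rd, Crew), Sex, Age, and Survived, so the names differ slightly from the ones used in this lecture.

```r
# Assumption: the built-in Titanic table (a 4-way table of counts)
# stands in for the course data. Dimensions: 1 = Class, 2 = Sex, 3 = Age, 4 = Survived.
by_class  <- apply(Titanic, c(4, 1), sum)  # Survived x Class
by_age    <- apply(Titanic, c(4, 3), sum)  # Survived x Age
by_gender <- apply(Titanic, c(4, 2), sum)  # Survived x Sex (Gender)

print(by_class)
print(by_age)
print(by_gender)

# Condition check: smallest count in any predictor category by response category
all_counts <- c(by_class, by_age, by_gender)
print(min(all_counts))
print(all(all_counts > 5))
```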
The command we use in R, glm, is used for Generalized Linear Models.
For BUA 345, you are expected to interpret the coefficients from a logistic regression model.
Use provided code to get model estimates, LN(Odds)
Convert Estimated LN(Odds) to probabilities
In more advanced analytics courses, we will talk about model fit and validation.
Create a glm model named titanic_logistic
family=binomial(link = 'logit') specifies that the model has a binary (two-category) response
Output model results using the blr_regress command from the blorr package.
If the blorr package doesn’t work, an alternative command is summary
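A minimal sketch of these steps, not the course's provided code. It assumes R's built-in Titanic table in place of the course data file: the built-in data are aggregated counts (hence weights = Freq), the gender variable is named Sex, and the classes are labeled 1st, 2nd, 3rd, Crew. The relevel calls set the baselines to Crew, Adult, and Female so the coefficients are comparable to the lecture output.

```r
# Assumption: built-in Titanic counts stand in for the course data file
titanic <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq

# Set baseline (reference) categories: Crew, Female, Adult
titanic$Class <- relevel(titanic$Class, ref = "Crew")
titanic$Sex   <- relevel(titanic$Sex, ref = "Female")
titanic$Age   <- relevel(titanic$Age, ref = "Adult")

# Logistic regression: binary response Survived (No/Yes), logit link
titanic_logistic <- glm(Survived ~ Class + Age + Sex,
                        family = binomial(link = "logit"),
                        weights = Freq,   # each row represents Freq passengers
                        data = titanic)

summary(titanic_logistic)
# If blorr is installed, the same model can be passed to blr_regress:
# blorr::blr_regress(titanic_logistic)

print(round(coef(titanic_logistic), 4))
```

With these assumptions the coefficients are named Class1st, Class2nd, Class3rd, AgeChild, and SexMale rather than ClassFirst, ClassSecond, ClassThird, AgeChild, and GenderMale.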
Model Overview
Data Set | Resp Var | Obs. | Df. Model | Df. Residual | Convergence |
---|---|---|---|---|---|
data | SurvivedF | 2201 | 2200 | 2195 | TRUE |

Response Summary
Outcome | Frequency |
---|---|
0 | 1490 |
1 | 711 |

Maximum Likelihood Estimates
Parameter | DF | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|---|
(Intercept) | 1 | 1.1862 | 0.1586 | 7.4805 | 0.0000 |
ClassFirst | 1 | 0.8577 | 0.1573 | 5.4511 | 0.0000 |
ClassSecond | 1 | -0.1604 | 0.1738 | -0.9231 | 0.3560 |
ClassThird | 1 | -0.9201 | 0.1486 | -6.1923 | 0.0000 |
AgeChild | 1 | 1.0615 | 0.2440 | 4.3501 | 0.0000 |
GenderMale | 1 | -2.4201 | 0.1404 | -17.2358 | 0.0000 |

Association of Predicted Probabilities and Observed Responses
Statistic | Value | Statistic | Value |
---|---|---|---|
% Concordant | 0.6768 | Somers' D | 0.6227 |
% Discordant | 0.1574 | Gamma | 0.5195 |
% Tied | 0.1658 | Tau-a | 0.2273 |
Pairs | 1059390 | c | 0.7597 |
Part 2 of Output - Only Use Max. Likelihood Estimates in BUA 345
Class | Age | Gender | Survived | Log_Odds_Survival | Prob_Survival |
---|---|---|---|---|---|
Crew | Adult | Male | Yes | -1.2339 | 0.2255 |
First | Adult | Female | Yes | 2.0438 | 0.8853 |
Crew | Adult | Male | No | -1.2339 | 0.2255 |
Third | Adult | Male | No | -2.1540 | 0.1040 |
First | Adult | Male | No | -0.3762 | 0.4070 |
Crew | Adult | Male | No | -1.2339 | 0.2255 |
We can use an Excel spreadsheet (like we did for MLR) to estimate log odds and probabilities for each category
For example:
What is the probability that a Female Adult Crew member survived?
What is the probability that a Male Adult in Third Class survived?
What is the probability that a Male Child in First Class survived?
Recall:
Baseline categories (first alphabetically) are not shown.
Baseline Class: Crew
Baseline Age category: Adult
Baseline Gender: Female
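The spreadsheet calculation can also be sketched in R. The function name survival_prob is hypothetical; the coefficients are copied from the Maximum Likelihood Estimates shown earlier, and a baseline category adds nothing to the sum.

```r
# Maximum likelihood estimates copied from the model output
b <- c(Intercept = 1.1862, ClassFirst = 0.8577, ClassSecond = -0.1604,
       ClassThird = -0.9201, AgeChild = 1.0615, GenderMale = -2.4201)

# Baseline categories (Crew, Adult, Female) contribute 0 to the log odds
survival_prob <- function(class = "Crew", age = "Adult", gender = "Female") {
  log_odds <- b[["Intercept"]]
  if (class != "Crew")  log_odds <- log_odds + b[[paste0("Class", class)]]
  if (age == "Child")   log_odds <- log_odds + b[["AgeChild"]]
  if (gender == "Male") log_odds <- log_odds + b[["GenderMale"]]
  c(log_odds = log_odds, prob = plogis(log_odds))
}

# Male Adult Crew member: compare with the Crew | Adult | Male rows of the table
print(round(survival_prob("Crew", "Adult", "Male"), 4))

# Female Adult Crew member: all baselines, so log odds = intercept
print(round(survival_prob("Crew", "Adult", "Female"), 4))
```

Other combinations follow the same pattern, for example survival_prob("Third", "Adult", "Male").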
Question 6
Based on the Titanic Logistic Regression Model, what is the probability that a Male Adult in Third Class survived?
Use provided worksheet to do calculation. Round answer to 3 decimal places.
Question 7
Based on the Titanic Logistic Regression Model, what is the probability that a Male Child in First Class survived?
Use provided worksheet to do calculation. Round answer to 3 decimal places.
Age | Dependents | Debt_Ratio | Late_Payment |
---|---|---|---|
57 | 0 | 0.39 | No |
34 | 0 | 0.51 | Yes |
42 | 0 | 0.48 | No |
34 | 0 | 0.24 | No |
66 | 0 | 0.15 | No |
53 | 0 | 0.11 | No |
50 | 2 | 0.31 | No |
46 | 1 | 0.35 | No |
29 | 0 | 0.25 | No |
57 | 0 | 0.08 | No |
All correlations between X variables are \(\lt 0.8\)
Create a glm model named latepmt_logistic
Output model results using blr_regress (blorr package) or summary
Part 2 of Output (Only Use Max. Likelihood Estimates in BUA 345)
The Estimate column shows the estimated beta coefficients for our model.
We will use these beta coefficients in Excel to show how regression estimates of log odds and probability are calculated for each observation (consumer) in the data.
The Pr(>|z|) column shows the P-value for each beta coefficient.
Age | Dependents | Debt_Ratio | Late_Payment | Log_Odds_LP | Prob_LP |
---|---|---|---|---|---|
57 | 0 | 0.39 | No | -1.6849 | 0.1564 |
34 | 0 | 0.51 | Yes | -1.3017 | 0.2139 |
42 | 0 | 0.48 | No | -1.4215 | 0.1944 |
34 | 0 | 0.24 | No | -1.6111 | 0.1664 |
66 | 0 | 0.15 | No | -2.0561 | 0.1134 |
53 | 0 | 0.11 | No | -1.9631 | 0.1231 |
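As a quick check on the table above, the Prob_LP column can be recovered from the Log_Odds_LP column with plogis. This sketch only re-enters the six rows shown; the vector names are hypothetical.

```r
# Values copied from the Log_Odds_LP and Prob_LP columns of the table
log_odds_lp <- c(-1.6849, -1.3017, -1.4215, -1.6111, -2.0561, -1.9631)
prob_lp     <- c(0.1564, 0.2139, 0.1944, 0.1664, 0.1134, 0.1231)

# Convert each estimated log odds to an estimated probability
prob_check <- plogis(log_odds_lp)

print(round(cbind(log_odds_lp, prob_lp, prob_check), 4))
```

Small differences in the last decimal place are expected because the table values are rounded.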
Late Payment Model Interpretation
We can use an Excel spreadsheet (like we did for MLR) to estimate log odds and probabilities for any combination of predictor values
For example:
What is the probability of a late payment for a 30 year old with 1 dependent and a debt ratio of 0.3?
What would be the Percent Change in debt ratio if another 30 year old had 4 dependents and debt ratio of 0.5? (HW 8)
Use provided worksheet to do calculation. Round answer to 3 decimal places.
Logistic Regression is useful for predicting outcomes
Helpful in decision making - provides probabilities
Underlying math of GLM is more complex
Software (such as R) allows students to bypass math and focus on interpretation
Important to understand
Difference between Odds and Probabilities
How to convert odds to probabilities and vice versa
How to convert log odds to probabilities (you can use the plogis function in R)
To submit an Engagement Question or Comment about material from Lecture 18: Submit it by midnight today (day of lecture).