2024-03-21
HW 7 was due yesterday (3/20)
HW 8 (Parts 1 and 2) are posted and due on Monday (3/25)
Part 1 of HW 8 pertained to Lectures 15 - 17
Part 2 of HW 8 pertains to today’s lecture on Logistic Regression
Quiz 2 is Thursday, March 28th
There will be an asynchronous option.
Practice Questions will be posted this weekend.
NEW PACKAGE FOR LOGISTIC REGRESSION: blorr
Session ID: bua345s24
Review Question from Lecture 17 and HW 8.
In Lecture 17, we discussed three fit statistics, measures of how well the model ‘fits’ the data.
Fill in the blank(s) using the correct choice. Ideally our chosen model has the:
____ Adjusted \(R^2\)
____ Mallow's C(p)
____ AIC
Up until now in MAS 261 and BUA 345:
All of our regression models (SLR and MLR) have had a response, Y, that was QUANTITATIVE and ideally normal (or transformed to be normal):
amount of sleep
selling price
natural log of insurance charges
etc.
Today we’ll model responses (y variables) that are CATEGORICAL and BINARY, such as
yes or no
survive or not
disease or no disease
LOGISTIC REGRESSION is used when your response is BINARY (has two categories)
This is one type of generalization of the linear model format you’ve already learned.
There are MANY other model generalizations for other data types (not covered in BUA 345).
These models are called GENERALIZED LINEAR MODELS (GLM):
We can’t model the dichotomous (two-category) response directly using the linear model.
Instead, we ‘LINK’ our binary response to our predictor variables using a link function.
The underlying math is interesting, but not needed for this course.
In order to understand logistic regression models, it is helpful to know:
The differences between Probability and Odds
How to convert Probability to Odds and vice versa
How to convert Log Odds to probability using the `exp` or `plogis` function
We cover these concepts FIRST, because:
Log Odds, LN(ODDS), is the link (link function)
This function LINKS our two category response, Y, to our predictor (X) variables:
If we understand this link, then we can understand and interpret Logistic Regression analyses.
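In symbols, the model has the same linear form as MLR, but on the log odds scale (general form with k predictors):

\[ LN\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \]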
Odds are often used incorrectly!
This news item indicates that the PROBABILITY of survival is 95.7%
Odds of Survival \(≠\) Prob. of Survival = P(Survival)
\[ Odds(Survival) = \frac{P(Survival)}{1-P(Survival)} = ? \]
Note: This question is in HW 8
The probability of any side of the die is 1/6:
\[P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 \]
The Odds of any side, e.g., rolling a 2, are as follows:
\[ Odds(2) = \frac{P(2)}{1-P(2)} = \frac{\frac{1}{6}}{1-\frac{1}{6}} = \frac{\frac{1}{6}}{\frac{5}{6}} = \frac{1}{5} \]
Session ID: bua345s24
Question 2
The probability (p) of rolling a 3 or higher with a single die is 4/6, or 2/3, i.e., \(P(3,4,5,6) = \frac{2}{3}\).
What are the ODDS of rolling a 3 or higher?
Hint: It is possible for odds to be greater than 1.
Question 3
If you flip a fair coin, the probability of heads = probability of tails = 1/2, i.e., \(P(Heads) = P(Tails) = \frac{1}{2}\)
What are the ODDS of a coin landing on heads?
Horse-racing and betting are where people commonly hear the term odds.
For example, a horse in a race has 9 to 1 odds of winning.
What is the probability this horse will win?
How do we calculate probability from odds?
\[ Probability = \frac{Odds}{1 + Odds} \]
In betting, 9 to 1 odds are quoted against the event, so 9 to 1 odds of winning means Odds(Win) = \(\frac{1}{9}\)
\[ P(Win) = \frac{\frac{1}{9}}{1 + \frac{1}{9}} = \frac{\frac{1}{9}}{\frac{10}{9}} = \frac{1}{10} = 0.1 \]
Conclusion: A horse with 9 to 1 odds of winning has a probability of 0.1 or a 10% chance of winning the horse race.
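Here is a minimal R sketch of these two conversions (the function names odds_from_prob and prob_from_odds are just illustrative, not from any package):

```r
# convert a probability to odds: Odds = P / (1 - P)
odds_from_prob <- function(p) p / (1 - p)

# convert odds to a probability: P = Odds / (1 + Odds)
prob_from_odds <- function(odds) odds / (1 + odds)

odds_from_prob(1/6)  # odds of rolling a 2 with a fair die: 1/5 = 0.2
prob_from_odds(1/9)  # probability that a horse with 9 to 1 odds wins: 0.1
```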
Session ID: bua345s24
In Texas Hold’em, each player gets two face-down cards in their ‘pocket’. There are also three face-up cards that everyone can use to make their best hand.
The odds of getting a ‘Pocket Pair’ (two face-down cards of the same value) are 16 to 1.
\[Odds(\text{Pocket Pair}) = \frac{1}{16}\]
What is the probability of a pocket pair? Round your answer to 3 decimal places
Recall:
\[Probability = \frac{Odds}{1 + Odds}\]
Probability is more intuitive
Odds, and more specifically, LN(Odds) are the magic LINK between a binary response and our predictor variables.
When we do Logistic Regression, the estimated response will be LN(Odds)
Just like we back transform LN(Y) to get Y, we can convert LN(Odds) to probabilities.
We DON’T want to know: the estimated log odds (not intuitive on their own).
We DO want to know: the estimated probability of the outcome.
We can convert the estimated log odds to a probability in Excel or R:
Y’ = Estimated Log Odds = Estimate of \(LN(\frac{P}{1-P})\) from model
Y’ is the model estimate that we want to convert to a probability
Est. Probability, \(P = \frac{e^{Y'}}{1+e^{Y'}}\)
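This formula comes from undoing the log and solving for P:

\[ Y' = LN\left(\frac{P}{1-P}\right) \;\Rightarrow\; e^{Y'} = \frac{P}{1-P} \;\Rightarrow\; P = \frac{e^{Y'}}{1+e^{Y'}} \]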
This can be calculated in Excel or R using the `exp` function.
This conversion can be done more simply in R with the `plogis` function.
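For example, both approaches give the same answer in R (using an illustrative log odds value, not one from the homework):

```r
y_prime <- 0.5                     # an illustrative estimated log odds value
exp(y_prime) / (1 + exp(y_prime))  # manual back-transform: 0.6224593
plogis(y_prime)                    # same conversion with plogis: 0.6224593
```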
Session ID: bua345s24
Converting Log Odds:
A logistic regression model predicts the log odds that a small business owner will be audited by the IRS based on relevant predictor variables.
Based on this model, a cafe in Syracuse determines that the estimated log odds of being audited in 2024 is -1.946, Y’ = -1.946
If Estimated Log Odds = Y’ = -1.946, what is the probability that this cafe will be audited?
Recall: Est. Probability, \(P = \frac{e^{Y'}}{1+e^{Y'}}\)
The following examples will show how logistic regression uses LN(Odds) as a link function to model a two category (binary) response.
The model estimates, LN(Odds), can be converted to probabilities.
For HWs and quizzes and the final exam you will be expected to:
calculate odds from probabilities.
calculate probabilities from odds.
estimate probabilities from log odds estimates from a logistic regression model.
More complete version of Titanic passenger data than we worked with before.
All data are CATEGORICAL:
Response, Y, is Survived: Yes or No
Predictors, X variables, are:
Class (of Passenger ticket): First, Second, Third, Crew
Age (Category): Adult, Child
Gender: Male, Female
What is the estimated probability of survival based on gender, age category, and passenger class?
In order to use Logistic Regression, the following conditions must be met:
Each category of each predictor variable has more than ONE observation in each response category
MOST categories have more than FIVE observations in each response category
# import and examine data
titanic <- read_csv("data/titanic.csv", show_col_types = F) |>
glimpse(width=60)
Rows: 2,201
Columns: 4
$ Class <chr> "Crew", "First", "Crew", "Third", "First"…
$ Age <chr> "Adult", "Adult", "Adult", "Adult", "Adul…
$ Gender <chr> "Male", "Female", "Male", "Male", "Male",…
$ Survived <chr> "Yes", "Yes", "No", "No", "No", "No", "No…
Survived by Class:

Survived | Crew | First | Second | Third |
---|---|---|---|---|
No | 673 | 122 | 167 | 528 |
Yes | 212 | 203 | 118 | 178 |

Survived by Age:

Survived | Adult | Child |
---|---|---|
No | 1438 | 52 |
Yes | 654 | 57 |

Survived by Gender:

Survived | Female | Male |
---|---|---|
No | 126 | 1364 |
Yes | 344 | 367 |
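The three tables above can be reproduced with R's `table` function (the exact code isn't shown in the slides; a minimal sketch):

```r
# cross-tabulate the response with each categorical predictor
with(titanic, table(Survived, Class))
with(titanic, table(Survived, Age))
with(titanic, table(Survived, Gender))
```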
The command we use in R, `glm`, is used for Generalized Linear Models.
For BUA 345, you are expected to interpret the coefficients from a logistic regression model.
Use provided code to get model estimates, LN(Odds)
Convert Estimated LN(Odds) to probabilities
In more advanced analytics courses, we will talk about model fit and validation.
Create a `glm` model named `titanic_logistic`
`family=binomial(link = 'logit')` specifies that the model has a binary (two-category) response
Output model results using the `blr_regress` command from the `blorr` package.
If the `blorr` package doesn't work, an alternative command is `summary`.
titanic <- titanic |> mutate(SurvivedF = factor(Survived)) # create factor variable
titanic_logistic <- glm(SurvivedF ~ Class + Age + Gender, data=titanic, # specify model
family=binomial(link = 'logit'))
blr_regress(titanic_logistic) # examine model output
# summary(titanic_logistic) # alternative if blr_regress doesn't work
Part 1 of Output (Not Essential for BUA 345)
Part 2 of Output (Only Use Max. Likelihood Estimates in BUA 345)
The `Estimate` column shows the estimated beta coefficients for our model.
The `Pr(>|z|)` column shows the P-value for each beta coefficient.
titanic <- titanic |> # create a new variable that shows log odds of survival for each passenger
  mutate(Log_Odds_Survival = titanic_logistic |> predict.glm() |> round(4))
# convert log odds to probabilities using plogis command
titanic_table <- titanic |> mutate(Prob_Survival = Log_Odds_Survival |> plogis() |> round(4)) |>
select(!SurvivedF)
head(titanic_table) |> kable()
Class | Age | Gender | Survived | Log_Odds_Survival | Prob_Survival |
---|---|---|---|---|---|
Crew | Adult | Male | Yes | -1.2339 | 0.2255 |
First | Adult | Female | Yes | 2.0438 | 0.8853 |
Crew | Adult | Male | No | -1.2339 | 0.2255 |
Third | Adult | Male | No | -2.1540 | 0.1040 |
First | Adult | Male | No | -0.3762 | 0.4070 |
Crew | Adult | Male | No | -1.2339 | 0.2255 |
We can use an Excel spreadsheet (like we did for MLR), or R code like the sketch below, to estimate log odds and probabilities for each category
For example:
What is the probability that a Female Adult Crew member survived?
What is the probability that a Male Adult in Third Class survived?
What is the probability that a Male Child in First Class survived?
Recall:
Baseline categories (first alphabetically) are not shown.
Baseline Class: Crew
Baseline Age category: Adult
Baseline Gender: Female
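As an alternative check on the spreadsheet, the same estimate can be obtained in R with `predict` on the fitted `titanic_logistic` model; a sketch for the Female Adult Crew profile from the first example above:

```r
# estimated log odds and probability of survival for one passenger profile
new_passenger <- data.frame(Class = "Crew", Age = "Adult", Gender = "Female")
log_odds <- predict(titanic_logistic, newdata = new_passenger)  # default type = "link" gives log odds
plogis(log_odds)                                                # convert log odds to probability
```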
Session ID: bua345s24
Question 6
Based on the Titanic Logistic Regression Model, what is the probability that a Male Adult in Third Class survived?
Use provided worksheet to do calculation. Round answer to 3 decimal places.
Question 7
Based on the Titanic Logistic Regression Model, what is the probability that a Male Child in First Class survived?
Use provided worksheet to do calculation. Round answer to 3 decimal places.
Logistic Regression with Quantitative Predictors
Data from 7,550 consumers are used to create a logistic regression model.
The goal is to estimate the probability of a late payment on a credit card.
Predictor (X) variables are Age, Dependents (number of dependents), and Debt_Ratio, all quantitative.
All correlations between X variables are \(\lt 0.8\)
Create a `glm` model named `latepmt_logistic`
Output model results using `blr_regress` (`blorr` package) or `summary`
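The code for this model isn't shown in the slides; here is a minimal sketch that mirrors the Titanic code, assuming the data frame is named `late_payment` with the column names shown in the table below:

```r
# a sketch: fit the late payment logistic regression model
late_payment <- late_payment |> mutate(Late_PaymentF = factor(Late_Payment))  # create factor response
latepmt_logistic <- glm(Late_PaymentF ~ Age + Dependents + Debt_Ratio,        # specify model
                        data = late_payment, family = binomial(link = 'logit'))
blr_regress(latepmt_logistic)  # examine model output
# summary(latepmt_logistic)    # alternative if blr_regress doesn't work
```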
Part 1 of Output (Not Essential for BUA 345)
Part 2 of Output (Only Use Max. Likelihood Estimates in BUA 345)
The `Estimate` column shows the estimated beta coefficients for our model.
The `Pr(>|z|)` column shows the P-value for each beta coefficient.
late_payment <- late_payment |> # create a new variable that shows the log odds of a late payment for each consumer
  mutate(Log_Odds_LP = latepmt_logistic |> predict.glm() |> round(4))
# convert log odds to probabilities using plogis command
late_payment_table <- late_payment |> mutate(Prob_LP = Log_Odds_LP |> plogis() |> round(4)) |>
select(!Late_PaymentF)
head(late_payment_table) |> kable()
Age | Dependents | Debt_Ratio | Late_Payment | Log_Odds_LP | Prob_LP |
---|---|---|---|---|---|
57 | 0 | 0.39 | No | -1.6849 | 0.1564 |
34 | 0 | 0.51 | Yes | -1.3017 | 0.2139 |
42 | 0 | 0.48 | No | -1.4215 | 0.1944 |
34 | 0 | 0.24 | No | -1.6111 | 0.1664 |
66 | 0 | 0.15 | No | -2.0561 | 0.1134 |
53 | 0 | 0.11 | No | -1.9631 | 0.1231 |
Session ID: bua345s24
Late Payment Model Interpretation
We can use an Excel spreadsheet (like we did for MLR), or R code like the sketch below, to estimate log odds and probabilities for any combination of predictor values
For example:
What is the probability of a late payment for a 30 year old with 1 dependent and a debt ratio of 0.3?
What would be the percent change in the estimated probability of a late payment if another 30 year old had 4 dependents and a debt ratio of 0.5? (HW 8)
Use provided worksheet to do calculation. Round answer to 3 decimal places.
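As with the Titanic model, `predict` can do the same calculation in R; a sketch using made-up values (a 45 year old with 2 dependents and a debt ratio of 0.40), not the homework values:

```r
# estimated log odds and probability of a late payment for one hypothetical consumer
new_consumer <- data.frame(Age = 45, Dependents = 2, Debt_Ratio = 0.40)
log_odds <- predict(latepmt_logistic, newdata = new_consumer)  # default type = "link" gives log odds
plogis(log_odds)                                               # convert log odds to probability
```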
HW 7 was due yesterday (3/20)
HW 8 (Parts 1 and 2) are posted and due on Monday (3/25)
Part 1 of HW 8 pertained to Lectures 15 - 17
Part 2 of HW 8 pertains to today’s lecture on Logistic Regression
Quiz 2 is Thursday, March 28th
There will be an asynchronous option.
Practice Questions will be posted this weekend.
Logistic Regression is useful for predicting outcomes
Helpful in decision making - provides probabilities
Underlying math of GLM is more complex
Software (such as R) allows students to bypass math and focus on interpretation
Important to understand
Difference between Odds and Probabilities
How to convert odds to probabilities and vice versa
How to convert log odds to probabilities (you can use the `plogis` function in R)
To submit an Engagement Question or Comment about material from Today’s Lecture: Submit by midnight today (day of lecture). Click on Link next to the ❓ under today’s lecture.