Lecture 18 - Logistic Regression

Penelope Pooler Eisenbies
BUA 345

2024-03-21

Housekeeping

Upcoming Dates

  • HW 7 was due Yesterday (3/20)

  • HW 8 (Parts 1 and 2) are posted and due on Monday (3/25)

    • Part 1 of HW 8 pertained to Lectures 15 - 17

    • Part 2 of HW 8 pertains to today’s lecture on Logistic Regression

  • Quiz 2 is Thursday, March 28th

    • There will be an asynchronous option.

    • Practice Questions will be posted this weekend.

  • NEW PACKAGE FOR LOGISTIC REGRESSION: blorr

    • If you are having trouble installing/loading any packages or components of R or RStudio, please come to office hours or make an appointment with me.

💥 Lecture 18 In-class Exercises - Q1 - Review 💥

Session ID: bua345s24

Review Question from Lecture 17 and HW 8.

In Lecture 17, we discussed three fit statistics, measures of how well the model ‘fits’ the data.

Fill in the blank(s) using the correct choice. Ideally our chosen model has the

  • ____ Adjusted \(R^2\)

  • ____ Mallows’ C(p)

  • ____ AIC

Models we’ve covered so far

  • Up until now in MAS 261 and BUA 345:

    • All of our regression models (SLR and MLR) have had a response, Y, that was QUANTITATIVE and ideally normal (or transformed to be normal):

      • amount of sleep

      • selling price

      • natural log of insurance charges

      • etc.

How Logistic Regression is different

  • Today we’ll model responses (y variables) that are CATEGORICAL and BINARY, such as

    • yes or no

    • survive or not

    • disease or no disease

  • LOGISTIC REGRESSION is used when your response is BINARY (has two categories)

    • This is one type of generalization of the linear model format you’ve already learned.

    • There are MANY other model generalizations for other data types (not covered in BUA 345).

    • These models are called GENERALIZED LINEAR MODELS (GLM):

      • We are relaxing the assumption that the response, Y, is quantitative and normal.

Logistic Regression Models

  • We can’t model the dichotomous (two-category) response directly using the linear model.

  • Instead, we ‘LINK’ our binary response to our predictor variables using a link function.

  • The underlying math is interesting, but not needed for this course.

Probability and Odds

  • In order to understand logistic regression models, it is helpful to know:

    • The differences between Probability and Odds

    • How to convert Probability to Odds and vice versa

    • How to convert Log Odds to Probability using the exp OR plogis function

  • We cover these concepts FIRST, because:

    • Log Odds, LN(ODDS), is the link (link function)

    • This function LINKS our two-category response, Y, to our predictor (X) variables.

  • If we understand this link, then we can understand and interpret Logistic Regression analyses.
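
Written out in general form, the link sets LN(Odds) equal to the familiar linear combination of predictors. The symbols \(\beta_0, \ldots, \beta_k\) (coefficients) and \(X_1, \ldots, X_k\) (predictors) are notation assumed here for illustration, not taken from the slides:

\[ LN(Odds) = LN\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k \]

The right-hand side has the same form as an MLR model; only the left-hand side changes from Y to the log odds of the response category being modeled.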

Probability and Odds are NOT the same!

Odds are often used incorrectly!

This news item indicates that the PROBABILITY of survival is 95.7%

Odds of Survival \(≠\) Prob. of Survival = P(Survival)


\[ Odds(Survival) = \frac{P(Survival)}{1-P(Survival)} = ? \]


Note: This question is in HW 8

Dice Example

The probability of rolling any one side of a fair die is 1/6:

\[P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 \]


The Odds of rolling any one side, e.g., a 2, are as follows:

\[ Odds(2) = \frac{P(2)}{1-P(2)} = \frac{\frac{1}{6}}{1-\frac{1}{6}} = \frac{\frac{1}{6}}{\frac{5}{6}} = \frac{1}{5} \]
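
The same calculation can be sketched in R. This is a minimal illustration; prob_to_odds is a helper name made up for this example, not a built-in R function.

```r
# Hypothetical helper: convert a probability to odds
prob_to_odds <- function(p) {
  p / (1 - p)
}

p_two    <- 1 / 6                  # probability of rolling a 2 with a fair die
odds_two <- prob_to_odds(p_two)    # (1/6) / (5/6)
odds_two                           # 0.2, i.e. 1/5
```

Because the helper is vectorized, it also accepts several probabilities at once, e.g. prob_to_odds(c(0.1, 0.25)).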

💥 Lecture 18 In-class Exercises - Q2 and Q3💥

Question 2

The probability(p) of rolling a 3 or higher with a single die is 4/6 or 2/3, i.e., \(P(3,4,5,6) = \frac{2}{3}\).

What are the ODDS of rolling a 3 or higher?

Hint: It is possible for odds to be greater than 1.


Question 3

If you flip a fair coin, the probability of heads = probability of tails = 1/2, i.e., \(P(Heads) = P(Tails) = \frac{1}{2}\)

What are the ODDS of a coin landing on heads?

Odds and Betting

  • Horse-racing and betting are where people commonly hear the term odds.

  • For example, a horse in a race has 9 to 1 odds of winning.

  • What is the probability this horse will win?

How do we calculate probability from odds?

\[ Probability = \frac{Odds}{1 + Odds} \]

In betting, 9 to 1 odds are quoted AGAINST winning (9 losses expected for every 1 win), so the Odds of winning = \(\frac{1}{9}\)

\[ P(Win) = \frac{\frac{1}{9}}{1 + \frac{1}{9}} = \frac{\frac{1}{9}}{\frac{10}{9}} = \frac{1}{10} = 0.1 \]

Conclusion: A horse with 9 to 1 odds of winning has a probability of 0.1 or a 10% chance of winning the horse race.
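
The conversion in the other direction can be sketched the same way in R. The helper name odds_to_prob is made up for this example.

```r
# Hypothetical helper: convert odds to a probability
odds_to_prob <- function(odds) {
  odds / (1 + odds)
}

odds_win <- 1 / 9                  # odds of winning for the horse in the example
p_win    <- odds_to_prob(odds_win)
p_win                              # 0.1

# Round trip: converting the probability back to odds recovers 1/9
p_win / (1 - p_win)
```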

💥 Lecture 18 In-class Exercises - Q4 💥

In Texas Hold’em, each player gets two face-down cards in their ‘pocket’. There are also three face-up cards that everyone can use to make their best hand.

The odds of getting a ‘Pocket Pair’ (two face-down cards of the same value) are 16 to 1.

\[Odds(\text{Pocket Pair}) = \frac{1}{16}\]

What is the probability of a pocket pair? Round your answer to 3 decimal places

Recall:

\[Probability = \frac{Odds}{1 + Odds}\]

Why Odds are useful to Logistic Regression

  • Probability is more intuitive

  • Odds, and more specifically, LN(Odds) are the magic LINK between a binary response and our predictor variables.

  • When we do Logistic Regression, the estimated response will be LN(Odds)

  • Just like we back transform LN(Y) to get Y, we can convert LN(Odds) to probabilities.

Why back-transform Log-odds

  • We DON’T want to know:

    • the estimated log odds of winning a game
    • the estimated log odds that it will rain tomorrow
  • We DO want to know:

    • the estimated PROBABILITY of winning a game
    • the estimated PROBABILITY that it will rain tomorrow

Back-Transforming Log-Odds in Excel and R

We can convert the estimated log odds to a probability in Excel or R:

  • Y’ = Estimated Log Odds = Estimate of \(LN(\frac{P}{1-P})\) from model

  • Y’ is the model estimate that we want to convert to a probability

  • Est. Probability, \(P = \frac{e^{Y'}}{1+e^{Y'}}\)

  • This can be calculated in Excel or R using the exp function.

  • This conversion can be done more simply in R with the plogis function
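
Both routes can be sketched in R as follows. The log odds values are hypothetical, chosen only for illustration and not taken from any model in this lecture. In Excel, the equivalent formula for a value stored in a cell (say A1, an assumed location) would be =EXP(A1)/(1+EXP(A1)).

```r
# Hypothetical log odds values (illustration only, not model output)
log_odds <- c(-2, 0, 1.5)

# Route 1: the formula P = e^Y' / (1 + e^Y') written with exp
prob_exp <- exp(log_odds) / (1 + exp(log_odds))

# Route 2: plogis applies the same conversion in one call
prob_plogis <- plogis(log_odds)

prob_exp
prob_plogis
all.equal(prob_exp, prob_plogis)   # both routes should agree

# qlogis goes the other way: probability back to log odds
qlogis(prob_plogis)
```

Log odds of 0 correspond to odds of 1 and therefore to a probability of 0.5; negative log odds give probabilities below 0.5 and positive log odds give probabilities above 0.5.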

💥 Lecture 18 In-class Exercises - Q5 💥

Converting Log Odds:

  • A logistic regression model predicts the log odds that a small business owner will be audited by the IRS based on relevant predictor variables.

  • Based on this model, a cafe in Syracuse determines that its estimated log odds of being audited in 2024 are -1.946 (Y’ = -1.946).

If Estimated Log Odds = Y’ = -1.946, what is the probability that this cafe will be audited?

Recall: Est. Probability, \(P = \frac{e^{Y'}}{1+e^{Y'}}\)

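ln_odds <- -1.946   # Y' = estimated log odds from the audit model

# using exp function (same as Excel)
exp(ln_odds) / (1 + exp(ln_odds))

# using plogis function (only in R)
plogis(ln_odds)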

Logistic Regression in R

The following examples will show how logistic regression uses LN(Odds) as a link function to model a two category (binary) response.

  • The model estimates, LN(Odds), can be converted to probabilities

  • For HWs and quizzes and the final exam you will be expected to:

    • calculate odds from probabilities.

    • calculate probabilities from odds.

    • estimate probabilities from log odds estimates from a logistic regression model.

Titanic data

More complete version of Titanic passenger data than we worked with before.

All data are CATEGORICAL:

  • Response, Y, is Survived: Yes or No

  • Predictors, X variables, are:

    • Class (of Passenger ticket): First, Second, Third, Crew

    • Age (Category): Adult, Child

    • Gender: Male, Female

What is the estimated probability of survival based on gender, age category, and passenger class?

Verifying Data for Logistic Regression

In order to use Logistic Regression the following conditions must be met:

  • Each category of each predictor variable has more than ONE observation in each response category

  • MOST categories have more than FIVE observations in each response category

library(tidyverse)  # provides read_csv, mutate, select, glimpse

# import and examine data
titanic <- read_csv("data/titanic.csv", show_col_types = FALSE) |>
  glimpse(width=60)
Rows: 2,201
Columns: 4
$ Class    <chr> "Crew", "First", "Crew", "Third", "First"…
$ Age      <chr> "Adult", "Adult", "Adult", "Adult", "Adul…
$ Gender   <chr> "Male", "Female", "Male", "Male", "Male",…
$ Survived <chr> "Yes", "Yes", "No", "No", "No", "No", "No…

Verifying Data for Logistic Regression

  • These tables show that all categories of each predictor variable have some ‘Yes’ and some ‘No’ observations.
  • All categories are represented in both categories of response, so all predictors can be used in model.
titanic |> select(Survived, Class) |> table()   # examine data by class category
        Class
Survived Crew First Second Third
     No   673   122    167   528
     Yes  212   203    118   178
titanic |> select(Survived, Age) |> table()     # examine data by age category
        Age
Survived Adult Child
     No   1438    52
     Yes   654    57
titanic |> select(Survived, Gender) |> table()  # examine data by gender
        Gender
Survived Female Male
     No     126 1364
     Yes    344  367
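
The two conditions can also be checked programmatically. The sketch below re-enters the Survived-by-Class counts from the table above by hand so that it runs on its own; check_cells is a helper name made up here, and reading "MOST" as "more than half of the cells" is an assumption, not a rule stated in the lecture.

```r
# Survived-by-Class counts copied from the table above
class_counts <- matrix(c(673, 122, 167, 528,
                         212, 203, 118, 178),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Survived = c("No", "Yes"),
                                       Class = c("Crew", "First", "Second", "Third")))

# Hypothetical helper: check both cell-count conditions for one predictor
check_cells <- function(counts) {
  c(all_above_one   = all(counts > 1),         # every cell has more than ONE observation
    most_above_five = mean(counts > 5) > 0.5)  # assumed meaning of "MOST": over half the cells
}

check_cells(class_counts)
```

With the data loaded, the same helper could be applied to table(titanic$Survived, titanic$Class) and to the corresponding Age and Gender tables.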

Specifying the Logistic Regression Model

The R command we use, glm, fits Generalized Linear Models.

  • For BUA 345, you are expected to interpret the coefficients from a logistic regression model.

    • Use provided code to get model estimates, LN(Odds)

    • Convert Estimated LN(Odds) to probabilities

    • In more advanced analytics courses, we will talk about model fit and validation.

Steps for Specifying and Displaying the Model


  1. Create a glm model named titanic_logistic

    • family=binomial(link = 'logit') specifies the model has a binary (two-category) response


  2. Output model results using the blr_regress command from the blorr package.

    • If the blorr package doesn’t work, an alternative command is summary

R code to Specify and Summarize Titanic Model

library(blorr)  # provides blr_regress

titanic <- titanic |> mutate(SurvivedF = factor(Survived))         # create factor variable

titanic_logistic <- glm(SurvivedF ~ Class + Age + Gender, data=titanic, # specify model
                        family=binomial(link = 'logit'))

blr_regress(titanic_logistic) # examine model output
# summary(titanic_logistic)   # alternative if blr_regress doesn't work
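
To see what family=binomial(link = 'logit') contributes, the family object can be inspected on its own. This sketch fits no model; fam_default and fam_explicit are names made up here. In base R, logit is the default link of the binomial family, so the two objects below should describe the same link.

```r
fam_default  <- binomial()                 # link left at its default
fam_explicit <- binomial(link = "logit")   # link spelled out, as in the lecture code

fam_default$link                                  # name of the link function
identical(fam_default$link, fam_explicit$link)    # do both use the same link?

# linkfun converts a probability to log odds; linkinv converts log odds back
fam_default$linkfun(0.5)   # log odds for a probability of 0.5
fam_default$linkinv(0)     # probability for log odds of 0
```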

Part 1 of Output (Not Essential for BUA 345)

Part 2 of Output (Only Use Max. Likelihood Estimates in BUA 345)

  • Estimate column shows the estimated beta coefficients for our model.

    • We will use these beta coefficients in Excel to show how regression estimates of log odds and probability are calculated for each observation (passenger) in the data.
  • Pr(>|z|) column shows the P-value for each beta coefficient.

Add Log Odds and Probabilities to Titanic Data

titanic <- titanic |> # create a new variable that shows log odds of survival for each passenger
  mutate(Log_Odds_Survival = predict(titanic_logistic) |> round(4)) # default type = "link" returns log odds

# convert log odds to probabilities using plogis command
titanic_table <- titanic |> mutate(Prob_Survival = Log_Odds_Survival |> plogis() |> round(4)) |>
  select(!SurvivedF)

head(titanic_table) |> kable() # kable is from the knitr package
Class  Age    Gender  Survived  Log_Odds_Survival  Prob_Survival
Crew   Adult  Male    Yes                 -1.2339         0.2255
First  Adult  Female  Yes                  2.0438         0.8853
Crew   Adult  Male    No                  -1.2339         0.2255
Third  Adult  Male    No                  -2.1540         0.1040
First  Adult  Male    No                  -0.3762         0.4070
Crew   Adult  Male    No                  -1.2339         0.2255
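
The table rows can be checked by hand. The sketch below re-enters the log odds and probabilities from the six rows shown above and confirms that plogis reproduces the probability column. It also recovers the estimated Male effect from two rows that differ only in Gender, which works under the assumption that the model has no interaction terms (consistent with the formula SurvivedF ~ Class + Age + Gender).

```r
# Values copied from the six table rows above
log_odds   <- c(-1.2339, 2.0438, -1.2339, -2.1540, -0.3762, -1.2339)
prob_table <- c( 0.2255, 0.8853,  0.2255,  0.1040,  0.4070,  0.2255)

# plogis should reproduce the Prob_Survival column (to 4 decimals)
prob_check <- plogis(log_odds)
round(prob_check, 4)

# Rows 2 and 5 are both First Class Adults and differ only in Gender,
# so the difference in log odds is the estimated Male effect
male_effect <- log_odds[5] - log_odds[2]
male_effect        # about -2.42
exp(male_effect)   # corresponding odds ratio, Male vs. Female
```

An odds ratio below 1 means the estimated odds of survival for males are lower than for otherwise identical females.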

Titanic Model Interpretation

We can use an Excel spreadsheet (like we did for MLR) to estimate log odds and probabilities for each category

  • For example:

    • What is the probability that a Female Adult Crew member survived?

    • What is the probability that a Male Adult in Third Class survived?

    • What is the probability that a Male Child in First Class survived?

  • Recall:

    • Baseline categories (first alphabetically) are not shown.

    • Baseline Class: Crew

    • Baseline Age category: Adult

    • Baseline Gender: Female
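
The alphabetical baseline rule comes from how R orders factor levels by default. The sketch below uses a small hand-typed vector (not the Titanic data) to show this; relevel is a base R function for choosing a different baseline, shown only as an option and not something the lecture code does.

```r
# Small hand-typed example vector (not the full Titanic data)
class_f <- factor(c("Third", "Crew", "First", "Second"))

levels(class_f)        # levels are sorted alphabetically by default
levels(class_f)[1]     # the first level acts as the baseline in glm

# Optional: choose a different baseline with relevel
class_f2 <- relevel(class_f, ref = "Third")
levels(class_f2)[1]
```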

💥 Lecture 18 In-class Exercises - Q6 and Q7 💥

Question 6

Based on the Titanic Logistic Regression Model, what is the probability that a Male Adult in Third Class survived?

Use provided worksheet to do calculation. Round answer to 3 decimal places.


Question 7

Based on the Titanic Logistic Regression Model, what is the probability that a Male Child in First Class survived?

Use provided worksheet to do calculation. Round answer to 3 decimal places.

Two Plots for Context

  • The Titanic data are commonly used because the differences in survival by class and gender are so clear.
  • When looking at probabilities by category, it is also important to look at numbers of observations in each category.

Late Payment Data

Logistic Regression with Quantitative Predictors

  • Data from 7,550 consumers are used to create a logistic regression model

  • The goal is to estimate the probability of late payment on a credit card

  • Predictor (X) variables are:

    • Age
    • Number of Dependents
    • Debt ratio = Total Debt/Total Assets

Logistic Regression with Quantitative Predictors

  • All three variables are QUANTITATIVE
# import data and examine using glimpse
late_payment <- read_csv("data/LatePayment.csv", 
                         show_col_types = FALSE) |>
  glimpse(width=40)
Rows: 7,550
Columns: 4
$ Age          <dbl> 57, 34, 42, 34, 6…
$ Dependents   <dbl> 0, 0, 0, 0, 0, 0,…
$ Debt_Ratio   <dbl> 0.39, 0.51, 0.48,…
$ Late_Payment <chr> "No", "Yes", "No"…

Correlation Matrix - No Multicollinearity

All correlations between X variables are below 0.8 in absolute value (\(|r| < 0.8\))

# examine correlation matrix
late_payment |> select(Age:Debt_Ratio) |>
  cor() |> round(2) |> kable()
              Age  Dependents  Debt_Ratio
Age          1.00       -0.19       -0.08
Dependents  -0.19        1.00        0.09
Debt_Ratio  -0.08        0.09        1.00
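
The multicollinearity check can be automated rather than read off by eye. The sketch below re-enters the correlation matrix shown above by hand so it runs without the data file; with the data loaded, cor_mat would instead come from the cor() call in the lecture code.

```r
# Correlation matrix copied from the output above
vars <- c("Age", "Dependents", "Debt_Ratio")
cor_mat <- matrix(c( 1.00, -0.19, -0.08,
                    -0.19,  1.00,  0.09,
                    -0.08,  0.09,  1.00),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(vars, vars))

# Keep only the correlations between DIFFERENT variables (above the diagonal)
off_diag <- cor_mat[upper.tri(cor_mat)]

max(abs(off_diag))          # largest correlation in absolute value
all(abs(off_diag) < 0.8)    # TRUE means no pair reaches the 0.8 cutoff
```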

Steps to Specify and Summarize Late Payment Model

  1. Create a glm model named latepmt_logistic

  2. Output model results using blr_regress (blorr package) or summary

late_payment <- late_payment |> mutate(Late_PaymentF = factor(Late_Payment))  # create factor variable
latepmt_logistic <- glm(Late_PaymentF ~Age + Dependents + Debt_Ratio,         # specify model
                        data=late_payment, family=binomial(link = 'logit'))
blr_regress(latepmt_logistic) # examine model output

Part 1 of Output (Not Essential for BUA 345)

Part 2 of Output (Only Use Max. Likelihood Estimates in BUA 345)

  • Estimate column shows the estimated beta coefficients for our model.

    • We will use these beta coefficients in Excel to show how regression estimates of log odds and probability are calculated for each observation (consumer) in the data.
  • Pr(>|z|) column shows the P-value for each beta coefficient.

Add Log Odds and Probabilities to Late Payment Data

late_payment <- late_payment |> # create a new variable that shows log odds of late payment for each consumer
  mutate(Log_Odds_LP = predict(latepmt_logistic) |> round(4)) # default type = "link" returns log odds

# convert log odds to probabilities using plogis command
late_payment_table <- late_payment |> mutate(Prob_LP = Log_Odds_LP |> plogis() |> round(4)) |>
  select(!Late_PaymentF)

head(late_payment_table) |> kable() # kable is from the knitr package
Age  Dependents  Debt_Ratio  Late_Payment  Log_Odds_LP  Prob_LP
 57           0        0.39  No                -1.6849   0.1564
 34           0        0.51  Yes               -1.3017   0.2139
 42           0        0.48  No                -1.4215   0.1944
 34           0        0.24  No                -1.6111   0.1664
 66           0        0.15  No                -2.0561   0.1134
 53           0        0.11  No                -1.9631   0.1231
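
The two-step route used above (log odds first, then plogis) should give the same probabilities as asking predict for them directly with type = "response". The sketch below demonstrates this on SIMULATED data, since the LatePayment file is not available here; the variable names, sample size, and coefficients are all made up for illustration and say nothing about the real model.

```r
set.seed(345)                      # make the simulation reproducible

# Simulated data: one quantitative predictor and a binary response
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))

sim_fit <- glm(y ~ x, family = binomial(link = "logit"))

# Route 1: log odds (the default, type = "link"), then convert with plogis
log_odds_hat     <- predict(sim_fit)
prob_from_plogis <- plogis(log_odds_hat)

# Route 2: ask predict for probabilities directly
prob_direct <- predict(sim_fit, type = "response")

all.equal(prob_from_plogis, prob_direct)   # the two routes should agree
```

Keeping the log odds column, as the lecture code does, makes the link between the coefficients and the probabilities easier to trace in a spreadsheet.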

💥 Lecture 18 In-class Exercises - Q8 💥

Late Payment Model Interpretation

We can use an Excel spreadsheet (like we did for MLR) to estimate log odds and probabilities for any combination of predictor values

  • For example:

    • What is the probability of a late payment for a 30-year-old with 1 dependent and a debt ratio of 0.3?

    • What would be the Percent Change in the probability of a late payment if another 30-year-old had 4 dependents and a debt ratio of 0.5? (HW 8)

  • Use provided worksheet to do calculation. Round answer to 3 decimal places.

Key Points from Today

  • Logistic Regression is useful for predicting outcomes

    • Helpful in decision making - provides probabilities

    • Underlying math of GLM is more complex

    • Software (such as R) allows students to bypass math and focus on interpretation

  • Important to understand

    • Difference between Odds and Probabilities

    • How to convert odds to probabilities and vice versa

    • How to convert log odds to probabilities (you can use plogis function in R)

To submit an Engagement Question or Comment about material from today’s lecture, use the link under today’s lecture. Submit by midnight today (day of lecture).