Different ways to interpret logistic regression results

Interpreting the results of a logistic regression model involves understanding how the predictor variables affect the probability of the outcome event. Here are some common ways to interpret logistic regression results:

Coefficients and Odds Ratios

Coefficients: These represent the change in the log-odds of the outcome for a one-unit increase in the predictor variable, holding the other predictors constant. A positive coefficient indicates that an increase in the predictor raises the log-odds (and therefore the probability) of the outcome, while a negative coefficient indicates a decrease.
Odds Ratios: These are calculated by exponentiating the coefficients (e.g., $e^{\beta}$). They represent the multiplicative change in the odds of the outcome for a one-unit increase in the predictor. An odds ratio greater than 1 indicates an increase in the odds, an odds ratio less than 1 indicates a decrease, and an odds ratio of exactly 1 indicates no change.

Effects Plots

Plotting the predicted probabilities against the predictor variables can help visualize the relationship between them.

General Example:

Let’s say you have a logistic regression model predicting the likelihood of a customer purchasing a product based on their age and income.

Coefficient for age: 0.05
Odds ratio for age: $e^{0.05} = 1.05$

This means that for every one-year increase in age, holding income constant, the odds of purchasing the product are multiplied by 1.05, an increase of about 5% ($100 * (e^{\beta} - 1)$).
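The arithmetic above can be sketched in a few lines of R. The variable names are illustrative, and the 10-year comparison is an added assumption, included only to show that effects multiply on the odds scale.

```r
# Coefficient for age from the general example (log-odds per one-year increase)
beta_age <- 0.05

# Odds ratio for a one-unit (one-year) increase
odds_ratio <- exp(beta_age)

# Percent change in the odds: 100 * (e^beta - 1)
percent_change <- 100 * (odds_ratio - 1)

# Odds ratio for a k-unit increase (k = 10 is an illustrative choice)
k <- 10
odds_ratio_k <- exp(k * beta_age)

round(c(odds_ratio = odds_ratio,
        percent_change = percent_change,
        odds_ratio_10yr = odds_ratio_k), 3)
```

A 10-year difference multiplies the odds by $e^{10 \beta}$, which is the one-year odds ratio raised to the 10th power, not 10 times the one-year odds ratio.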

Important Considerations:

Correlation vs. Causation: Logistic regression can only show associations between variables, not causal relationships.
Linearity Assumption: The relationship between the log-odds of the outcome and the predictor variables is assumed to be linear.
Multicollinearity: High correlation between predictor variables can affect the interpretation of the coefficients.

By considering these different aspects, you can gain a comprehensive understanding of the results of your logistic regression model and draw meaningful conclusions from your data.
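One way to check the multicollinearity point is to compute variance inflation factors (VIFs). The sketch below uses only base R and assumes all predictors are numeric. The helper name `vif_manual()` and the simulated variables `x1`, `x2`, `x3` are hypothetical and not part of the examples in this document. A common rule of thumb treats values above roughly 5 to 10 as a warning sign, but that threshold is a convention rather than a strict cutoff.

```r
# Hand-rolled variance inflation factors (VIF) using only base R.
# vif_manual() is a hypothetical helper, not a function from any package.
vif_manual <- function(predictors) {
  predictors <- as.data.frame(predictors)
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    # Regress predictor v on all remaining predictors
    fit <- lm(reformulate(others, response = v), data = predictors)
    # VIF = 1 / (1 - R^2) of that auxiliary regression
    1 / (1 - summary(fit)$r.squared)
  })
}

# Illustrative data: x2 is constructed to be strongly correlated with x1
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.3)
x3 <- rnorm(200)

vifs <- vif_manual(data.frame(x1, x2, x3))
round(vifs, 2)
```

In this constructed example, `x1` and `x2` should show clearly inflated values while `x3` stays close to 1.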

Example: Predicting Student Success

A logistic regression model predicts the likelihood of a student passing a course based on the number of hours studied and whether they attended a review session (yes/no).

# Set seed for reproducibility
set.seed(42)

# Number of observations
n <- 500

# Generate random data for hours studied (uniform distribution)
hours_studied <- runif(n, min = 1, max = 20)

# Generate random data for review session attendance (categorical)
review_session <- sample(c("yes", "no"), n, replace = TRUE, prob = c(0.6, 0.4))

# Create dummy variable for review session
review_attended <- ifelse(review_session == "yes", 1, 0)

# Generate probabilities for passing based on hours studied and review session
# (true coefficients: 0.2 per hour studied, 1.5 for attending the review session, no intercept)
log_odds <- 0.2 * hours_studied + 1.5 * review_attended
probability <- 1 / (1 + exp(-log_odds))

# Generate pass outcome (0 or 1) based on probabilities
pass <- rbinom(n, 1, probability)

# Create a data frame
df <- data.frame(hours_studied, review_session, review_attended, pass)
df$review_attended <- as.factor(df$review_attended)
df$pass <- as.factor(df$pass)

# Fit logistic regression model
model <- glm(pass ~ hours_studied + review_attended, data = df, family = "binomial")

# Print model summary
summary(model)

## 
## Call:
## glm(formula = pass ~ hours_studied + review_attended, family = "binomial", 
##     data = df)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.55444    0.34120   1.625 0.104168    
## hours_studied     0.15572    0.03801   4.097 4.19e-05 ***
## review_attended1  1.25495    0.36097   3.477 0.000508 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 273.86  on 499  degrees of freedom
## Residual deviance: 239.14  on 497  degrees of freedom
## AIC: 245.14
## 
## Number of Fisher Scoring iterations: 6

Odds Ratios

`hours studied` (quantitative)

Coefficient for hours studied: 0.156
Odds ratio for hours studied: $e^{0.156} \approx 1.169$

This means that for every additional hour studied, the odds of passing the course are multiplied by about 1.17, an increase of roughly 17%, holding review attendance constant.

`review attended` (qualitative)

Coefficient for review attended: 1.25
Odds ratio for review attended: $e^{1.25} = 3.49$

This means that students who attended the review session have 3.49 times the odds of passing compared to those who did not.
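Point estimates of odds ratios are usually reported with an interval. The sketch below exponentiates the coefficients together with Wald confidence limits from `confint.default()`. The helper name `odds_ratio_table()` is hypothetical, and the small simulated fit uses its own object names and a similar (not identical) data-generating recipe so that the block runs on its own. Its numbers will therefore not match the summary above.

```r
# odds_ratio_table() is a hypothetical helper name, not part of any package.
odds_ratio_table <- function(fit, level = 0.95) {
  est <- coef(fit)
  # Wald interval on the log-odds scale, then exponentiate to the odds scale
  ci <- confint.default(fit, level = level)
  out <- exp(cbind(odds_ratio = est, lower = ci[, 1], upper = ci[, 2]))
  round(out, 3)
}

# Small illustrative fit so the block is self-contained
set.seed(42)
hours <- runif(500, min = 1, max = 20)
attended <- rbinom(500, 1, 0.6)
passed <- rbinom(500, 1, plogis(0.2 * hours + 1.5 * attended))
demo_fit <- glm(passed ~ hours + attended, family = "binomial")

or_table <- odds_ratio_table(demo_fit)
or_table
```

Applied to the fitted object from this example, the call would be `odds_ratio_table(model)`. An interval that excludes 1 corresponds to a coefficient whose Wald interval excludes 0.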

Effects Plots

# Call necessary libraries
library(ggeffects)
library(cowplot)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggeffects':
## 
##     get_title

# Obtain predicted probabilities for each term separately
prediction_hours<-ggpredict(model, terms="hours_studied [all]")
prediction_review<-ggpredict(model, terms="review_attended")

# Plot each term individually
plot_hours<-plot(prediction_hours)
plot_review<-plot(prediction_review)

# Combine plots into a single figure
plot_grid(plot_hours, plot_review)
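The curves in an effects plot are predicted probabilities, and they can be approximated by hand from the printed coefficients. The sketch below plugs the estimates from the summary into the inverse logit (`plogis()`); the grid of hours is an illustrative choice. It is not a reproduction of what `ggpredict()` does internally, since that function applies its own defaults for the predictors that are not being plotted.

```r
# Estimates copied from the model summary above
b0 <- 0.55444        # (Intercept)
b_hours <- 0.15572   # hours_studied
b_review <- 1.25495  # review_attended1

# Grid of hours studied, evaluated for both levels of review attendance
grid <- expand.grid(hours_studied = c(1, 5, 10, 15, 20),
                    review_attended = c(0, 1))

# Linear predictor (log-odds), then the inverse logit to get probabilities
grid$log_odds <- b0 + b_hours * grid$hours_studied + b_review * grid$review_attended
grid$probability <- plogis(grid$log_odds)

# Odds at each grid point, and the attended / not-attended odds ratio per hour value
grid$odds <- grid$probability / (1 - grid$probability)
ratio <- grid$odds[grid$review_attended == 1] / grid$odds[grid$review_attended == 0]

grid
ratio
```

The ratio of odds is the same at every value of hours studied and equals the exponentiated coefficient, while the gap in probabilities between the two groups changes along the curve. This is why odds ratios and effects plots complement each other.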

Example: Predicting Loan Default

A bank uses a logistic regression model to predict the likelihood of a customer defaulting on a loan based on their credit score and income.

# Set seed for reproducibility
set.seed(999)

# Number of observations
n <- 6000

# Generate random data for credit score (normally distributed)
credit_score <- rnorm(n, mean = 650, sd = 50)

# Generate random data for income (log-normal distribution)
income <- exp(rnorm(n, mean = 10, sd = 1)) 

# Generate probabilities for default based on credit score and income
# (using adjusted coefficients and a base log-odds)
base_log_odds <- 40  # Add a base log-odds to shift the probabilities
log_odds <- base_log_odds - 0.01 * credit_score - 0.005 * income 
probability <- 1 / (1 + exp(-log_odds))

# Generate default outcome (0 or 1) based on probabilities
default <- factor(rbinom(n, 1, probability))
levels(default) <- c("No", "Yes")

# Create a data frame
df <- data.frame(credit_score, income, default)

# Plot it for fun
library(ggplot2)
ggplot(df, aes(y = default, x = credit_score)) +
  geom_jitter(width = 0.05, height = 0.05, alpha = 0.5) +
  theme_classic(base_size = 15)

# Fit logistic regression model
model <- glm(default ~ credit_score + income, data = df, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Print model summary
summary(model)

## 
## Call:
## glm(formula = default ~ credit_score + income, family = "binomial", 
##     data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  42.082308   4.698374   8.957  < 2e-16 ***
## credit_score -0.012690   0.003809  -3.331 0.000864 ***
## income       -0.005031   0.000472 -10.659  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4391.13  on 5999  degrees of freedom
## Residual deviance:  232.48  on 5997  degrees of freedom
## AIC: 238.48
## 
## Number of Fisher Scoring iterations: 15

Odds Ratios:

`credit_score`

Coefficient for credit score: -0.013
Odds ratio for credit score: $e^{-0.013} \approx 0.987$

This means that for every one-point increase in credit score, the odds of defaulting are multiplied by about 0.987, a decrease of roughly 1.3%, holding income constant.

`income`

Coefficient for income: -0.005
Odds ratio for income: $e^{-0.005} = 0.9950125$

This means that for every one-dollar increase in income, the odds of defaulting are multiplied by about 0.995, a decrease of roughly 0.5%, holding credit score constant.
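A one-dollar change is a very small unit for income, so it can help to express the odds ratio for a larger increment. The sketch below rescales the income estimate from the summary above. The increments are illustrative choices, not values taken from the original analysis.

```r
# Estimate for income copied from the model summary above (log-odds per dollar)
b_income <- -0.005031

# Income increments to compare; these amounts are illustrative choices
increments <- c(1, 100, 500, 1000)

# Odds ratio for an increase of `increment` dollars: exp(increment * beta)
rescaled <- data.frame(
  increase_usd = increments,
  odds_ratio = exp(increments * b_income)
)
rescaled$percent_change <- 100 * (rescaled$odds_ratio - 1)

rescaled
```

The same result can be obtained by rescaling the predictor before fitting (for example, income in hundreds of dollars), which multiplies the coefficient by the same factor.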

Effects Plots

# Call necessary libraries
library(ggeffects)
library(cowplot)

# Obtain predicted probabilities for each term separately
prediction_credit<-ggpredict(model, terms="credit_score [all]") # using [all] gets smooth plots.
prediction_income<-ggpredict(model, terms="income [all]")

# Plot each term individually
plot_credit<-plot(prediction_credit)
plot_income<-plot(prediction_income)

# Combine plots into a single figure
plot_grid(plot_credit, plot_income)

Example: Predicting Click-Through Rates

An online advertiser uses logistic regression to predict the likelihood of a user clicking on an ad based on the ad’s position on the page and the user’s age.

# Set seed for reproducibility
set.seed(777)

# Number of observations
n <- 2000

# Generate random data for ad position (uniformly distributed)
ad_position <- sample(1:5, n, replace = TRUE)  # Assuming 5 ad positions

# Generate random data for user age (normally distributed)
user_age <- rnorm(n, mean = 30, sd = 8)

# Generate probabilities for click-through based on ad position and user age
# (true coefficients: -0.5 per ad position, 0.02 per year of age, no intercept)
log_odds <- -0.5 * ad_position + 0.02 * user_age 
probability <- 1 / (1 + exp(-log_odds))

# Generate click-through outcome (0 or 1) based on probabilities
click <- as.factor(rbinom(n, 1, probability))

# Create a data frame
df <- data.frame(ad_position, user_age, click)
df$ad_position<-as.factor(df$ad_position)

# Fit logistic regression model
model <- glm(click ~ ad_position + user_age, data = df, family = "binomial")

# Print model summary
summary(model)

## 
## Call:
## glm(formula = click ~ ad_position + user_age, family = "binomial", 
##     data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.583076   0.215397  -2.707  0.00679 ** 
## ad_position2 -0.310708   0.139656  -2.225  0.02609 *  
## ad_position3 -0.764950   0.147841  -5.174 2.29e-07 ***
## ad_position4 -1.351590   0.164504  -8.216  < 2e-16 ***
## ad_position5 -1.615061   0.173516  -9.308  < 2e-16 ***
## user_age      0.014858   0.006318   2.352  0.01869 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2450.2  on 1999  degrees of freedom
## Residual deviance: 2302.8  on 1994  degrees of freedom
## AIC: 2314.8
## 
## Number of Fisher Scoring iterations: 4

Odds Ratios:

`ad position`

Coefficient for ad position 2: -0.31
Odds ratio for ad position 2 (relative to position 1): $e^{-0.31} = 0.733447$

This means that the odds of clicking on the ad in position 2 are 0.73 times the odds of clicking on an ad in position 1, holding user age constant. This is equivalent to saying the odds of clicking on the ad are 27% lower for an ad in position 2 compared to position 1.
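Because `ad_position` is a factor, each coefficient compares one position with the reference level (position 1). The sketch below exponentiates all four estimates from the summary above and also shows how two non-reference levels can be compared by exponentiating the difference of their coefficients. The choice of position 5 versus position 2 is illustrative.

```r
# Estimates copied from the model summary above; position 1 is the reference level
b_position <- c(ad_position2 = -0.310708, ad_position3 = -0.764950,
                ad_position4 = -1.351590, ad_position5 = -1.615061)

# Odds ratio of each position relative to position 1
or_vs_position1 <- exp(b_position)

# Odds ratio between two non-reference levels: exponentiate the difference
# of their coefficients (here position 5 relative to position 2)
or_5_vs_2 <- exp(b_position[["ad_position5"]] - b_position[["ad_position2"]])

round(or_vs_position1, 3)
round(or_5_vs_2, 3)
```

These are point estimates only. A confidence interval for the position 5 versus position 2 contrast would also need the covariance between the two coefficients, which the printed summary does not show.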

`user age`

Coefficient for user age: 0.015
Odds ratio for user age: $e^{0.015} = 1.015$

This means that for every one-year increase in user age, the odds of clicking on the ad are multiplied by about 1.015, an increase of roughly 1.5%, holding ad position constant.

Effects Plots

# Obtain predicted probabilities for each term separately
prediction_ad<-ggpredict(model, terms="ad_position") # ad_position is a factor, so each level is predicted.
prediction_age<-ggpredict(model, terms="user_age [all]") # using [all] gets smooth plots.

# Plot each term individually
plot_ad<-plot(prediction_ad)
plot_age<-plot(prediction_age)

# Combine plots into a single figure
plot_grid(plot_ad, plot_age)
