2024-11-15

Introduction

Logistic Regression is a statistical method for modeling the probability of a binary outcome based on one or more predictor variables.

Mathematically: \[ P(y = 1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

Use Case in Statistics

  • Predicting binary outcomes
  • Example: Predicting if a patient has a disease (1 = Yes, 0 = No)
  • Common applications:
    • Medicine (e.g., diagnosis predictions)
    • Marketing (e.g., predicting if a customer will buy a product)
    • Machine Learning (e.g., classification problems)

Dataset Overview

We will use the mtcars dataset and modify it for a binary classification problem.

# Create binary outcome: mpg greater than median
mtcars$mpg_binary <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)
summary(mtcars$mpg_binary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4688  1.0000  1.0000

Scatterplot Code with Logistic Regression Fit

# Scatterplot with logistic fit
ggplot(mtcars, aes(x = wt, y = mpg_binary)) +
  geom_jitter(width = 0.1, height = 0.1, color = "darkblue", size = 3) +
  stat_smooth(method = "glm", method.args = list(family = "binomial"), color = "blue") +
  labs(title = "Scatterplot of Weight vs Binary MPG",
       x = "Weight of the car",
       y = "Probability of High MPG") +
  theme(plot.title = element_text(size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))

Scatter plot

Model Summary

We fit a logistic regression model to predict the binary outcome (mpg_binary) using weight (wt).

The model is expressed as: \[ P(y = 1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

Where: - \(P(y = 1|X)\): Probability of a high MPG. - \(\beta_0\): Intercept term. - \(\beta_1\): Coefficient for wt (weight of the car). - \(X\): Predictor variable (weight of the car).

For every unit increase in weight, the odds ratio changes by: \[ e^{\beta_1} \]

Model Summary

## 
## Call:
## glm(formula = mpg_binary ~ wt, family = "binomial", data = mtcars)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    65.81      44.45    1.48    0.139
## wt            -20.35      13.85   -1.47    0.142
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 44.2363  on 31  degrees of freedom
## Residual deviance:  5.2953  on 30  degrees of freedom
## AIC: 9.2953
## 
## Number of Fisher Scoring iterations: 10

Residual Plot (ggplot2)

Correlation Heatmap

Conclusion

Key Insights:

Weight (wt) has a strong negative relationship with the probability of achieving high miles per gallon (mpg_binary). Logistic regression provides a robust method for modeling binary outcomes.

Next Steps:

Add more predictors (e.g., horsepower, number of cylinders) for a multiple logistic regression analysis. Validate the model using cross-validation to assess performance on unseen data.