From Linear Models to Logistic Regression

A Complete Guide for Econometrics & Applied Data Analysis

Author

AS

Published

October 23, 2025

Motivation: Why Move Beyond OLS?

Goal: Transition students from the familiar (OLS) to the new (Logit).

Key Ideas

  • In OLS, \(Y_i\) is continuous; in binary models, \(Y_i \in \{0,1\}\)
  • We’re now modeling probabilities: \(P(Y_i=1|X_i)\)
  • OLS can still be applied to a binary outcome (the Linear Probability Model, LPM), but it has major problems

Example

\(Y_i = 1\) if Employed, \(0\) if Not; \(X_i = \text{Years of Education}\)

Discussion Questions

  • What happens if predicted \(Y_i < 0\) or \(> 1\)?
  • Why does OLS assume linearity when probabilities are naturally curved?

Historical Context: OLS → LPM → Logit

Era     Method                           Key Figures                 Motivation
1800s   OLS                              Gauss & Legendre            Fitting continuous data (planetary orbits)
1940s   Linear Probability Model (LPM)   Early econometricians       Simple, intuitive, computationally cheap
1944    Logit introduced                 Joseph Berkson              Bioassay (dose-response data)
1970s   MLE becomes mainstream           Fisher, McFadden, Heckman   Computers enable iterative estimation

Discussion prompt: Why did econometrics stay with OLS for so long even after MLE was discovered?

The Linear Probability Model (LPM)

Model Specification

\[P(Y_i=1|X_i) = \beta_0 + \beta_1 X_i\]

Pros

  • Simple interpretation (\(\beta_1 = \Delta P/\Delta X\))
  • Estimated by OLS

Cons

  • Predicted probabilities can fall below 0 or above 1
  • Constant marginal effects, even at extreme values of X
  • Errors are heteroskedastic by construction
  • OLS is inefficient and its conventional standard errors are invalid; use heteroskedasticity-robust standard errors (see the sketch below)
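
A minimal sketch of the standard-error issue, assuming the sandwich and lmtest packages are available (they are not used elsewhere in these notes); the single-regressor model is purely illustrative:

# Illustrative LPM with heteroskedasticity-robust standard errors
library(sandwich)
library(lmtest)

lpm_demo <- lm(am ~ mpg, data = mtcars)   # P(manual transmission) as a linear function of mpg

summary(lpm_demo)$coefficients            # conventional OLS standard errors
coeftest(lpm_demo, vcov = vcovHC(lpm_demo, type = "HC1"))   # robust (HC1) standard errors

range(fitted(lpm_demo))                   # fitted "probabilities" can fall outside [0, 1]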

Visual Comparison

# Sample visualization code
library(ggplot2)

# Generate data: a linear "probability" line vs. the logistic S-curve
df <- data.frame(x = seq(-5, 5, length.out = 100))
df$lpm   <- 0.5 + 0.1 * df$x
df$logit <- 1 / (1 + exp(-df$x))

# Plot both curves on the same axes
ggplot(df) +
  geom_line(aes(x = x, y = lpm, color = "LPM"), linewidth = 1) +
  geom_line(aes(x = x, y = logit, color = "Logit"), linewidth = 1) +
  ylim(c(-0.2, 1.2)) +
  labs(x = "X", y = "P(Y=1)", color = "Model") +
  theme_minimal()
Figure 1: Linear Probability Model vs Logistic S-curve

Introducing the Logit Model

Step 1: Define the Probability Model

\[P_i = P(Y_i = 1 | X_i) = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}\]

Step 2: Transform to Linear Form

Take log-odds:

\[\log\left( \frac{P_i}{1 - P_i} \right) = X_i\beta\]

Interpretation

  • Left-hand side: log-odds (logit)
  • Right-hand side: linear predictor (same structure as OLS)
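
A quick numerical check of these two steps, using base R only (plogis() is the logistic CDF and qlogis() its inverse, the logit):

# The logistic transform and its inverse, checked numerically
xb <- seq(-4, 4, by = 1)         # a grid of values for the linear predictor X_i * beta

# Step 1: map the linear predictor into a probability in (0, 1)
p <- exp(xb) / (1 + exp(xb))     # equivalently plogis(xb)

# Step 2: the log-odds recover the linear predictor exactly
log_odds <- log(p / (1 - p))     # equivalently qlogis(p)

cbind(xb, p, log_odds)           # log_odds equals xb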

Likelihood and Maximum Likelihood Estimation

Likelihood for One Observation

\[ L_i(\beta) = p_i^{Y_i}(1-p_i)^{1-Y_i} \]

The likelihood expresses how plausible the parameter \(\beta\) makes the observed outcome \(Y_i\).
If \(Y_i=1\), it equals \(p_i\); if \(Y_i=0\), it equals \(1-p_i\).

https://en.wikipedia.org/wiki/Bernoulli_distribution
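
A one-line check of this expression with base R's dbinom() (a Bernoulli variable is a binomial with size = 1); the value of \(p_i\) here is an arbitrary illustration:

# Bernoulli pmf p^y * (1 - p)^(1 - y), verified with dbinom(size = 1)
p <- 0.7
c(dbinom(1, size = 1, prob = p),   # Y_i = 1  ->  p_i      = 0.7
  dbinom(0, size = 1, prob = p))   # Y_i = 0  ->  1 - p_i  = 0.3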


Joint Likelihood (All Observations)

\[ L(\beta) = \prod_{i=1}^{n} p_i^{Y_i}(1-p_i)^{1-Y_i} \]

Assuming observations are independent, we multiply individual likelihoods.
This gives the probability of observing the entire dataset, given the model.


Log-Likelihood

\[ \ell(\beta) = \sum_{i=1}^{n} \left[Y_i \log(p_i) + (1-Y_i)\log(1-p_i)\right] \]

Taking logs converts products into sums and is numerically more stable.
Maximizing the log-likelihood gives the same solution as maximizing the original likelihood.


MLE Objective

\[ \hat{\beta} = \arg\max_\beta \ell(\beta) \]

The MLE chooses the \(\beta\) that makes the observed 0s and 1s most probable.
No closed-form solution exists → we use iterative optimization (Newton–Raphson / IRLS).
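
A minimal sketch of this objective in base R, maximizing the Bernoulli log-likelihood with optim() for an illustrative one-regressor model (am on mpg from mtcars); in practice glm() does the same job via IRLS:

# Maximizing the Bernoulli log-likelihood numerically with optim()
y <- mtcars$am
X <- cbind(1, mtcars$mpg)                   # intercept + one regressor

neg_loglik <- function(beta) {
  p <- plogis(X %*% beta)                   # p_i = exp(X_i beta) / (1 + exp(X_i beta))
  -sum(y * log(p) + (1 - y) * log(1 - p))   # negative log-likelihood
}

fit_optim <- optim(c(0, 0), neg_loglik, method = "BFGS")
fit_glm   <- glm(am ~ mpg, data = mtcars, family = binomial)

rbind(optim = fit_optim$par, glm = coef(fit_glm))   # the two sets of estimates agree closely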

MLE in Broader Context

Comparing OLS and MLE

Aspect              OLS                                   MLE
Objective           Minimize squared errors               Maximize likelihood of observed data
Assumptions         Normal errors (for exact inference)   Distribution depends on Y
Works for           Continuous Y                          Any distribution (Bernoulli, Poisson, etc.)
Estimation          Closed form                           Iterative (see the sketch below)
Historical origin   Gauss & Legendre (1800s)              Fisher (1920s)
Output              \(\hat{\beta}\), residual variance    \(\hat{\beta}\), log-likelihood, standard errors
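
The "closed form vs. iterative" row can be made concrete with a small sketch (illustrative one-regressor model, base R only):

# OLS: explicit solution of the normal equations
X <- cbind(1, mtcars$mpg)
beta_ols <- drop(solve(t(X) %*% X, t(X) %*% mtcars$am))
cbind(closed_form = beta_ols, lm = coef(lm(am ~ mpg, data = mtcars)))

# MLE: glm() iterates (Fisher scoring / IRLS) and reports a log-likelihood
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
fit$iter        # number of Fisher scoring iterations
logLik(fit)     # log-likelihood at the MLE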

Interpretation in Logit

Key Interpretations

  • \(\beta_j\): Change in log-odds for a one-unit change in \(X_j\)
  • \(e^{\beta_j}\): Odds ratio
  • Marginal effects: Change in predicted probability (evaluated at the mean of X, or averaged across observations)

Example

If \(e^{\beta_1} = 1.5\):

→ Each extra year of education multiplies odds of being employed by 1.5
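
A minimal sketch of these quantities in R, using a simple illustrative logit of am on mpg from mtcars (not the employment example, and not the larger model estimated below):

# Log-odds coefficients, odds ratios, and a marginal effect at the mean
fit <- glm(am ~ mpg, data = mtcars, family = binomial)

coef(fit)        # beta_j: change in log-odds per one-unit change in X_j
exp(coef(fit))   # odds ratios: multiplicative change in the odds

# Marginal effect of mpg on P(Y = 1), evaluated at the sample mean of mpg
xbar  <- mean(mtcars$mpg)
p_hat <- plogis(coef(fit)[1] + coef(fit)[2] * xbar)
unname(p_hat * (1 - p_hat) * coef(fit)[2])   # dP/dX = p(1 - p) * beta_1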

Model Assessment

Tool                              Purpose
Confusion matrix                  Tabulate predicted vs. actual 0/1 outcomes at a chosen cutoff
Accuracy, Precision, Recall, F1   Performance metrics
ROC curve, AUC                    Tradeoff between true and false positives across cutoffs
Pseudo-R²                         Goodness of fit (e.g., McFadden R²; sketched below)
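
McFadden's pseudo-R² does not appear in the example code below, so here is a minimal sketch for the illustrative am-on-mpg logit: it compares the fitted log-likelihood with that of an intercept-only model.

# McFadden pseudo-R-squared = 1 - logLik(full model) / logLik(null model)
fit  <- glm(am ~ mpg, data = mtcars, family = binomial)
null <- glm(am ~ 1,   data = mtcars, family = binomial)
1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))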

Summary Comparison

Concept                  OLS                    LPM                               Logit
Outcome                  Continuous             Binary                            Binary
Estimation               Least Squares          Least Squares                     Maximum Likelihood
Range of fitted values   (−∞, ∞)                (−∞, ∞); can fall outside [0, 1]  (0, 1)
Efficiency               BLUE                   Inefficient                       Asymptotically efficient (MLE)
Interpretation           Marginal change in Y   Marginal change in P              Change in log-odds / odds ratio
Common use               Linear models          Simple binary outcomes            Modern discrete choice / binary models

Example Code

# Data: the built-in mtcars dataset (see ?mtcars for variable descriptions)
data(mtcars)

# 1. Linear Probability Model
lpm <- lm(am ~ mpg + cyl + disp + gear + vs, 
          data = mtcars
          )
summary(lpm)

Call:
lm(formula = am ~ mpg + cyl + disp + gear + vs, data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56766 -0.13390  0.00638  0.16812  0.51349 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.0862077  0.8791623  -1.236 0.227691    
mpg          0.0307034  0.0167416   1.834 0.078131 .  
cyl         -0.0422230  0.0840822  -0.502 0.619779    
disp        -0.0004265  0.0010237  -0.417 0.680371    
gear         0.3810048  0.0867819   4.390 0.000168 ***
vs          -0.3878550  0.1813004  -2.139 0.041970 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2749 on 26 degrees of freedom
Multiple R-squared:  0.7454,    Adjusted R-squared:  0.6965 
F-statistic: 15.23 on 5 and 26 DF,  p-value: 5.09e-07
# 2. Logistic Regression
logit <- glm(am ~ mpg + cyl + disp + gear + vs, 
             data = mtcars,
             family = binomial(link = "logit")
             )
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit)

Call:
glm(formula = am ~ mpg + cyl + disp + gear + vs, family = binomial(link = "logit"), 
    data = mtcars)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    225.636 412601.332   0.001    1.000
mpg             -6.290   7069.826  -0.001    0.999
cyl              5.619  38279.009   0.000    1.000
disp            -1.677   1188.168  -0.001    0.999
gear            88.120  65804.055   0.001    0.999
vs            -242.077 198825.189  -0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4.3230e+01  on 31  degrees of freedom
Residual deviance: 3.8107e-09  on 26  degrees of freedom
AIC: 12

Number of Fisher Scoring iterations: 25

Note: the two warnings above signal (quasi-)complete separation: these five regressors can split the 0s and 1s perfectly, so the coefficients drift toward ±∞, the standard errors explode, and the estimates should not be interpreted. The near-zero average marginal effects and the perfect in-sample classification further below are artifacts of the same problem.
# 3. Compare predictions
mtcars$pred_lpm <- predict(lpm)
mtcars$pred_logit <- predict(logit, type = "response")


# 4. Average marginal effects for the logit model (see ?margins)
library(margins)
margins(logit)
Average marginal effects
glm(formula = am ~ mpg + cyl + disp + gear + vs, family = binomial(link = "logit"),     data = mtcars)
        mpg       cyl       disp      gear         vs
 -3.748e-10 3.346e-10 -1.009e-10 5.264e-09 -1.443e-08

Compare Predictions & Evaluate Model Performance

# Predicted probabilities
mtcars$pred_lpm   <- predict(lpm)
mtcars$pred_logit <- predict(logit, type = "response")

# Clamp LPM predictions to [0,1] so they can be treated as probabilities
mtcars$pred_lpm <- pmin(pmax(mtcars$pred_lpm, 0), 1)

# --- 1. Confusion Matrices ---
# Threshold (you can adjust)
thresh <- 0.5

# Predicted classes
mtcars$pred_lpm_class   <- ifelse(mtcars$pred_lpm   >= thresh, 1, 0)
mtcars$pred_logit_class <- ifelse(mtcars$pred_logit >= thresh, 1, 0)

# Confusion matrix for LPM
cat("\n=== Confusion Matrix: Linear Probability Model ===\n")

=== Confusion Matrix: Linear Probability Model ===
table(Predicted = mtcars$pred_lpm_class, Actual = mtcars$am)
         Actual
Predicted  0  1
        0 17  1
        1  2 12
# Confusion matrix for Logit
cat("\n=== Confusion Matrix: Logistic Regression ===\n")

=== Confusion Matrix: Logistic Regression ===
table(Predicted = mtcars$pred_logit_class, Actual = mtcars$am)
         Actual
Predicted  0  1
        0 19  0
        1  0 13
# --- 2. Accuracy, Precision, Recall (for reference) ---
library(caret)
Loading required package: lattice
cat("\n=== LPM Classification Report ===\n")

=== LPM Classification Report ===
confusionMatrix(as.factor(mtcars$pred_lpm_class), as.factor(mtcars$am), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 17  1
         1  2 12
                                          
               Accuracy : 0.9062          
                 95% CI : (0.7498, 0.9802)
    No Information Rate : 0.5938          
    P-Value [Acc > NIR] : 0.000105        
                                          
                  Kappa : 0.808           
                                          
 Mcnemar's Test P-Value : 1.000000        
                                          
            Sensitivity : 0.9231          
            Specificity : 0.8947          
         Pos Pred Value : 0.8571          
         Neg Pred Value : 0.9444          
             Prevalence : 0.4062          
         Detection Rate : 0.3750          
   Detection Prevalence : 0.4375          
      Balanced Accuracy : 0.9089          
                                          
       'Positive' Class : 1               
                                          
cat("\n=== Logit Classification Report ===\n")

=== Logit Classification Report ===
confusionMatrix(as.factor(mtcars$pred_logit_class), as.factor(mtcars$am), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 19  0
         1  0 13
                                     
               Accuracy : 1          
                 95% CI : (0.8911, 1)
    No Information Rate : 0.5938     
    P-Value [Acc > NIR] : 5.693e-08  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.4062     
         Detection Rate : 0.4062     
   Detection Prevalence : 0.4062     
      Balanced Accuracy : 1.0000     
                                     
       'Positive' Class : 1          
                                     
# --- 3. ROC Curves and AUC ---
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
# ROC curve objects
roc_lpm   <- roc(mtcars$am, mtcars$pred_lpm)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
roc_logit <- roc(mtcars$am, mtcars$pred_logit)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# Plot ROC curves
plot(roc_lpm, col = "blue", lwd = 2, main = "ROC Curve: LPM vs Logit",
     legacy.axes = TRUE)
plot(roc_logit, col = "red", lwd = 2, add = TRUE)
abline(a = 0, b = 1, lty = 2, col = "gray")

legend("bottomright",
       legend = c(
         paste("LPM  (AUC =", round(auc(roc_lpm), 3), ")"),
         paste("Logit (AUC =", round(auc(roc_logit), 3), ")")
       ),
       col = c("blue", "red"), lwd = 2)

# --- Optional: Display AUC values ---
cat("\nAUC (LPM):",  round(auc(roc_lpm), 3))

AUC (LPM): 0.988
cat("\nAUC (Logit):", round(auc(roc_logit), 3))

AUC (Logit): 1
  • Confusion Matrix:
    Counts correct vs incorrect classifications at a cutoff (here 0.5).

  • ROC Curve:
    Plots True Positive Rate (Sensitivity) vs False Positive Rate (1 − Specificity)
    across all possible cutoffs — so it shows how well the model discriminates overall.

  • AUC (Area Under Curve):
    A summary measure of model performance.

    • 0.5 → random guessing
    • 1.0 → perfect prediction
    • Logit typically yields a higher AUC than the LPM; here it reaches exactly 1 because of the separation problem flagged earlier.

Extension Topics

Optional Advanced Material

  • Probit model: Logit’s sibling (Normal CDF instead of logistic; see the short sketch after this list)
  • Multinomial Logit: Categorical Y with >2 outcomes
  • Conditional Logit: McFadden’s model for discrete choice (transport, job, etc.)
  • Log-likelihood ratio tests, AIC/BIC
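
As a pointer for the probit item above, only the link function changes in R (illustrative formula, reusing the mtcars example):

# Probit: same linear index, standard Normal CDF instead of the logistic CDF
probit <- glm(am ~ mpg, data = mtcars, family = binomial(link = "probit"))
summary(probit)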

Closing Thought

“OLS taught us how to estimate a mean.
MLE and Logit taught us how to estimate a probability—
and how to think in terms of likelihood, not just distance from a line.”