# Sample visualization code
library(ggplot2)
# Generate data
set.seed(123)
x <- seq(-5, 5, length.out = 100)
lpm <- 0.5 + 0.1 * x
logit <- 1 / (1 + exp(-x))
# Plot
ggplot() +
  geom_line(aes(x = x, y = lpm, color = "LPM"), linewidth = 1) +
  geom_line(aes(x = x, y = logit, color = "Logit"), linewidth = 1) +
  ylim(c(-0.2, 1.2)) +
  labs(x = "X", y = "P(Y=1)", color = "Model") +
  theme_minimal()

From Linear Models to Logistic Regression
A Complete Guide for Econometrics & Applied Data Analysis
Motivation: Why Move Beyond OLS?
Goal: Transition students from the familiar (OLS) to the new (Logit).
Key Ideas
- In OLS, \(Y_i\) is continuous; in binary models, \(Y_i \in \{0,1\}\)
- We’re now modeling probabilities: \(P(Y_i=1|X_i)\)
- The OLS model (Linear Probability Model, LPM) can handle this—but with major problems
Example
\(Y_i = 1\) if Employed, \(0\) if Not; \(X_i = \text{Years of Education}\)
Discussion Questions
- What happens if predicted \(Y_i < 0\) or \(> 1\)?
- Why does OLS assume linearity when probabilities are naturally curved?
Historical Context: OLS → LPM → Logit
| Era | Method | Key Figures | Motivation |
|---|---|---|---|
| 1800s | OLS | Gauss & Legendre | Fitting continuous data (planetary orbits) |
| 1940s | Linear Probability Model (LPM) | Early econometricians | Simple, intuitive, computationally cheap |
| 1944 | Logit introduced | Joseph Berkson | Bioassay (dose-response data) |
| 1970s | MLE becomes mainstream | Fisher, McFadden, Heckman | Computers enable iterative estimation |
Discussion prompt: Why did econometrics stay with OLS for so long even after MLE was discovered?
The Linear Probability Model (LPM)
Model Specification
\[P(Y_i=1|X_i) = \beta_0 + \beta_1 X_i\]
Pros
- Simple interpretation (\(\beta_1 = \Delta P/\Delta X\))
- Estimated by OLS
Cons
- Predicted probabilities < 0 or > 1
- Constant marginal effects
- Heteroskedastic errors by construction: \(\text{Var}(Y_i|X_i) = P_i(1-P_i)\)
- Inefficient, with invalid conventional standard errors (robust SEs are only a partial fix; see the sketch below)
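Two of these problems can be checked directly. The following is a minimal sketch, assuming the sandwich and lmtest packages are installed; the specification am ~ mpg + hp on the built-in mtcars data is purely illustrative.

# Sketch: inspect LPM fitted values and compute heteroskedasticity-robust SEs
library(sandwich)
library(lmtest)

lpm_demo <- lm(am ~ mpg + hp, data = mtcars)   # am is a 0/1 outcome

range(fitted(lpm_demo))   # check whether any fitted "probabilities" fall outside [0, 1]

# Heteroskedasticity-robust (HC1) standard errors as a partial remedy
coeftest(lpm_demo, vcov. = vcovHC(lpm_demo, type = "HC1"))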
Visual Comparison
The sample visualization code at the top of these notes plots the two fits on one set of axes: the LPM is a straight line, which escapes the \([0,1]\) range once \(X\) is extreme enough, while the logit is an S-shaped curve bounded between 0 and 1.
Introducing the Logit Model
Step 1: Define the Probability Model
\[P_i = P(Y_i = 1 | X_i) = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}\]
Step 2: Transform to Linear Form
Take log-odds:
\[\log\left( \frac{P_i}{1 - P_i} \right) = X_i\beta\]
Interpretation
- Left-hand side: log-odds (logit)
- Right-hand side: linear predictor (same structure as OLS)
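Both steps can be verified in base R, where plogis() is the logistic CDF and qlogis() is the log-odds (logit) transform; the grid of linear-predictor values below is purely illustrative.

# Sketch: logistic CDF and its inverse in base R
xb <- seq(-4, 4, by = 2)   # illustrative values of the linear predictor X_i beta
p  <- plogis(xb)           # Step 1: P_i = exp(xb) / (1 + exp(xb)), always in (0, 1)
all.equal(qlogis(p), xb)   # Step 2: the log-odds transform recovers the linear predictor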
Bernoulli Distribution & Link Function
Random Component
\[ Y_i \sim \text{Bernoulli}(p_i) \]
Each \(Y_i\) can take only 0 or 1, with probability of success \(p_i = P(Y_i=1)\).
We assume each observation is independent and follows its own Bernoulli process.
Therefore,
\[ P(Y_i|p_i) = p_i^{Y_i}(1 - p_i)^{1-Y_i} \]
This single expression covers both cases:
if \(Y_i=1\), probability = \(p_i\); if \(Y_i=0\), probability = \(1-p_i\).
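This Bernoulli pmf is exactly what base R's dbinom() computes with size = 1; a quick sketch, where the success probability 0.7 is just an illustrative value.

# Sketch: Bernoulli pmf p^y * (1 - p)^(1 - y) via dbinom(size = 1)
p <- 0.7                       # illustrative success probability
dbinom(1, size = 1, prob = p)  # 0.7  (case Y_i = 1)
dbinom(0, size = 1, prob = p)  # 0.3  (case Y_i = 0)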
Systematic Component
\[ p_i = f(X_i\beta) \]
The probability of success depends on a set of explanatory variables \(X_i\) through a function \(f(\cdot)\).
We choose \(f\) to ensure predicted probabilities always stay between 0 and 1.
Link Function
\[ g(p_i) = X_i\beta \]
The link function \(g(\cdot)\) connects the mean of the Bernoulli variable to a linear predictor.
It transforms probabilities into an unbounded scale that we can model linearly.
where \(g(\cdot)\) is the logit function:
\[ g(p_i) = \log\!\left(\frac{p_i}{1-p_i}\right) \]
The log of odds maps \((0,1)\) → \((-\infty, +\infty)\),
making the relationship linear in parameters: \(\log(\text{odds}) = X_i\beta\).
This makes it a Generalized Linear Model (GLM):
- Family: Bernoulli (for binary data)
- Link: Logit (maps probabilities to real numbers)
- Mean: \(E[Y_i] = p_i\)
So logistic regression is just a GLM with a Bernoulli response and a logit link.
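R's binomial() family object bundles exactly these pieces, and glm() uses them internally. A minimal sketch, with 0.75 as an illustrative probability:

# Sketch: the GLM building blocks stored in R's family object
fam <- binomial(link = "logit")
fam$family            # "binomial": the random component
fam$linkfun(0.75)     # g(p) = log(0.75 / 0.25) ≈ 1.0986: the link function
fam$linkinv(1.0986)   # g^{-1}(eta) ≈ 0.75: back from the linear predictor to a probability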
Likelihood and Maximum Likelihood Estimation
Likelihood for One Observation
\[ L_i(\beta) = p_i^{Y_i}(1-p_i)^{1-Y_i} \]
The likelihood expresses how plausible the parameter \(\beta\) makes the observed outcome \(Y_i\).
If \(Y_i=1\), it equals \(p_i\); if \(Y_i=0\), it equals \(1-p_i\).
https://en.wikipedia.org/wiki/Bernoulli_distribution
Joint Likelihood (All Observations)
\[ L(\beta) = \prod_{i=1}^{n} p_i^{Y_i}(1-p_i)^{1-Y_i} \]
Assuming observations are independent, we multiply individual likelihoods.
This gives the probability of observing the entire dataset, given the model.
Log-Likelihood
\[ \ell(\beta) = \sum_{i=1}^{n} \left[Y_i \log(p_i) + (1-Y_i)\log(1-p_i)\right] \]
Taking logs converts products into sums and is numerically more stable.
Maximizing the log-likelihood gives the same solution as maximizing the original likelihood.
MLE Objective
\[ \hat{\beta} = \arg\max_\beta \ell(\beta) \]
The MLE chooses the \(\beta\) that makes the observed 0s and 1s most probable.
No closed-form solution exists → we use iterative optimization (Newton–Raphson / IRLS).
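The whole chain above can be reproduced in a few lines: write the log-likelihood, hand its negative to a general-purpose optimizer, and compare with glm(), which maximizes the same function via IRLS. This is a minimal sketch; the one-regressor specification am ~ mpg on mtcars is illustrative, not the lecture's main example.

# Sketch: maximize the Bernoulli log-likelihood directly and compare with glm()
neg_loglik <- function(beta, y, X) {
  p <- plogis(X %*% beta)                    # p_i = logistic(X_i beta)
  -sum(y * log(p) + (1 - y) * log(1 - p))    # minus the log-likelihood defined above
}
y <- mtcars$am
X <- cbind(1, mtcars$mpg)                    # intercept + mpg
fit_optim <- optim(c(0, 0), neg_loglik, y = y, X = X, method = "BFGS")
fit_glm   <- glm(am ~ mpg, data = mtcars, family = binomial)
rbind(optim = fit_optim$par, glm = coef(fit_glm))   # the two sets of estimates agree closely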
MLE in Broader Context
Comparing OLS and MLE
| Aspect | OLS | MLE |
|---|---|---|
| Objective | Minimize squared errors | Maximize likelihood of observed data |
| Assumptions | Linearity and exogeneity; normal errors for exact inference | A fully specified distribution for Y (e.g., Bernoulli) |
| Works for | Continuous Y | Any distribution (Bernoulli, Poisson, etc.) |
| Estimation | Closed form | Iterative |
| Historical origin | Gauss–Legendre (1800s) | Fisher (1920s) |
| Output | \(\hat{\beta}\), residual variance | \(\hat{\beta}\), log-likelihood, standard errors |
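The Estimation and Output rows of the table are easy to see in R; the am ~ mpg fit below is an illustrative assumption, not the lecture example.

# Sketch: MLE is iterative and reports a log-likelihood, unlike closed-form OLS
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
fit$iter             # number of Fisher scoring (IRLS) iterations
logLik(fit)          # maximized log-likelihood
coef(summary(fit))   # estimates with asymptotic standard errors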
Interpretation in Logit
Key Interpretations
- \(\beta_j\): Change in log-odds for a one-unit change in \(X_j\)
- \(e^{\beta_j}\): Odds ratio
- Marginal effects: Change in predicted probability for a unit change in \(X_j\) (evaluated at the mean of X or averaged across observations)
Example
If \(e^{\beta_1} = 1.5\):
→ Each extra year of education multiplies odds of being employed by 1.5
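In R, odds ratios are just the exponentiated coefficients. A minimal sketch; the model am ~ mpg is illustrative (mtcars has no education variable, so read mpg in place of years of education):

# Sketch: odds ratios and Wald confidence intervals from a fitted logit
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
exp(coef(fit))             # e^beta_j: multiplicative change in the odds per unit of X_j
exp(confint.default(fit))  # 95% Wald CIs transformed to the odds-ratio scale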
Model Assessment
| Tool | Purpose |
|---|---|
| Confusion matrix | Tabulate predicted vs. actual 0/1 outcomes at a chosen cutoff |
| Accuracy, Precision, Recall, F1 | Performance metrics |
| ROC curve, AUC | Tradeoff between true/false positives |
| Pseudo-R² | Goodness of fit (e.g., McFadden R²) |
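The confusion matrix, ROC curve, and AUC are all computed in the example code below; pseudo-R² is the one tool in this table that is not, so here is a minimal sketch of McFadden's version (the am ~ mpg fit is again illustrative).

# Sketch: McFadden's pseudo-R^2 = 1 - logLik(fitted model) / logLik(intercept-only model)
fit  <- glm(am ~ mpg, data = mtcars, family = binomial)
null <- glm(am ~ 1,   data = mtcars, family = binomial)
1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))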
Summary Comparison
| Concept | OLS | LPM | Logit |
|---|---|---|---|
| Outcome | Continuous | Binary | Binary |
| Estimation | Least Squares | Least Squares | Maximum Likelihood |
| Range of fitted values | ℝ | ℝ (can fall outside [0, 1]) | (0, 1) |
| Efficiency | BLUE (under Gauss–Markov) | Inefficient; invalid conventional SEs | Asymptotically efficient (MLE) |
| Interpretation | Marginal change in Y | Marginal change in P | Change in log-odds / odds ratio |
| Common use | Linear models | Simple binary outcomes | Modern discrete choice / binary models |
Example Code
# Data: built-in mtcars (am = 1 manual transmission, 0 automatic)
?mtcars
# 1. Linear Probability Model
lpm <- lm(am ~ mpg + cyl + disp + gear + vs,
  data = mtcars
)
summary(lpm)
Call:
lm(formula = am ~ mpg + cyl + disp + gear + vs, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-0.56766 -0.13390 0.00638 0.16812 0.51349
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.0862077 0.8791623 -1.236 0.227691
mpg 0.0307034 0.0167416 1.834 0.078131 .
cyl -0.0422230 0.0840822 -0.502 0.619779
disp -0.0004265 0.0010237 -0.417 0.680371
gear 0.3810048 0.0867819 4.390 0.000168 ***
vs -0.3878550 0.1813004 -2.139 0.041970 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2749 on 26 degrees of freedom
Multiple R-squared: 0.7454, Adjusted R-squared: 0.6965
F-statistic: 15.23 on 5 and 26 DF, p-value: 5.09e-07
# 2. Logistic Regression
logit <- glm(am ~ mpg + cyl + disp + gear + vs,
  data = mtcars,
  family = binomial(link = "logit")
)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit)
Call:
glm(formula = am ~ mpg + cyl + disp + gear + vs, family = binomial(link = "logit"),
data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 225.636 412601.332 0.001 1.000
mpg -6.290 7069.826 -0.001 0.999
cyl 5.619 38279.009 0.000 1.000
disp -1.677 1188.168 -0.001 0.999
gear 88.120 65804.055 0.001 0.999
vs -242.077 198825.189 -0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3230e+01 on 31 degrees of freedom
Residual deviance: 3.8107e-09 on 26 degrees of freedom
AIC: 12
Number of Fisher Scoring iterations: 25
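The two warnings, the residual deviance of essentially zero, and the enormous coefficients and standard errors are the classic signature of (quasi-)complete separation: with only 32 observations and five regressors, the covariates classify am perfectly, so the likelihood keeps rising as the coefficients grow and no finite MLE exists. The near-zero average marginal effects reported further below are a symptom of the same problem, since every fitted probability is already 0 or 1. Common remedies are a more parsimonious specification or penalized (Firth) logistic regression. A minimal sketch of the first option; the reduced formula am ~ mpg is an illustrative choice, not part of the original example.

# Sketch: a reduced specification that converges without separation
logit_small <- glm(am ~ mpg, data = mtcars, family = binomial(link = "logit"))
summary(logit_small)    # finite coefficients and standard errors, no warnings
exp(coef(logit_small))  # odds ratio for one extra mpg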
# 3. Compare predictions
mtcars$pred_lpm <- predict(lpm)
mtcars$pred_logit <- predict(logit, type = "response")
# 4. Average marginal effects (margins package)
library(margins)
?margins
margins(model = logit,
  data = find_data(logit)
)
Average marginal effects
glm(formula = am ~ mpg + cyl + disp + gear + vs, family = binomial(link = "logit"), data = mtcars)
mpg cyl disp gear vs
-3.748e-10 3.346e-10 -1.009e-10 5.264e-09 -1.443e-08
Compare Predictions & Evaluate Model Performance
# Predicted probabilities
mtcars$pred_lpm <- predict(lpm)
mtcars$pred_logit <- predict(logit, type = "response")
# Clamp LPM predictions to [0,1] so they can be treated as probabilities
mtcars$pred_lpm <- pmin(pmax(mtcars$pred_lpm, 0), 1)
# --- 1. Confusion Matrices ---
# Threshold (you can adjust)
thresh <- 0.5
# Predicted classes
mtcars$pred_lpm_class <- ifelse(mtcars$pred_lpm >= thresh, 1, 0)
mtcars$pred_logit_class <- ifelse(mtcars$pred_logit >= thresh, 1, 0)
# Confusion matrix for LPM
cat("\n=== Confusion Matrix: Linear Probability Model ===\n")
=== Confusion Matrix: Linear Probability Model ===
table(Predicted = mtcars$pred_lpm_class, Actual = mtcars$am)
         Actual
Predicted  0  1
        0 17  1
        1  2 12
# Confusion matrix for Logit
cat("\n=== Confusion Matrix: Logistic Regression ===\n")
=== Confusion Matrix: Logistic Regression ===
table(Predicted = mtcars$pred_logit_class, Actual = mtcars$am)
         Actual
Predicted  0  1
        0 19  0
        1  0 13
# Classification reports with caret::confusionMatrix()
library(caret)
Loading required package: lattice
cat("\n=== LPM Classification Report ===\n")
=== LPM Classification Report ===
confusionMatrix(as.factor(mtcars$pred_lpm_class), as.factor(mtcars$am), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 17  1
         1  2 12
Accuracy : 0.9062
95% CI : (0.7498, 0.9802)
No Information Rate : 0.5938
P-Value [Acc > NIR] : 0.000105
Kappa : 0.808
Mcnemar's Test P-Value : 1.000000
Sensitivity : 0.9231
Specificity : 0.8947
Pos Pred Value : 0.8571
Neg Pred Value : 0.9444
Prevalence : 0.4062
Detection Rate : 0.3750
Detection Prevalence : 0.4375
Balanced Accuracy : 0.9089
'Positive' Class : 1
cat("\n=== Logit Classification Report ===\n")
=== Logit Classification Report ===
confusionMatrix(as.factor(mtcars$pred_logit_class), as.factor(mtcars$am), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 19  0
         1  0 13
Accuracy : 1
95% CI : (0.8911, 1)
No Information Rate : 0.5938
P-Value [Acc > NIR] : 5.693e-08
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.4062
Detection Rate : 0.4062
Detection Prevalence : 0.4062
Balanced Accuracy : 1.0000
'Positive' Class : 1
# ROC curves and AUC with the pROC package
library(pROC)
Type 'citation("pROC")' for a citation.
Attaching package: 'pROC'
The following objects are masked from 'package:stats':
    cov, smooth, var
# ROC curve objects
roc_lpm <- roc(mtcars$am, mtcars$pred_lpm)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
roc_logit <- roc(mtcars$am, mtcars$pred_logit)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# Plot ROC curves
plot(roc_lpm, col = "blue", lwd = 2, main = "ROC Curve: LPM vs Logit",
     legacy.axes = TRUE)
plot(roc_logit, col = "red", lwd = 2, add = TRUE)
abline(a = 0, b = 1, lty = 2, col = "gray")
legend("bottomright",
       legend = c(
         paste("LPM (AUC =", round(auc(roc_lpm), 3), ")"),
         paste("Logit (AUC =", round(auc(roc_logit), 3), ")")
       ),
       col = c("blue", "red"), lwd = 2)
# Print AUC values
cat("AUC (LPM):", round(auc(roc_lpm), 3), "\n")
AUC (LPM): 0.988
cat("AUC (Logit):", round(auc(roc_logit), 3), "\n")
AUC (Logit): 1
Confusion Matrix: counts correct vs. incorrect classifications at a chosen cutoff (here 0.5).
ROC Curve: plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 − Specificity) across all possible cutoffs, so it shows how well the model discriminates overall.
AUC (Area Under the Curve): a single-number summary of that discrimination.
- 0.5 → random guessing
- 1.0 → perfect prediction
- In this example the logit yields a higher AUC than the LPM; its AUC of exactly 1 reflects the perfect separation noted above.
Extension Topics
Optional Advanced Material
- Probit model: Logit’s sibling (Normal CDF instead of logistic); see the sketch after this list
- Multinomial Logit: Categorical Y with >2 outcomes
- Conditional Logit: McFadden’s model for discrete choice (transport, job, etc.)
- Log-likelihood ratio tests, AIC/BIC
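In R, the probit model listed above differs from the logit by a single argument; a minimal sketch using the same illustrative am ~ mpg specification:

# Sketch: probit = the same GLM with a Normal-CDF link instead of the logistic
probit_fit <- glm(am ~ mpg, data = mtcars, family = binomial(link = "probit"))
coef(probit_fit)   # coefficients are on the probit scale, so smaller in magnitude than logit coefficients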
Closing Thought
“OLS taught us how to estimate a mean.
MLE and Logit taught us how to estimate a probability—
and how to think in terms of likelihood, not just distance from a line.”