# Sample visualization code
library(ggplot2)
# Generate data
set.seed(123)
x <- seq(-5, 5, length.out = 100)
lpm <- 0.5 + 0.1 * x
logit <- 1 / (1 + exp(-x))
# Plot
ggplot() +
  geom_line(aes(x = x, y = lpm, color = "LPM"), linewidth = 1) +
  geom_line(aes(x = x, y = logit, color = "Logit"), linewidth = 1) +
  ylim(c(-0.2, 1.2)) +
  labs(x = "X", y = "P(Y=1)", color = "Model") +
  theme_minimal()

From Linear Models to Logistic Regression
A Complete Guide for Econometrics & Applied Data Analysis
Motivation: Why Move Beyond OLS?
Goal: Transition students from the familiar (OLS) to the new (Logit).
Key Ideas
- In OLS, \(Y_i\) is continuous; in binary models, \(Y_i \in \{0,1\}\)
- We’re now modeling probabilities: \(P(Y_i=1|X_i)\)
- The OLS model (Linear Probability Model, LPM) can handle this—but with major problems
Example
\(Y_i = 1\) if Employed, \(0\) if Not; \(X_i = \text{Years of Education}\)
Discussion Questions
- What happens if predicted \(Y_i < 0\) or \(> 1\)?
- Why does OLS assume linearity when probabilities are naturally curved?
Historical Context: OLS → LPM → Logit
| Era | Method | Key Figures | Motivation |
|---|---|---|---|
| 1800s | OLS | Gauss & Legendre | Fitting continuous data (planetary orbits) |
| 1940s | Linear Probability Model (LPM) | Early econometricians | Simple, intuitive, computationally cheap |
| 1944 | Logit introduced | Joseph Berkson | Bioassay (dose-response data) |
| 1970s | MLE becomes mainstream | Fisher, McFadden, Heckman | Computers enable iterative estimation |
Discussion prompt: Why did econometrics stay with OLS for so long even after MLE was discovered?
The Linear Probability Model (LPM)
Model Specification
\[P(Y_i=1|X_i) = \beta_0 + \beta_1 X_i\]
Pros
- Simple interpretation (\(\beta_1 = \Delta P/\Delta X\))
- Estimated by OLS
Cons
- Predicted probabilities < 0 or > 1
- Constant marginal effects
- Heteroskedastic errors by construction: \(\text{Var}(Y_i|X_i) = P_i(1-P_i)\)
- Inefficient, with invalid conventional standard errors (robust SEs are only a partial fix; see the sketch below)
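Two of these problems can be checked directly. The following is a minimal sketch, assuming the sandwich and lmtest packages are installed; the specification am ~ mpg + hp on the built-in mtcars data is purely illustrative.

# Sketch: inspect LPM fitted values and compute heteroskedasticity-robust SEs
library(sandwich)
library(lmtest)

lpm_demo <- lm(am ~ mpg + hp, data = mtcars)   # am is a 0/1 outcome

range(fitted(lpm_demo))   # check whether any fitted "probabilities" fall outside [0, 1]

# Heteroskedasticity-robust (HC1) standard errors as a partial remedy
coeftest(lpm_demo, vcov. = vcovHC(lpm_demo, type = "HC1"))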
Visual Comparison
The sample visualization code at the top of these notes plots the two fits on one set of axes: the LPM is a straight line, which escapes the \([0,1]\) range once \(X\) is extreme enough, while the logit is an S-shaped curve bounded between 0 and 1.
Introducing the Logit Model
Step 1: Define the Probability Model
\[P_i = P(Y_i = 1 | X_i) = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}\]
Step 2: Transform to Linear Form
Take log-odds:
\[\log\left( \frac{P_i}{1 - P_i} \right) = X_i\beta\]
Interpretation
- Left-hand side: log-odds (logit)
- Right-hand side: linear predictor (same structure as OLS)
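Both steps can be verified in base R, where plogis() is the logistic CDF and qlogis() is the log-odds (logit) transform; the grid of linear-predictor values below is purely illustrative.

# Sketch: logistic CDF and its inverse in base R
xb <- seq(-4, 4, by = 2)   # illustrative values of the linear predictor X_i beta
p  <- plogis(xb)           # Step 1: P_i = exp(xb) / (1 + exp(xb)), always in (0, 1)
all.equal(qlogis(p), xb)   # Step 2: the log-odds transform recovers the linear predictor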
Bernoulli Distribution & Link Function
Random Component
\[ Y_i \sim \text{Bernoulli}(p_i) \]
Each \(Y_i\) can take only 0 or 1, with probability of success \(p_i = P(Y_i=1)\).
We assume each observation is independent and follows its own Bernoulli process.
Therefore,
\[ P(Y_i|p_i) = p_i^{Y_i}(1 - p_i)^{1-Y_i} \]
This single expression covers both cases:
if \(Y_i=1\), probability = \(p_i\); if \(Y_i=0\), probability = \(1-p_i\).
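This Bernoulli pmf is exactly what base R's dbinom() computes with size = 1; a quick sketch, where the success probability 0.7 is just an illustrative value.

# Sketch: Bernoulli pmf p^y * (1 - p)^(1 - y) via dbinom(size = 1)
p <- 0.7                       # illustrative success probability
dbinom(1, size = 1, prob = p)  # 0.7  (case Y_i = 1)
dbinom(0, size = 1, prob = p)  # 0.3  (case Y_i = 0)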
Systematic Component
\[ p_i = f(X_i\beta) \]
The probability of success depends on a set of explanatory variables \(X_i\) through a function \(f(\cdot)\).
We choose \(f\) to ensure predicted probabilities always stay between 0 and 1.
Link Function
\[ g(p_i) = X_i\beta \]
The link function \(g(\cdot)\) connects the mean of the Bernoulli variable to a linear predictor.
It transforms probabilities into an unbounded scale that we can model linearly.
where \(g(\cdot)\) is the logit function:
\[ g(p_i) = \log\!\left(\frac{p_i}{1-p_i}\right) \]
The log of odds maps \((0,1)\) → \((-\infty, +\infty)\),
making the relationship linear in parameters: \(\log(\text{odds}) = X_i\beta\).
This makes it a Generalized Linear Model (GLM):
- Family: Bernoulli (for binary data)
- Link: Logit (maps probabilities to real numbers)
- Mean: \(E[Y_i] = p_i\)
So logistic regression is just a GLM with a Bernoulli response and a logit link.
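R's binomial() family object bundles exactly these pieces, and glm() uses them internally. A minimal sketch, with 0.75 as an illustrative probability:

# Sketch: the GLM building blocks stored in R's family object
fam <- binomial(link = "logit")
fam$family            # "binomial": the random component
fam$linkfun(0.75)     # g(p) = log(0.75 / 0.25) ≈ 1.0986: the link function
fam$linkinv(1.0986)   # g^{-1}(eta) ≈ 0.75: back from the linear predictor to a probability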
Likelihood and Maximum Likelihood Estimation
Likelihood for One Observation
\[ L_i(\beta) = p_i^{Y_i}(1-p_i)^{1-Y_i} \]
The likelihood expresses how plausible the parameter \(\beta\) makes the observed outcome \(Y_i\).
If \(Y_i=1\), it equals \(p_i\); if \(Y_i=0\), it equals \(1-p_i\).
https://en.wikipedia.org/wiki/Bernoulli_distribution
Joint Likelihood (All Observations)
\[ L(\beta) = \prod_{i=1}^{n} p_i^{Y_i}(1-p_i)^{1-Y_i} \]
Assuming observations are independent, we multiply individual likelihoods.
This gives the probability of observing the entire dataset, given the model.
Log-Likelihood
\[ \ell(\beta) = \sum_{i=1}^{n} \left[Y_i \log(p_i) + (1-Y_i)\log(1-p_i)\right] \]
Taking logs converts products into sums and is numerically more stable.
Maximizing the log-likelihood gives the same solution as maximizing the original likelihood.
MLE Objective
\[ \hat{\beta} = \arg\max_\beta \ell(\beta) \]
The MLE chooses the \(\beta\) that makes the observed 0s and 1s most probable.
No closed-form solution exists → we use iterative optimization (Newton–Raphson / IRLS).
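The whole chain above can be reproduced in a few lines: write the log-likelihood, hand its negative to a general-purpose optimizer, and compare with glm(), which maximizes the same function via IRLS. This is a minimal sketch; the one-regressor specification am ~ mpg on mtcars is illustrative, not the lecture's main example.

# Sketch: maximize the Bernoulli log-likelihood directly and compare with glm()
neg_loglik <- function(beta, y, X) {
  p <- plogis(X %*% beta)                    # p_i = logistic(X_i beta)
  -sum(y * log(p) + (1 - y) * log(1 - p))    # minus the log-likelihood defined above
}
y <- mtcars$am
X <- cbind(1, mtcars$mpg)                    # intercept + mpg
fit_optim <- optim(c(0, 0), neg_loglik, y = y, X = X, method = "BFGS")
fit_glm   <- glm(am ~ mpg, data = mtcars, family = binomial)
rbind(optim = fit_optim$par, glm = coef(fit_glm))   # the two sets of estimates agree closely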
MLE in Broader Context
Comparing OLS and MLE
| Aspect | OLS | MLE |
|---|---|---|
| Objective | Minimize squared errors | Maximize likelihood of observed data |
| Assumptions | Linearity and exogeneity; normal errors for exact inference | A fully specified distribution for Y (e.g., Bernoulli) |
| Works for | Continuous Y | Any distribution (Bernoulli, Poisson, etc.) |
| Estimation | Closed form | Iterative |
| Historical origin | Gauss–Legendre (1800s) | Fisher (1920s) |
| Output | \(\hat{\beta}\), residual variance | \(\hat{\beta}\), log-likelihood, standard errors |
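The Estimation and Output rows of the table are easy to see in R; the am ~ mpg fit below is an illustrative assumption, not the lecture example.

# Sketch: MLE is iterative and reports a log-likelihood, unlike closed-form OLS
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
fit$iter             # number of Fisher scoring (IRLS) iterations
logLik(fit)          # maximized log-likelihood
coef(summary(fit))   # estimates with asymptotic standard errors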
Interpretation in Logit
Key Interpretations
- \(\beta_j\): Change in log-odds for a one-unit change in \(X_j\)
- \(e^{\beta_j}\): Odds ratio
- Marginal effects: Change in predicted probability for a unit change in \(X_j\) (evaluated at the mean of X or averaged across observations)
Example
If \(e^{\beta_1} = 1.5\):
→ Each extra year of education multiplies odds of being employed by 1.5
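In R, odds ratios are just the exponentiated coefficients. A minimal sketch; the model am ~ mpg is illustrative (mtcars has no education variable, so read mpg in place of years of education):

# Sketch: odds ratios and Wald confidence intervals from a fitted logit
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
exp(coef(fit))             # e^beta_j: multiplicative change in the odds per unit of X_j
exp(confint.default(fit))  # 95% Wald CIs transformed to the odds-ratio scale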
Model Assessment
| Tool | Purpose |
|---|---|
| Confusion matrix | Tabulate predicted vs. actual 0/1 outcomes at a chosen cutoff |
| Accuracy, Precision, Recall, F1 | Performance metrics |
| ROC curve, AUC | Tradeoff between true/false positives |
| Pseudo-R² | Goodness of fit (e.g., McFadden R²) |
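The confusion matrix, ROC curve, and AUC are all computed in the example code below; pseudo-R² is the one tool in this table that is not, so here is a minimal sketch of McFadden's version (the am ~ mpg fit is again illustrative).

# Sketch: McFadden's pseudo-R^2 = 1 - logLik(fitted model) / logLik(intercept-only model)
fit  <- glm(am ~ mpg, data = mtcars, family = binomial)
null <- glm(am ~ 1,   data = mtcars, family = binomial)
1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))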
Summary Comparison
| Concept | OLS | LPM | Logit |
|---|---|---|---|
| Outcome | Continuous | Binary | Binary |
| Estimation | Least Squares | Least Squares | Maximum Likelihood |
| Range of fitted values | ℝ | ℝ (can fall outside [0, 1]) | (0, 1) |
| Efficiency | BLUE (under Gauss–Markov) | Inefficient; invalid conventional SEs | Asymptotically efficient (MLE) |
| Interpretation | Marginal change in Y | Marginal change in P | Change in log-odds / odds ratio |
| Common use | Linear models | Simple binary outcomes | Modern discrete choice / binary models |
Example Code
# Data: built-in mtcars (am = 1 manual transmission, 0 automatic)
?mtcars
# 1. Linear Probability Model
lpm <- lm(am ~ mpg + cyl + disp + gear + vs,
  data = mtcars
)
summary(lpm)
Call:
lm(formula = am ~ mpg + cyl + disp + gear + vs, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-0.56766 -0.13390 0.00638 0.16812 0.51349
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.0862077 0.8791623 -1.236 0.227691
mpg 0.0307034 0.0167416 1.834 0.078131 .
cyl -0.0422230 0.0840822 -0.502 0.619779
disp -0.0004265 0.0010237 -0.417 0.680371
gear 0.3810048 0.0867819 4.390 0.000168 ***
vs -0.3878550 0.1813004 -2.139 0.041970 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2749 on 26 degrees of freedom
Multiple R-squared: 0.7454, Adjusted R-squared: 0.6965
F-statistic: 15.23 on 5 and 26 DF, p-value: 5.09e-07
# 2. Logistic Regression
logit <- glm(am ~ mpg + cyl + disp + gear + vs,
  data = mtcars,
  family = binomial(link = "logit")
)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit)
Call:
glm(formula = am ~ mpg + cyl + disp + gear + vs, family = binomial(link = "logit"),
data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 225.636 412601.332 0.001 1.000
mpg -6.290 7069.826 -0.001 0.999
cyl 5.619 38279.009 0.000 1.000
disp -1.677 1188.168 -0.001 0.999
gear 88.120 65804.055 0.001 0.999
vs -242.077 198825.189 -0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3230e+01 on 31 degrees of freedom
Residual deviance: 3.8107e-09 on 26 degrees of freedom
AIC: 12
Number of Fisher Scoring iterations: 25
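The two warnings, the residual deviance of essentially zero, and the enormous coefficients and standard errors are the classic signature of (quasi-)complete separation: with only 32 observations and five regressors, the covariates classify am perfectly, so the likelihood keeps rising as the coefficients grow and no finite MLE exists. The near-zero average marginal effects reported further below are a symptom of the same problem, since every fitted probability is already 0 or 1. Common remedies are a more parsimonious specification or penalized (Firth) logistic regression. A minimal sketch of the first option; the reduced formula am ~ mpg is an illustrative choice, not part of the original example.

# Sketch: a reduced specification that converges without separation
logit_small <- glm(am ~ mpg, data = mtcars, family = binomial(link = "logit"))
summary(logit_small)    # finite coefficients and standard errors, no warnings
exp(coef(logit_small))  # odds ratio for one extra mpg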
# 3. Compare predictions
mtcars$pred_lpm <- predict(lpm)
mtcars$pred_logit <- predict(logit, type = "response")
# 4. Average marginal effects (margins package)
library(margins)
?margins
margins(model = logit,
  data = find_data(logit)
)
Average marginal effects
glm(formula = am ~ mpg + cyl + disp + gear + vs, family = binomial(link = "logit"), data = mtcars)
mpg cyl disp gear vs
-3.748e-10 3.346e-10 -1.009e-10 5.264e-09 -1.443e-08
Compare Predictions & Evaluate Model Performance
# Predicted probabilities
mtcars$pred_lpm <- predict(lpm)
mtcars$pred_logit <- predict(logit, type = "response")
# Clamp LPM predictions to [0,1] so they can be treated as probabilities
mtcars$pred_lpm <- pmin(pmax(mtcars$pred_lpm, 0), 1)
# --- 1. Confusion Matrices ---
# Threshold (you can adjust)
thresh <- 0.5
# Predicted classes
mtcars$pred_lpm_class <- ifelse(mtcars$pred_lpm >= thresh, 1, 0)
mtcars$pred_logit_class <- ifelse(mtcars$pred_logit >= thresh, 1, 0)
# Confusion matrix for LPM
cat("\n=== Confusion Matrix: Linear Probability Model ===\n")
=== Confusion Matrix: Linear Probability Model ===
table(Predicted = mtcars$pred_lpm_class, Actual = mtcars$am)
         Actual
Predicted  0  1
        0 17  1
        1  2 12
# Confusion matrix for Logit
cat("\n=== Confusion Matrix: Logistic Regression ===\n")
=== Confusion Matrix: Logistic Regression ===
table(Predicted = mtcars$pred_logit_class, Actual = mtcars$am)
         Actual
Predicted  0  1
        0 19  0
        1  0 13
# Classification reports with caret::confusionMatrix()
library(caret)
Loading required package: lattice
cat("\n=== LPM Classification Report ===\n")
=== LPM Classification Report ===
confusionMatrix(as.factor(mtcars$pred_lpm_class), as.factor(mtcars$am), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 17  1
         1  2 12
Accuracy : 0.9062
95% CI : (0.7498, 0.9802)
No Information Rate : 0.5938
P-Value [Acc > NIR] : 0.000105
Kappa : 0.808
Mcnemar's Test P-Value : 1.000000
Sensitivity : 0.9231
Specificity : 0.8947
Pos Pred Value : 0.8571
Neg Pred Value : 0.9444
Prevalence : 0.4062
Detection Rate : 0.3750
Detection Prevalence : 0.4375
Balanced Accuracy : 0.9089
'Positive' Class : 1
cat("\n=== Logit Classification Report ===\n")
=== Logit Classification Report ===
confusionMatrix(as.factor(mtcars$pred_logit_class), as.factor(mtcars$am), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 19  0
         1  0 13
Accuracy : 1
95% CI : (0.8911, 1)
No Information Rate : 0.5938
P-Value [Acc > NIR] : 5.693e-08
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.4062
Detection Rate : 0.4062
Detection Prevalence : 0.4062
Balanced Accuracy : 1.0000
'Positive' Class : 1
# ROC curves and AUC with the pROC package
library(pROC)
Type 'citation("pROC")' for a citation.
Attaching package: 'pROC'
The following objects are masked from 'package:stats':
    cov, smooth, var
# ROC curve objects
roc_lpm <- roc(mtcars$am, mtcars$pred_lpm)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
roc_logit <- roc(mtcars$am, mtcars$pred_logit)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
# Plot ROC curves
plot(roc_lpm, col = "blue", lwd = 2, main = "ROC Curve: LPM vs Logit",
     legacy.axes = TRUE)
plot(roc_logit, col = "red", lwd = 2, add = TRUE)
abline(a = 0, b = 1, lty = 2, col = "gray")
legend("bottomright",
       legend = c(
         paste("LPM (AUC =", round(auc(roc_lpm), 3), ")"),
         paste("Logit (AUC =", round(auc(roc_logit), 3), ")")
       ),
       col = c("blue", "red"), lwd = 2)
# Print AUC values
cat("AUC (LPM):", round(auc(roc_lpm), 3), "\n")
AUC (LPM): 0.988
cat("AUC (Logit):", round(auc(roc_logit), 3), "\n")
AUC (Logit): 1
Confusion Matrix: counts correct vs. incorrect classifications at a chosen cutoff (here 0.5).
ROC Curve: plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 − Specificity) across all possible cutoffs, so it shows how well the model discriminates overall.
AUC (Area Under the Curve): a single-number summary of that discrimination.
- 0.5 → random guessing
- 1.0 → perfect prediction
- In this example the logit yields a higher AUC than the LPM; its AUC of exactly 1 reflects the perfect separation noted above.
Extension Topics
Optional Advanced Material
- Probit model: Logit’s sibling (Normal CDF instead of logistic); see the sketch after this list
- Multinomial Logit: Categorical Y with >2 outcomes
- Conditional Logit: McFadden’s model for discrete choice (transport, job, etc.)
- Log-likelihood ratio tests, AIC/BIC
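In R, the probit model listed above differs from the logit by a single argument; a minimal sketch using the same illustrative am ~ mpg specification:

# Sketch: probit = the same GLM with a Normal-CDF link instead of the logistic
probit_fit <- glm(am ~ mpg, data = mtcars, family = binomial(link = "probit"))
coef(probit_fit)   # coefficients are on the probit scale, so smaller in magnitude than logit coefficients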
Closing Thought
“OLS taught us how to estimate a mean.
MLE and Logit taught us how to estimate a probability—
and how to think in terms of likelihood, not just distance from a line.”