OLS regression relies on the exogeneity
assumption:
\[
\text{Cov}(X, \varepsilon) = 0
\]
This means:
- Your explanatory variables (X) must not
correlate with unobserved factors in the error term
(ε).
- If they do, OLS estimates become biased (wrong on average) and inconsistent (they do not converge to the true value even with large datasets).
- Inference becomes invalid: confidence intervals and hypothesis tests are unreliable.
Substituting \(y = X\beta + \varepsilon\) into the OLS formula \(\hat{\beta} = (X'X)^{-1}X'y\) gives:
\[
\hat{\beta} = \beta +
\underbrace{(X'X)^{-1}X'\varepsilon}_{\text{Bias term}}
\]
If \(X\) and \(\varepsilon\) are correlated, the bias term doesn’t vanish, even as \(n \to \infty\).
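One way to make this explicit is to scale both factors by \(1/n\). This is a sketch that assumes a law of large numbers applies and that the limit matrix \(Q\) (a symbol introduced here, not in the text above) is invertible:
\[
\hat{\beta} = \beta + \left(\tfrac{1}{n}X'X\right)^{-1}\left(\tfrac{1}{n}X'\varepsilon\right)
\;\xrightarrow{p}\; \beta + Q^{-1}\gamma,
\qquad Q = \operatorname{plim}\tfrac{1}{n}X'X,\quad \gamma = \operatorname{plim}\tfrac{1}{n}X'\varepsilon
\]
Under exogeneity \(\gamma = 0\), so \(\hat{\beta}\) converges to \(\beta\). When \(X\) and \(\varepsilon\) are correlated, \(\gamma \neq 0\), and more data only tightens the estimate around the wrong value.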
Example: Suppose we estimate:
\[
\text{GPA} = \beta_0 + \beta_1 \text{StudyHours} + \varepsilon
\]
- If ε includes “natural ability,” and smarter students
study more, then:
\[
\text{Cov}(\text{StudyHours}, \varepsilon) > 0
\]
OLS attributes both study effort and ability to β₁.
Result: OLS overestimates the effect of studying because it conflates study hours with innate ability.
Analogy: Using a thermometer affected by sunlight → biased temperature readings.
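The GPA example can be illustrated with a small simulation. This is only a sketch on made-up data: the true effect (0.5), the strength of the ability channel (0.8), and all variable names are assumptions chosen for illustration, not values from any dataset.

```python
import random

random.seed(0)
n = 5000
TRUE_BETA1 = 0.5  # assumed true effect of one extra study hour on GPA

# Unobserved ability; it is never handed to the regression.
ability = [random.gauss(0, 1) for _ in range(n)]
# Smarter students study more, so hours rise with ability.
hours = [2.0 + 0.8 * a + random.gauss(0, 1) for a in ability]
# Ability also raises GPA directly, so it sits inside the error term.
gpa = [6.0 + TRUE_BETA1 * h + 1.0 * a + random.gauss(0, 0.5)
       for h, a in zip(hours, ability)]


def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Sample covariance of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


# Simple-regression slope: Cov(x, y) / Var(x).
beta1_ols = cov(hours, gpa) / cov(hours, hours)

print(f"true beta1 = {TRUE_BETA1}")
print(f"OLS beta1  = {beta1_ols:.3f}")
```

In this setup the large-sample OLS slope is 0.5 + 0.8/1.64 ≈ 0.99, roughly double the assumed true effect, and increasing n does not close the gap.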
Example: Model student grades using attendance at lectures.
Research Question: Does taking a preparatory math course (participation) improve GPA in an engineering MOOC?
The Data (from TrainExer45.gdt):
Variable | Description | Role |
---|---|---|
GPA | Grade Point Average (0-10) | Outcome |
Participation | 1 if took prep course (voluntary) | Endogenous X |
Gender | Control variable | Exogenous |
Email | 1 if received invitation | Instrument (Z) |
In our GPA study:
- Prep course participation is voluntary (self-selection).
- Motivated students (high ε) are more likely to participate.
- OLS conflates the course effect with student motivation.
Analogy: Measuring a drug’s effect when only healthy patients take it.
A valid instrument Z must satisfy:
1. Relevance: Correlated with the endogenous X (participation).
   - Test: Strong first-stage relationship (F-stat > 10).
2. Exogeneity: Uncorrelated with ε (affects y only through X).
   - No statistical test; this must be argued conceptually.
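- **Nature**: A random technical issue caused some students to **not receive** the invitation.
- **Why valid?** Because the failure was random, receiving the invitation is plausibly unrelated to student motivation (exogeneity), and an invitation plausibly makes participation more likely (relevance).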
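In symbols, the two requirements on Z read as follows for the Email instrument. The second display is the IV estimator for the simplified case of a single instrument and no controls (an assumption made here for readability; the model in this example also includes Gender):
\[
\text{Cov}(\text{Email}, \text{Participation}) \neq 0, \qquad \text{Cov}(\text{Email}, \varepsilon) = 0
\]
\[
\hat{\beta}_{IV} = \frac{\widehat{\text{Cov}}(\text{Email}, \text{GPA})}{\widehat{\text{Cov}}(\text{Email}, \text{Participation})}
\]
The denominator shows why relevance matters: if Email barely moves Participation, the ratio divides by something close to zero and becomes unstable.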
Endogeneity Risk:
- Motivated students (high ε) are more likely to take the course.
- OLS conflates the course effect with the effect of motivation.
Model → Ordinary Least Squares
- Dependent variable: GPA
- Regressors: const Gender Participation

Interpretation: The Participation coefficient is likely overestimated due to self-selection.
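Model → Ordinary Least Squares (the first stage is itself an ordinary regression of the endogenous variable on the instrument and controls)
- Dependent variable: Participation
- Regressors: const Gender Email
- Save the residuals as V

Key Output:
- Check if Email is significant (t-stat > 2)
- F-statistic should be > 10 (weak instrument test)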
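The relevance check can be sketched in a few lines of Python. The data are simulated, not taken from TrainExer45.gdt, the participation rule and its coefficients are assumptions, and Gender is left out so that a one-regressor formula suffices; `simple_ols` is a helper written for this illustration.

```python
import math
import random

random.seed(1)
n = 2000

# Email is assigned at random (the technical glitch), independent of motivation.
email = [float(random.randint(0, 1)) for _ in range(n)]
motivation = [random.gauss(0, 1) for _ in range(n)]  # unobserved
# Students join when a latent index is positive; the invitation raises it.
participation = [
    1.0 if -0.5 + 1.0 * e + 0.8 * m + random.gauss(0, 1) > 0 else 0.0
    for e, m in zip(email, motivation)
]


def simple_ols(x, y):
    """Regress y on a constant and x; return (intercept, slope, slope_se)."""
    size = len(x)
    mx, my = sum(x) / size, sum(y) / size
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    slope_se = math.sqrt(rss / (size - 2) / sxx)
    return intercept, slope, slope_se


_, pi_email, se_email = simple_ols(email, participation)
t_stat = pi_email / se_email
f_stat = t_stat ** 2  # with one instrument, the first-stage F equals t squared

print(f"first-stage coefficient on Email = {pi_email:.3f}")
print(f"t = {t_stat:.2f}, F = {f_stat:.1f}, passes F > 10: {f_stat > 10}")
```

With a single instrument the two rules of thumb are closely linked: F > 10 corresponds to |t| above roughly 3.2, which is stricter than t > 2.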
Using the saved OLS residuals (e_OLS):

Model → Ordinary Least Squares
- Dependent variable: e_OLS
- Regressors: const Gender Participation V

If V is significant (p < 0.05), OLS is biased.

Intuition: V captures the “self-selection” part of participation that correlates with GPA errors.
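The three regressions behind this test can be sketched in plain Python. Everything here is an illustrative assumption: the data are simulated with endogeneity built in through an unobserved `motivation` variable, the coefficients are made up, and `ols` is a small helper written for this example rather than anything gretl exposes.

```python
import math
import random


def ols(y, rows):
    """OLS of y on the columns of `rows`; returns (coefs, std_errors, residuals)."""
    n, k = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Invert X'X by Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [xtx[i] + [float(i == j) for j in range(k)] for i in range(k)]
    for c in range(k):
        pivot_row = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                factor = aug[r][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    coefs = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(coefs, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    std_errors = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return coefs, std_errors, resid


random.seed(2)
n = 4000
gender = [float(random.randint(0, 1)) for _ in range(n)]
email = [float(random.randint(0, 1)) for _ in range(n)]
motivation = [random.gauss(0, 1) for _ in range(n)]  # unobserved
participation = [1.0 if -0.5 + e + 0.8 * m + random.gauss(0, 1) > 0 else 0.0
                 for e, m in zip(email, motivation)]
gpa = [6.0 + 0.2 * g + 0.3 * p + 0.8 * m + random.gauss(0, 0.5)
       for g, p, m in zip(gender, participation, motivation)]

# Step 1: OLS of GPA on const, Gender, Participation; keep residuals e_OLS.
_, _, e_ols = ols(gpa, [[1.0, g, p] for g, p in zip(gender, participation)])
# Step 2: first stage of Participation on const, Gender, Email; keep residuals V.
_, _, v = ols(participation, [[1.0, g, e] for g, e in zip(gender, email)])
# Step 3: regress e_OLS on const, Gender, Participation, V and test V.
coefs, ses, _ = ols(e_ols, [[1.0, g, p, vi]
                            for g, p, vi in zip(gender, participation, v)])
t_v = coefs[3] / ses[3]

print(f"coefficient on V = {coefs[3]:.3f}, t = {t_v:.2f}")
print("endogeneity detected" if abs(t_v) > 1.96 else "no evidence of endogeneity")
```

Because motivation drives both participation and GPA in this simulation, the coefficient on V should come out clearly positive. On real data the sign and size are an empirical matter.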
Model → Two-Stage Least Squares
- Dependent variable: GPA
- Regressors: const Gender Participation
- Instruments: const Gender Email
Interpretation: The 2SLS estimate of the course effect is much smaller than OLS suggested!
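To see the two stages at work, here is a minimal sketch on simulated data. The true effect of 0.3, the participation rule, and the helper `fit_line` are all assumptions made for illustration, and Gender is omitted to keep each stage a one-regressor fit.

```python
import random

random.seed(3)
n = 5000
TRUE_EFFECT = 0.3  # assumed true effect of the prep course on GPA

email = [float(random.randint(0, 1)) for _ in range(n)]  # random instrument
motivation = [random.gauss(0, 1) for _ in range(n)]      # unobserved
participation = [1.0 if -0.5 + e + 0.8 * m + random.gauss(0, 1) > 0 else 0.0
                 for e, m in zip(email, motivation)]
gpa = [6.0 + TRUE_EFFECT * p + 0.8 * m + random.gauss(0, 0.5)
       for p, m in zip(participation, motivation)]


def fit_line(x, y):
    """Regress y on a constant and x; return (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Naive OLS: GPA on Participation.
_, beta_ols = fit_line(participation, gpa)

# Stage 1: predict Participation from Email.
a1, b1 = fit_line(email, participation)
participation_hat = [a1 + b1 * e for e in email]
# Stage 2: regress GPA on the predicted participation.
_, beta_2sls = fit_line(participation_hat, gpa)

print(f"true effect = {TRUE_EFFECT}")
print(f"OLS  = {beta_ols:.3f}")
print(f"2SLS = {beta_2sls:.3f}")
```

Running the second stage by hand gives the right point estimate but not valid standard errors, which is one reason to use a dedicated 2SLS routine such as the gretl menu entry above for inference.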
Task | Menu Path | Key Check |
---|---|---|
OLS | Model → OLS | Compare with 2SLS |
First-stage regression | Model → OLS | F-stat > 10 |
Hausman test | Regress OLS residuals on X and V | p-value of V |
2SLS | Model → Two-Stage Least Squares | Smaller coefficient = less bias |
Final Advice: “Finding good instruments is like detective work - look for natural experiments and argue carefully for exogeneity!”