Derivation of OLS Estimator

POL 682: Linear Regression Analysis

Chris Weber

chrisweber@arizona.edu

University of Arizona

School of Government and Public Policy

2026-01-26

The Regression Functions

Population Regression Function (PRF)

\[Y_i = \alpha+\beta X_i + \epsilon_i\]

Sample Regression Function (SRF)

\[Y_i = a+b X_i + e_i\]

Understanding the Error Term

Rearranging the SRF reveals the error:

\[Y_i-\overbrace{(a+b X_i)}^{\hat{Y}_{i}}=e_i\]

\(\hat{Y}_i\) = predicted value (regression line/plane)
\(e_i\) = residual (observed - predicted)
Goal: Find the line that makes the residuals \(e_i\) as small as possible

Does Ice Cream Kill?

Consider a hypothetical study (with fictitious data)

Researcher collects monthly data on:
- Ice cream consumption (lbs/capita)
- Child fatalities (deaths/month)
Discovers a strong positive correlation
Regression line significantly different from zero
Concludes: Ice cream consumption is dangerous!

Question: Is this a causal relationship?

The Spurious Correlation

The Data

Confounding Variables

The Purported versus Real Story: Ice cream doesn’t cause child deaths

This is a spurious correlation - both variables are influenced by a confounder

Summer → ↑ Ice cream consumption
Summer → ↑ Swimming/outdoor activities → ↑ Child fatalities

Correlation ≠ Causation

Before interpreting regression coefficients as causal, consider:
- What is the causal structure?
- Are there potential confounders?
- Are there omitted variables?
Regression shows association, not necessarily causation
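The confounding story can be sketched with a small simulation. Everything here is fictitious: the variable `temperature` stands in for "summer", and the coefficients, noise levels, sample size, and seed are assumptions chosen only for illustration. Ice cream has no direct effect on fatalities in this setup, yet the raw correlation is strong; once temperature is partialled out of both variables, the correlation largely disappears.

```python
import random

random.seed(682)  # arbitrary seed so the fictitious data are reproducible

def mean(v):
    return sum(v) / len(v)

def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def residualize(y, x):
    """Residuals from a simple OLS regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical data-generating process: temperature drives both variables,
# and ice cream has NO direct effect on fatalities.
n = 500
temperature = [random.uniform(0, 35) for _ in range(n)]
ice_cream = [1.0 + 0.10 * t + random.gauss(0, 0.5) for t in temperature]
fatalities = [2.0 + 0.20 * t + random.gauss(0, 1.0) for t in temperature]

raw_r = corr(ice_cream, fatalities)
partial_r = corr(residualize(ice_cream, temperature),
                 residualize(fatalities, temperature))

print(f"raw correlation:     {raw_r:.2f}")
print(f"partial correlation: {partial_r:.2f}")
```

The raw correlation reflects the shared dependence on temperature; the partial correlation is what remains after that shared dependence is removed.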

Back to Minimizing Error

Why not just minimize \(\sum_{i=1}^n e_i\)?

Problem: Any line through \((\bar{X}, \bar{Y})\) gives:

\[\sum e_i=\sum [(Y_i - \bar{Y}) - b(X_i - \bar{X})] = 0\]
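A quick numerical check of this point, using a small made-up data set (the values of `X` and `Y` are hypothetical): for any slope, a line forced through \((\bar{X}, \bar{Y})\) has residuals that sum to zero, so the plain sum cannot distinguish a good line from a bad one.

```python
X = [1, 2, 3, 4, 5]  # hypothetical predictor values
Y = [2, 4, 5, 4, 5]  # hypothetical outcome values

x_bar = sum(X) / len(X)
y_bar = sum(Y) / len(Y)

def residual_sum(b):
    """Sum of residuals for the line with slope b passing through the means."""
    a = y_bar - b * x_bar  # intercept that forces the line through (x_bar, y_bar)
    return sum(y - (a + b * x) for x, y in zip(X, Y))

# Very different slopes, all giving a residual sum of (numerically) zero
sums = {b: residual_sum(b) for b in (-3.0, 0.0, 0.6, 10.0)}
for b, s in sums.items():
    print(f"slope {b:5.1f}: sum of residuals = {s:.10f}")
```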

Alternatives:

Use \(|e|\) → Absolute Value Regression (later in semester)
Use \(e^2\) → Ordinary Least Squares (today!)

The OLS Principle

Minimize the Sum of Squared Residuals (SSR):

\[\text{min} \sum_{i=1}^n e_i^2\]

Equivalently:

\[SSR = \sum_{i=1}^n e_i^2=\sum_{i=1}^n(Y_i-a-bX_i)^2\]

We solve for \(a\) and \(b\) that minimize SSR
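As a minimal sketch, SSR can be written as a function of the candidate intercept and slope. The data are made up, and the helper name `ssr` is only illustrative. For this toy data set the OLS solution is \(a = 2.2\), \(b = 0.6\); other candidate lines give a larger SSR.

```python
X = [1, 2, 3, 4, 5]  # hypothetical predictor values
Y = [2, 4, 5, 4, 5]  # hypothetical outcome values

def ssr(a, b):
    """Sum of squared residuals for the line Y-hat = a + b * X."""
    return sum((y - a - b * x) ** 2 for x, y in zip(X, Y))

ssr_ols = ssr(2.2, 0.6)    # the OLS line for these data
ssr_flat = ssr(4.0, 0.0)   # horizontal line at the mean of Y
ssr_steep = ssr(1.0, 1.0)  # an arbitrary steeper line

print(ssr_ols, ssr_flat, ssr_steep)
```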

Visual Representation

Deriving the OLS Estimator

We have two unknowns: \(a\) and \(b\)

Take partial derivatives of SSR and set to zero:

\[\frac{\partial SSR}{\partial a}=-2\sum(Y_i-a-bX_i)=0\]

\[\frac{\partial SSR}{\partial b}=-2\sum(Y_i-a-bX_i)X_i=0\]
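One way to build confidence in these two derivatives is to compare them with a finite-difference approximation. This is a sketch on made-up data; the function names and the step size `h` are assumptions for illustration. The analytic and numerical gradients should agree at any point, and both partial derivatives should vanish at the OLS solution (\(a = 2.2\), \(b = 0.6\) for this toy data set).

```python
X = [1, 2, 3, 4, 5]  # hypothetical predictor values
Y = [2, 4, 5, 4, 5]  # hypothetical outcome values

def ssr(a, b):
    return sum((y - a - b * x) ** 2 for x, y in zip(X, Y))

def grad(a, b):
    """Analytic partial derivatives of SSR with respect to a and b."""
    d_a = -2 * sum(y - a - b * x for x, y in zip(X, Y))
    d_b = -2 * sum((y - a - b * x) * x for x, y in zip(X, Y))
    return d_a, d_b

def numeric_grad(a, b, h=1e-5):
    """Central finite-difference approximation to the same derivatives."""
    d_a = (ssr(a + h, b) - ssr(a - h, b)) / (2 * h)
    d_b = (ssr(a, b + h) - ssr(a, b - h)) / (2 * h)
    return d_a, d_b

analytic = grad(0.0, 1.0)        # an arbitrary non-optimal point
numeric = numeric_grad(0.0, 1.0)
at_ols = grad(2.2, 0.6)          # both derivatives are zero at the minimum

print(analytic, numeric, at_ols)
```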

The Intercept Formula

Setting \(\frac{\partial SSR}{\partial a} = 0\):

\[\begin{align} 0 &= -2 \sum (Y_i-a-bX_i) \\ 0 &= \sum Y_i-na-b \sum X_i \\ na &= \sum Y_i - b \sum X_i \\ a &= \frac{\sum Y_i}{n} - b\frac{\sum X_i}{n} \\ a &= \bar{Y} - b \bar{X} \end{align}\]

Key insight: The regression line passes through \((\bar{X}, \bar{Y})\)

The Slope Formula

Setting \(\frac{\partial SSR}{\partial b} = 0\):

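\[\begin{align} 0 &= \sum (Y_i-a-bX_i)(-X_i) \\ 0 &= \sum X_i Y_i - a \sum X_i - b\sum X_i^2 \\ 0 &= \sum X_i Y_i - (\bar{Y} - b\bar{X})\, n\bar{X} - b\sum X_i^2 \quad \text{(substitute } a = \bar{Y} - b\bar{X},\ \sum X_i = n\bar{X}\text{)} \\ b\left(\sum X_i^2 - n\bar{X}^2\right) &= \sum X_i Y_i - n\bar{X}\bar{Y} \\ b &= \frac{\sum X_iY_i-n \bar{X}\bar{Y}}{\sum X_i^2-n\bar{X}^2} \end{align}\]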

Alternative form (using deviations):

\[b = \frac{\sum x_i y_i}{\sum x_i^2}\]

where \(x_i=X_i-\bar{X}\) and \(y_i=Y_i-\bar{Y}\)
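The two slope formulas can be checked against each other on a small made-up data set (the values of `X` and `Y` are hypothetical). Both should give the same \(b\), and the intercept then follows from \(a = \bar{Y} - b\bar{X}\).

```python
X = [1, 2, 3, 4, 5]  # hypothetical predictor values
Y = [2, 4, 5, 4, 5]  # hypothetical outcome values

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Raw-score form: (sum XY - n * x_bar * y_bar) / (sum X^2 - n * x_bar^2)
b_raw = (sum(x * y for x, y in zip(X, Y)) - n * x_bar * y_bar) / (
    sum(x * x for x in X) - n * x_bar ** 2
)

# Deviation form: sum(x_i * y_i) / sum(x_i^2), with x_i and y_i as deviations
x_dev = [x - x_bar for x in X]
y_dev = [y - y_bar for y in Y]
b_dev = sum(xd * yd for xd, yd in zip(x_dev, y_dev)) / sum(xd ** 2 for xd in x_dev)

# Intercept from the first-order condition for a
a = y_bar - b_dev * x_bar

print(b_raw, b_dev, a)  # slope is about 0.6 and intercept about 2.2 for these data
```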

The Normal Equations

The two equations we solved:

\[\sum Y_i = na + b\sum X_i\]

\[\sum X_i Y_i = a \sum X_i + b\sum X_i^2\]

These are the normal equations for OLS
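Because the normal equations are two linear equations in two unknowns, they can also be solved directly as a linear system. The sketch below uses Cramer's rule on made-up data; this is just one convenient way to solve a 2×2 system, not a recommendation for larger problems.

```python
X = [1, 2, 3, 4, 5]  # hypothetical predictor values
Y = [2, 4, 5, 4, 5]  # hypothetical outcome values

n = len(X)
sum_x = sum(X)
sum_y = sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_xx = sum(x * x for x in X)

# System:  n * a     + sum_x  * b = sum_y
#          sum_x * a + sum_xx * b = sum_xy
det = n * sum_xx - sum_x * sum_x

a = (sum_y * sum_xx - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

# Plug the solution back into both normal equations (differences should be ~0)
check_1 = sum_y - (n * a + b * sum_x)
check_2 = sum_xy - (a * sum_x + b * sum_xx)

print(a, b, check_1, check_2)
```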

Important Properties

From the OLS derivation, we get:

\(\sum e_i = 0\) The sum of residuals is zero (line passes through means)
\(\sum X_i e_i = 0\) The sample covariance between X and the residuals is zero

Critical assumption: \(Cov(X, \epsilon) = 0\)

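The sample version (for the residuals \(e_i\)) holds by construction in every OLS fit; the population version (for the errors \(\epsilon_i\)) is an assumption that may not hold in reality!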
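Both residual properties can be verified numerically. This is a minimal check on made-up data; the helper `ols` is an illustrative name, not a library function.

```python
X = [1, 2, 3, 4, 5]  # hypothetical predictor values
Y = [2, 4, 5, 4, 5]  # hypothetical outcome values

def ols(X, Y):
    """Return (a, b) from the simple-regression OLS formulas."""
    n = len(X)
    x_bar, y_bar = sum(X) / n, sum(Y) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
        (x - x_bar) ** 2 for x in X
    )
    return y_bar - b * x_bar, b

a, b = ols(X, Y)
residuals = [y - (a + b * x) for x, y in zip(X, Y)]

sum_e = sum(residuals)                             # first property
sum_xe = sum(x * e for x, e in zip(X, residuals))  # second property

print(sum_e, sum_xe)  # both are zero up to floating-point rounding
```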

When the Assumption Fails

The assumption \(Cov(X, \epsilon) = 0\) can be violated when:

Omitted variable bias: Excluded variables correlated with X
Measurement error: Error in measuring X
Simultaneity: X and Y jointly determined
Confounding: Unobserved variables affect both X and Y

Consequence: Estimates are biased and inconsistent
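Omitted variable bias can be illustrated with a simulation. All numbers here are assumptions chosen for illustration: the true slope on X is 2, an omitted variable `Z` has coefficient 3, and X is built so that it is correlated with `Z`. Regressing Y on X alone pushes the estimated slope toward roughly 3.5 rather than 2, even with a large sample. Note that the fitted residuals are still orthogonal to X by construction, so the residuals themselves give no warning.

```python
import random

random.seed(682)  # arbitrary seed for reproducibility

n = 5000
BETA_X, BETA_Z = 2.0, 3.0  # hypothetical true coefficients

Z = [random.gauss(0, 1) for _ in range(n)]  # the omitted variable
X = [z + random.gauss(0, 1) for z in Z]     # X is correlated with Z
Y = [1.0 + BETA_X * x + BETA_Z * z + random.gauss(0, 1) for x, z in zip(X, Z)]

def ols(X, Y):
    """Return (a, b) from the simple-regression OLS formulas."""
    m = len(X)
    x_bar, y_bar = sum(X) / m, sum(Y) / m
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
        (x - x_bar) ** 2 for x in X
    )
    return y_bar - b * x_bar, b

# "Short" regression that leaves Z out
a_short, b_short = ols(X, Y)

residuals = [y - (a_short + b_short * x) for x, y in zip(X, Y)]
sum_xe = sum(x * e for x, e in zip(X, residuals))

print(f"true slope on X: {BETA_X}, estimated slope: {b_short:.2f}")
print(f"sum of X_i * e_i: {sum_xe:.8f}")
```

In this setup the expected bias is \(3 \times Cov(X, Z)/Var(X) = 3 \times 0.5 = 1.5\), which is why the estimate settles near 3.5.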

Conditional Expectation

The regression line represents a conditional mean:

\[\hat{Y}_i = \widehat{E}(Y_i | X_i) = a + b X_i\]

If X and Y are independent: \(E(Y_i | X_i) = E(Y_i)\), which is estimated by \(\bar{Y}\)

The population slope \(\beta = 0\), so \(b\) should be close to 0
Knowing X doesn’t help predict Y

Partitioning Variance

We can decompose the total variation in Y:

\[\underbrace{\sum (Y_i-\bar{Y})^2}_{\text{TSS}} = \underbrace{\sum (Y_i-\hat{Y}_i)^2}_{\text{RSS}} + \underbrace{\sum (\hat{Y}_i-\bar{Y})^2}_{\text{RegSS}}\]

Where:

TSS = Total Sum of Squares
RSS = Residual Sum of Squares (the SSR that OLS minimizes)
RegSS = Regression Sum of Squares

R-squared: Goodness of Fit

Coefficient of Determination:

\[R^2 = \frac{\text{RegSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}\]

\(R^2 \in [0, 1]\)
Proportion of variance in Y explained by X
\(R^2 = 0\): X explains nothing
\(R^2 = 1\): X perfectly predicts Y
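The decomposition and both expressions for \(R^2\) can be checked on a small made-up data set (the values of `X` and `Y` are hypothetical).

```python
X = [1, 2, 3, 4, 5]  # hypothetical predictor values
Y = [2, 4, 5, 4, 5]  # hypothetical outcome values

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
    (x - x_bar) ** 2 for x in X
)
a = y_bar - b * x_bar
Y_hat = [a + b * x for x in X]

tss = sum((y - y_bar) ** 2 for y in Y)              # total variation
rss = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat)) # unexplained variation
regss = sum((yh - y_bar) ** 2 for yh in Y_hat)      # explained variation

r2_from_regss = regss / tss
r2_from_rss = 1 - rss / tss

print(tss, rss, regss, r2_from_regss, r2_from_rss)
```

For these data TSS is 6, RSS is about 2.4, and RegSS is about 3.6, so X accounts for roughly 60% of the variation in Y.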

Relationship to Correlation

In simple linear regression:

\[R^2 = r^2\]

where \(r\) is the Pearson correlation coefficient

\(r = \sqrt{R^2}\) for positive relationships
\(r = -\sqrt{R^2}\) for negative relationships
\(r\) is a standardized measure of covariance
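A short check that \(R^2 = r^2\) in simple regression, again on made-up data. The correlation is computed directly from its definition as the covariance divided by the product of the standard deviations (the common \(1/(n-1)\) factors cancel).

```python
X = [1, 2, 3, 4, 5]  # hypothetical predictor values
Y = [2, 4, 5, 4, 5]  # hypothetical outcome values

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
s_yy = sum((y - y_bar) ** 2 for y in Y)

# Pearson correlation
r = s_xy / (s_xx * s_yy) ** 0.5

# R-squared from the fitted regression
b = s_xy / s_xx
a = y_bar - b * x_bar
rss = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
r_squared = 1 - rss / s_yy

print(r, r ** 2, r_squared)  # r shares the sign of the slope b
```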

Key Assumptions

For OLS to be unbiased and efficient:

Linearity: \(E(Y_i | X_i) = \alpha + \beta X_i\)
- Linear in parameters, not necessarily variables
- Can estimate \(Y = a + bX^2\), but not \(Y = a + b^2X\)
Exogeneity: \(Cov(X, \epsilon) = 0\)
- No correlation between X and error
Homoscedasticity: \(Var(\epsilon_i) = \sigma^2\)
- Constant variance
Independence: Observations are independent

Linearity in Parameters

OK ✓

\(Y = a + bX\)
\(Y = a + bX^2\)
\(Y = a + b\log(X)\)
\(Y = a + b_1X + b_2X^2\)

NOT OK ✗

\(Y = a + b^2X\)
\(Y = a^b X\)
\(Y = e^{bX}\)

(Need a transformation or a different estimation method)
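To illustrate why \(Y = a + bX^2\) is fine for OLS: the model is nonlinear in X but linear in the parameters, so one can square X first and run an ordinary regression on the transformed variable. The data below are made up and generated exactly from \(Y = 1 + 2X^2\) with no noise, so the fit should recover \(a = 1\) and \(b = 2\).

```python
X = [0, 1, 2, 3, 4]             # hypothetical predictor values
Y = [1 + 2 * x ** 2 for x in X]  # exact quadratic relationship, no noise

# Transform the regressor, then apply the usual simple-regression formulas
W = [x ** 2 for x in X]

n = len(W)
w_bar, y_bar = sum(W) / n, sum(Y) / n
b = sum((w - w_bar) * (y - y_bar) for w, y in zip(W, Y)) / sum(
    (w - w_bar) ** 2 for w in W
)
a = y_bar - b * w_bar

print(a, b)
```

The same trick does not work for \(Y = a + b^2X\): a regression of Y on X recovers the coefficient \(b^2\), but not \(b\) itself (its sign is lost).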

Population vs. Sample

Remember the distinction:

Population (PRF)

\[Y_i = \alpha + \beta X_i + \epsilon_i\]

True parameters
Unknown
What we want

Sample (SRF)

\[Y_i = a + b X_i + e_i\]

Estimates
Calculated from data
What we observe

Sampling Distribution

If we repeatedly sample and estimate:

We get many SRFs, one per sample: \(\text{SRF}_1, \text{SRF}_2, \ldots, \text{SRF}_m\)
Each has different \(a\) and \(b\)
The distribution of these estimates:
- Is centered on true values (unbiased)
- Has variance that decreases with sample size
We use this distribution for inference
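The idea of repeated sampling can be sketched with a simulation. The population values (\(\alpha = 1\), \(\beta = 2\), error standard deviation 1), the distribution of X, the sample sizes, and the number of replications are all assumptions chosen for illustration. Under these assumptions, the slope estimates should average out near the true \(\beta\), and their variance should be smaller for the larger sample size.

```python
import random

random.seed(682)  # arbitrary seed for reproducibility

ALPHA, BETA, SIGMA = 1.0, 2.0, 1.0  # hypothetical population values

def ols_slope(X, Y):
    n = len(X)
    x_bar, y_bar = sum(X) / n, sum(Y) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
        (x - x_bar) ** 2 for x in X
    )

def simulate_slopes(n, reps):
    """Draw `reps` samples of size n from the PRF and return each sample's b."""
    slopes = []
    for _ in range(reps):
        X = [random.uniform(0, 10) for _ in range(n)]
        Y = [ALPHA + BETA * x + random.gauss(0, SIGMA) for x in X]
        slopes.append(ols_slope(X, Y))
    return slopes

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

slopes_small = simulate_slopes(n=20, reps=1000)
slopes_large = simulate_slopes(n=200, reps=1000)

print(f"n=20:  mean b = {mean(slopes_small):.3f}, var = {variance(slopes_small):.5f}")
print(f"n=200: mean b = {mean(slopes_large):.3f}, var = {variance(slopes_large):.5f}")
```

Each individual \(b\) differs from \(\beta\) (sampling error), but the collection of estimates is centered on the true value.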

Sampling Error ≠ Residual

Important distinction:

Residual (\(e_i\))

Difference between observed and fitted
\(e_i = Y_i - \hat{Y}_i\)
Within one sample

Sampling Error

Difference between estimate and truth
\(b - \beta\) or \(a - \alpha\)
Across samples

Summary: The OLS Recipe

Goal: Find the line that minimizes \(\sum e_i^2\)
Method: Take derivatives, set to zero
Result:
- \(b = \frac{\sum x_iy_i}{\sum x_i^2}\)
- \(a = \bar{Y} - b\bar{X}\)
Properties:
- \(\sum e_i = 0\)
- \(\sum X_ie_i = 0\)
- Line passes through \((\bar{X}, \bar{Y})\)
Assumptions: Linearity, exogeneity, homoscedasticity, independence

Key Takeaways

OLS minimizes squared residuals
Provides unbiased estimates under standard assumptions
Correlation ≠ Causation - always consider confounding
R² measures proportion of variance explained
The method is mechanical - interpretation requires thought!

Next Steps

Multiple regression (more than one X)
Hypothesis testing and confidence intervals
Diagnostics and assumption checking
Dealing with violations (heteroscedasticity, etc.)

Questions?

Next: Multiple Linear Regression