Derivation of OLS Estimator

POL 682: Linear Regression Analysis

Chris Weber

University of Arizona

School of Government and Public Policy

2026-01-26

The Regression Functions

Population Regression Function (PRF)

\[Y_i = \alpha+\beta X_i + \epsilon_i\]

Sample Regression Function (SRF)

\[Y_i = a+b X_i + e_i\]

Understanding the Error Term

Rearranging the SRF reveals the error:

\[Y_i-\overbrace{(a+b X_i)}^{\hat{Y}_{i}}=e_i\]

  • \(\hat{Y}_i\) = predicted value (regression line/plane)
  • \(e_i\) = residual (observed - predicted)
  • Goal: Find the line that makes the residuals \(e_i\) as small as possible
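As a minimal sketch of these definitions, the snippet below computes predicted values and residuals for one candidate line. The five data points and the values of `a` and `b` are invented for illustration; they are not from the lecture.

```python
# Hypothetical data, invented for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

# An arbitrary candidate line (not yet the OLS solution).
a, b = 1.0, 1.0

# Predicted values: Y-hat_i = a + b * X_i
y_hat = [a + b * x for x in X]

# Residuals: e_i = observed - predicted
e = [y - yh for y, yh in zip(Y, y_hat)]

print(y_hat)  # [2.0, 3.0, 4.0, 5.0, 6.0]
print(e)      # [0.0, 1.0, 1.0, -1.0, -1.0]
```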

Does Ice Cream Kill?

Consider a hypothetical study (with fictitious data)

  • Researcher collects monthly data on:
    • Ice cream consumption (lbs/capita)
    • Child fatalities (deaths/month)
  • Discovers a strong positive correlation
  • Regression line significantly different from zero
  • Concludes: Ice cream consumption is dangerous!

Question: Is this a causal relationship?

The Spurious Correlation

The Data

Confounding Variables

The Purported versus Real Story: Ice cream doesn’t cause child deaths

This is a spurious correlation - both variables are influenced by a confounder

Summer → ↑ Ice cream consumption

Summer → ↑ Swimming/outdoor activities → ↑ Child fatalities

Correlation ≠ Causation

  • Before interpreting regression coefficients as causal, consider:
    • What is the causal structure?
    • Are there potential confounders?
    • Are there omitted variables?
  • Regression shows association, not necessarily causation
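A small simulation can make the confounding story concrete. This is a sketch under stated assumptions: the variable names, effect sizes (2.0 and 1.5), and the seed are all hypothetical, and the data are simulated so that ice cream and fatalities each depend only on a common "summer" variable.

```python
import random

random.seed(682)  # arbitrary seed, only for reproducibility


def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def residualize(v, z):
    """Residuals from a simple OLS regression of v on z."""
    n = len(v)
    mv, mz = sum(v) / n, sum(z) / n
    b = sum((zi - mz) * (vi - mv) for zi, vi in zip(z, v)) / sum(
        (zi - mz) ** 2 for zi in z
    )
    a = mv - b * mz
    return [vi - (a + b * zi) for zi, vi in zip(z, v)]


n = 2000
summer = [random.gauss(0, 1) for _ in range(n)]              # confounder
ice_cream = [2.0 * s + random.gauss(0, 1) for s in summer]   # depends on summer only
fatalities = [1.5 * s + random.gauss(0, 1) for s in summer]  # depends on summer only

# Raw correlation between the two outcomes
r_raw = corr(ice_cream, fatalities)

# Correlation after removing the part of each variable explained by summer
r_partial = corr(residualize(ice_cream, summer), residualize(fatalities, summer))

print(f"raw correlation:       {r_raw:.2f}")
print(f"after removing summer: {r_partial:.2f}")
```

The raw correlation is strongly positive even though neither variable affects the other; once the confounder is partialled out, the association should be close to zero.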

Back to Minimizing Error

Why not just minimize \(\sum_{i=1}^n e_i\)?

Problem: Any line through \((\bar{X}, \bar{Y})\) gives:

\[\sum e_i=\sum [(Y_i - \bar{Y}) - b(X_i - \bar{X})] = 0\]
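To see why the raw sum of residuals cannot pick out a best line, the sketch below forces several lines with very different slopes through \((\bar{X}, \bar{Y})\) and sums their residuals. The data and the list of slopes are hypothetical.

```python
# Hypothetical data, invented for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
x_bar = sum(X) / len(X)
y_bar = sum(Y) / len(Y)


def residual_sum(b):
    """Sum of residuals for the line with slope b through (x_bar, y_bar)."""
    a = y_bar - b * x_bar  # this intercept forces the line through the means
    return sum(y - (a + b * x) for x, y in zip(X, Y))


# Very different slopes, yet each sum is zero up to floating-point rounding.
sums = {b: residual_sum(b) for b in (-3.0, 0.0, 0.6, 10.0)}
print(sums)
```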

Alternatives:

  • Use \(|e|\): Absolute Value Regression (later in semester)
  • Use \(e^2\): Ordinary Least Squares (today!)

The OLS Principle

Minimize the Sum of Squared Residuals (SSR):

\[\text{min} \sum_{i=1}^n e_i^2\]

Equivalently:

\[SSR = \sum_{i=1}^n e_i^2=\sum_{i=1}^n(Y_i-a-bX_i)^2\]

We solve for \(a\) and \(b\) that minimize SSR
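The SSR can be treated as a function of the candidate intercept and slope. The sketch below, using hypothetical data, evaluates it at an arbitrary guess and at the values \(a = 2.2\), \(b = 0.6\), which are the OLS solution for these five points.

```python
# Hypothetical data, invented for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]


def ssr(a, b):
    """Sum of squared residuals for the line Y-hat = a + b * X."""
    return sum((y - a - b * x) ** 2 for x, y in zip(X, Y))


ssr_guess = ssr(1.0, 1.0)  # an arbitrary guess
ssr_ols = ssr(2.2, 0.6)    # the OLS solution for these five points

print(ssr_guess, ssr_ols)
```

Nudging either coefficient away from the OLS values increases the SSR, which is what "minimize" means here.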

Visual Representation

Deriving the OLS Estimator

We have two unknowns: \(a\) and \(b\)

Take partial derivatives of SSR and set to zero:

\[\frac{\partial SSR}{\partial a}=-2\sum(Y_i-a-bX_i)=0\]

\[\frac{\partial SSR}{\partial b}=-2\sum(Y_i-a-bX_i)X_i=0\]
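Both derivatives follow from applying the chain rule to a single squared term and then summing over \(i\):

\[\frac{\partial}{\partial a}(Y_i-a-bX_i)^2 = 2(Y_i-a-bX_i)\cdot(-1), \qquad \frac{\partial}{\partial b}(Y_i-a-bX_i)^2 = 2(Y_i-a-bX_i)\cdot(-X_i)\]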

The Intercept Formula

Setting \(\frac{\partial SSR}{\partial a} = 0\):

\[\begin{align} 0 &= -2 \sum (Y_i-a-bX_i) \\ 0 &= \sum Y_i-na-b \sum X_i \\ na &= \sum Y_i - b \sum X_i \\ a &= \frac{\sum Y_i}{n} - b\frac{\sum X_i}{n} \\ a &= \bar{Y} - b \bar{X} \end{align}\]

Key insight: The regression line passes through \((\bar{X}, \bar{Y})\)

The Slope Formula

Setting \(\frac{\partial SSR}{\partial b} = 0\):

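\[\begin{align} 0 &= \sum (Y_i-a-bX_i)(-X_i) \\ 0 &= \sum X_i Y_i - a \sum X_i - b\sum X_i^2 \\ 0 &= \sum X_i Y_i - (\bar{Y}-b\bar{X})\, n\bar{X} - b\sum X_i^2 \\ b\left(\sum X_i^2-n\bar{X}^2\right) &= \sum X_iY_i-n \bar{X}\bar{Y} \\ b &= \frac{\sum X_iY_i-n \bar{X}\bar{Y}}{\sum X_i^2-n\bar{X}^2} \end{align}\]

The third line substitutes \(a = \bar{Y} - b\bar{X}\) and \(\sum X_i = n\bar{X}\).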

Alternative form (using deviations):

\[b = \frac{\sum x_i y_i}{\sum x_i^2}\]

where \(x_i=X_i-\bar{X}\) and \(y_i=Y_i-\bar{Y}\)
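A short numerical check, on hypothetical data, that the raw-sums form and the deviation form of the slope agree, followed by the intercept from \(a = \bar{Y} - b\bar{X}\):

```python
# Hypothetical data, invented for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Raw-sums form of the slope
b_raw = (sum(x * y for x, y in zip(X, Y)) - n * x_bar * y_bar) / (
    sum(x * x for x in X) - n * x_bar ** 2
)

# Deviation form of the slope
dx = [x - x_bar for x in X]
dy = [y - y_bar for y in Y]
b_dev = sum(u * v for u, v in zip(dx, dy)) / sum(u * u for u in dx)

# Intercept: a = Y-bar - b * X-bar
a = y_bar - b_dev * x_bar

print(b_raw, b_dev, a)
```

For these points both forms give a slope of about 0.6 and an intercept of about 2.2.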

The Normal Equations

The two equations we solved:

\[\sum Y_i = na + b\sum X_i\]

\[\sum X_i Y_i = a \sum X_i + b\sum X_i^2\]

These are the normal equations for OLS
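The normal equations are a 2×2 linear system in \(a\) and \(b\), so they can also be solved directly. The sketch below uses Cramer's rule on hypothetical data; this is one convenient way to solve the system, not a method the lecture prescribes.

```python
# Hypothetical data, invented for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

sx = sum(X)
sy = sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)

# Normal equations as a linear system:
#   n  * a + sx  * b = sy
#   sx * a + sxx * b = sxy
det = n * sxx - sx * sx  # zero only if every X_i is identical

a = (sy * sxx - sx * sxy) / det
b = (n * sxy - sx * sy) / det

print(a, b)
```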

Important Properties

From the OLS derivation, we get:

  1. \(\sum e_i = 0\): The sum of residuals is zero (line passes through means)

  2. \(\sum X_i e_i = 0\): The sample covariance between X and the residuals is zero
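Both properties can be verified numerically after fitting. A minimal check on hypothetical data:

```python
# Hypothetical data, invented for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# OLS slope (deviation form) and intercept
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
    (x - x_bar) ** 2 for x in X
)
a = y_bar - b * x_bar

# Residuals from the fitted line
e = [y - (a + b * x) for x, y in zip(X, Y)]

sum_e = sum(e)                               # property 1
sum_xe = sum(x * ei for x, ei in zip(X, e))  # property 2

# Both are zero up to floating-point rounding.
print(sum_e, sum_xe)
```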

Critical assumption: \(Cov(X, \epsilon) = 0\)

OLS imposes the sample analogue, \(\sum X_i e_i = 0\), by construction, but the population condition may not hold in reality!

When the Assumption Fails

The assumption \(Cov(X, \epsilon) = 0\) can be violated when:

  • Omitted variable bias: Excluded variables correlated with X
  • Measurement error: Error in measuring X
  • Simultaneity: X and Y jointly determined
  • Confounding: Unobserved variables affect both X and Y

Consequence: Estimates are biased and inconsistent
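The omitted-variable case can be sketched with a simulation. Everything here is hypothetical: the true model \(Y = 1 + 2X + 3Z + \text{noise}\), the 0.8 link between \(X\) and \(Z\), and the seed. Leaving \(Z\) out pushes it into the error term, which is then correlated with \(X\).

```python
import random

random.seed(2026)  # arbitrary seed, only for reproducibility


def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b


n = 5000
z = [random.gauss(0, 1) for _ in range(n)]         # the omitted variable
x = [0.8 * zi + random.gauss(0, 1) for zi in z]    # X is correlated with Z
y = [1 + 2 * xi + 3 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

# Regress Y on X alone, so Z is absorbed into the error term.
a, b = ols(x, y)
e = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sum_xe = sum(xi * ei for xi, ei in zip(x, e))

print(f"estimated slope: {b:.2f} (true slope is 2)")
print(f"sum of X_i * e_i: {sum_xe:.1e}")
```

The estimated slope lands well above 2, yet the residuals are still orthogonal to \(X\) in the sample. The fitted residuals therefore cannot reveal the problem on their own.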

Conditional Expectation

The regression line represents a conditional mean:

\[\hat{Y}_i = E(Y_i | X_i) = a + b X_i\]

If X and Y are independent: \(E(Y_i | X_i) = E(Y_i) = \bar{Y}\)

  • The slope \(b = 0\)
  • Knowing X doesn’t help predict Y

Partitioning Variance

We can decompose the total variation in Y:

\[\underbrace{\sum (Y_i-\bar{Y})^2}_{\text{TSS}} = \underbrace{\sum (Y_i-\hat{Y}_i)^2}_{\text{RSS}} + \underbrace{\sum (\hat{Y}_i-\bar{Y})^2}_{\text{RegSS}}\]

Where:

  • TSS = Total Sum of Squares
  • RSS = Residual Sum of Squares
  • RegSS = Regression Sum of Squares
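The decomposition can be checked numerically for an OLS fit. A minimal sketch on hypothetical data:

```python
# Hypothetical data, invented for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# OLS fit
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
    (x - x_bar) ** 2 for x in X
)
a = y_bar - b * x_bar
y_hat = [a + b * x for x in X]

tss = sum((y - y_bar) ** 2 for y in Y)                 # total
rss = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))    # residual
regss = sum((yh - y_bar) ** 2 for yh in y_hat)         # regression

print(tss, rss, regss)
```

For these points the three quantities are about 6, 2.4, and 3.6. The identity relies on the OLS residual properties, so it generally fails for a line that is not the least-squares fit.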

R-squared: Goodness of Fit

Coefficient of Determination:

\[R^2 = \frac{\text{RegSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}\]

  • \(R^2 \in [0, 1]\)
  • Proportion of variance in Y explained by X
  • \(R^2 = 0\): X explains nothing
  • \(R^2 = 1\): X perfectly predicts Y

Relationship to Correlation

In simple linear regression:

\[R^2 = r^2\]

where \(r\) is the Pearson correlation coefficient

  • \(r = \sqrt{R^2}\) for positive relationships
  • \(r = -\sqrt{R^2}\) for negative relationships
  • \(r\) is a standardized measure of covariance
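A quick check, on hypothetical data, that both expressions for \(R^2\) agree and that \(R^2\) equals the squared Pearson correlation in the one-predictor case:

```python
from math import sqrt

# Hypothetical data, invented for illustration only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Sums of squares and cross-products in deviation form
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
sxx = sum((x - x_bar) ** 2 for x in X)
syy = sum((y - y_bar) ** 2 for y in Y)  # this is TSS

# OLS fit and its sums of squares
b = sxy / sxx
a = y_bar - b * x_bar
y_hat = [a + b * x for x in X]
rss = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))
regss = sum((yh - y_bar) ** 2 for yh in y_hat)

r2_from_regss = regss / syy
r2_from_rss = 1 - rss / syy

# Pearson correlation: covariance standardized by both spreads
r = sxy / sqrt(sxx * syy)

print(r2_from_regss, r2_from_rss, r ** 2)
```

All three printed values are about 0.6 here, and \(r\) takes the sign of the slope.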

Key Assumptions

For OLS to be unbiased and efficient:

  1. Linearity: \(E(Y_i | X_i) = \alpha + \beta X_i\)
    • Linear in parameters, not necessarily variables
    • Can estimate \(Y = a + bX^2\), but not \(Y = a + b^2X\)
  2. Exogeneity: \(Cov(X, \epsilon) = 0\)
    • No correlation between X and error
  3. Homoscedasticity: \(Var(\epsilon_i) = \sigma^2\)
    • Constant variance
  4. Independence: Observations are independent

Linearity in Parameters

OK

  • \(Y = a + bX\)
  • \(Y = a + bX^2\)
  • \(Y = a + b\log(X)\)
  • \(Y = a + b_1X + b_2X^2\)

NOT OK

  • \(Y = a + b^2X\)
  • \(Y = a^b X\)
  • \(Y = e^{bX}\)

(Need different estimation methods)
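"Linear in parameters" means a model such as \(Y = a + bX^2\) can still be fit with the usual formulas after transforming the predictor. The sketch below builds hypothetical, noise-free data from \(Y = 1 + 2X^2\) and recovers the coefficients by regressing \(Y\) on \(W = X^2\).

```python
def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b


# Hypothetical data generated exactly from Y = 1 + 2 * X^2 (no noise).
X = [0, 1, 2, 3, 4]
Y = [1 + 2 * x ** 2 for x in X]

# Transform the predictor; the model is linear in a and b once W = X^2.
W = [x ** 2 for x in X]
a, b = ols(W, Y)

print(a, b)  # 1.0 2.0
```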

Population vs. Sample

Remember the distinction:

Population (PRF)

\[Y_i = \alpha + \beta X_i + \epsilon_i\]

  • True parameters
  • Unknown
  • What we want

Sample (SRF)

\[Y_i = a + b X_i + e_i\]

  • Estimates
  • Calculated from data
  • What we observe

Sampling Distribution

If we repeatedly sample and estimate:

  • We get many SRFs: \(\text{SRF}_1, \text{SRF}_2, \ldots, \text{SRF}_m\) (one per sample)
  • Each has different \(a\) and \(b\)
  • The distribution of these estimates:
    • Is centered on true values (unbiased)
    • Has variance that decreases with sample size
  • We use this distribution for inference
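Repeated sampling can be imitated by simulation. In this sketch the "true" parameters (\(\alpha = 1\), \(\beta = 2\)), the standard normal distributions for \(X\) and the error, the sample sizes, and the seed are all assumptions chosen for illustration.

```python
import random

random.seed(1)  # arbitrary seed, only for reproducibility

ALPHA, BETA = 1.0, 2.0  # hypothetical "true" population parameters


def draw_slope(n):
    """Draw one sample of size n from the PRF and return the OLS slope b."""
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [ALPHA + BETA * xi + random.gauss(0, 1) for xi in x]
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )


def variance(v):
    """Sample variance of a list of numbers."""
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)


reps = 500
slopes_small = [draw_slope(25) for _ in range(reps)]   # many SRFs, n = 25
slopes_large = [draw_slope(100) for _ in range(reps)]  # many SRFs, n = 100

mean_small = sum(slopes_small) / reps
mean_large = sum(slopes_large) / reps

print(f"n=25:  mean b = {mean_small:.3f}, var b = {variance(slopes_small):.4f}")
print(f"n=100: mean b = {mean_large:.3f}, var b = {variance(slopes_large):.4f}")
```

Both sets of slopes should center near the true value of 2, with visibly less spread at the larger sample size.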

Sampling Error ≠ Residual

Important distinction:

Residual (\(e_i\))

  • Difference between observed and fitted
  • \(e_i = Y_i - \hat{Y}_i\)
  • Within one sample

Sampling Error

  • Difference between estimate and truth
  • \(b - \beta\) or \(a - \alpha\)
  • Across samples

Summary: The OLS Recipe

  1. Goal: Find the line that minimizes \(\sum e_i^2\)

  2. Method: Take derivatives, set to zero

  3. Result:

    • \(b = \frac{\sum x_iy_i}{\sum x_i^2}\)
    • \(a = \bar{Y} - b\bar{X}\)
  4. Properties:

    • \(\sum e_i = 0\)
    • \(\sum X_ie_i = 0\)
    • Line passes through \((\bar{X}, \bar{Y})\)
  5. Assumptions: Linearity, exogeneity, homoscedasticity, independence

Key Takeaways

  • OLS minimizes squared residuals
  • Provides unbiased estimates under standard assumptions
  • Correlation ≠ Causation - always consider confounding
  • \(R^2\) measures the proportion of variance explained
  • The method is mechanical - interpretation requires thought!

Next Steps

  • Multiple regression (more than one X)
  • Hypothesis testing and confidence intervals
  • Diagnostics and assumption checking
  • Dealing with violations (heteroscedasticity, etc.)

Questions?

Next: Multiple Linear Regression