POL 682: Linear Regression Analysis
2026-01-26
Population Regression Function (PRF)
\[Y_i = \alpha+\beta X_i + \epsilon_i\]
Sample Regression Function (SRF)
\[Y_i = a+b X_i + e_i\]
Rearranging the SRF reveals the residual:
\[Y_i-\overbrace{(a+b X_i)}^{\hat{Y}_{i}}=e_i\]
Consider a hypothetical study (with fictitious data)
Question: Is this a causal relationship?
The Purported versus Real Story: Ice cream doesn’t cause child deaths
This is a spurious correlation - both variables are influenced by a confounder
Summer → ↑ Ice cream consumption
Summer → ↑ Swimming/outdoor activities → ↑ Child fatalities
Why not just minimize \(\sum_{i=1}^n e_i\)?
Problem: Any line through \((\bar{X}, \bar{Y})\), whatever its slope \(b\), gives:
\[\sum e_i=\sum [(Y_i - \bar{Y}) - b(X_i - \bar{X})] = 0\]
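A quick numerical check of this problem, as a minimal sketch: the data are simulated and the slopes are arbitrary values chosen for illustration. Every line forced through \((\bar{X}, \bar{Y})\) has residuals that sum to zero (up to floating-point error), so this criterion cannot pick out a single best line.

```python
import numpy as np

# Simulated data (illustrative values only, not from the lecture)
rng = np.random.default_rng(0)
X = rng.normal(size=50)
Y = 2.0 + 1.5 * X + rng.normal(size=50)

def residual_sum(slope):
    """Sum of residuals for the line with this slope through (X-bar, Y-bar)."""
    intercept = Y.mean() - slope * X.mean()
    return np.sum(Y - (intercept + slope * X))

# Very different slopes, all passing through the point of means
sums = {slope: residual_sum(slope) for slope in (-3.0, 0.0, 1.5, 10.0)}
for slope, total in sums.items():
    print(f"slope={slope:5.1f}  sum of residuals={total:.2e}")
```

Each printed sum is zero apart from rounding, even for slopes that fit the data badly.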
Alternatives:
Minimize the Sum of Squared Residuals (SSR):
\[\text{min} \sum_{i=1}^n e_i^2\]
Equivalently:
\[SSR = \sum_{i=1}^n e_i^2=\sum_{i=1}^n(Y_i-a-bX_i)^2\]
We solve for \(a\) and \(b\) that minimize SSR
We have two unknowns: \(a\) and \(b\)
Take partial derivatives of SSR and set to zero:
\[\frac{\partial SSR}{\partial a}=-2\sum(Y_i-a-bX_i)=0\]
\[\frac{\partial SSR}{\partial b}=-2\sum(Y_i-a-bX_i)X_i=0\]
Setting \(\frac{\partial SSR}{\partial a} = 0\):
\[\begin{align} 0 &= -2 \sum (Y_i-a-bX_i) \\ 0 &= \sum Y_i-na-b \sum X_i \\ na &= \sum Y_i - b \sum X_i \\ a &= \frac{\sum Y_i}{n} - b\frac{\sum X_i}{n} \\ a &= \bar{Y} - b \bar{X} \end{align}\]
Key insight: The regression line passes through \((\bar{X}, \bar{Y})\)
Setting \(\frac{\partial SSR}{\partial b} = 0\):
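\[\begin{align} 0 &= \sum (Y_i-a-bX_i)(-X_i) \\ 0 &= \sum X_i Y_i - a \sum X_i - b\sum X_i^2 \\ 0 &= \sum X_i Y_i - (\bar{Y} - b\bar{X})\, n\bar{X} - b\sum X_i^2 \\ b\left(\sum X_i^2-n\bar{X}^2\right) &= \sum X_iY_i-n \bar{X}\bar{Y} \\ b &= \frac{\sum X_iY_i-n \bar{X}\bar{Y}}{\sum X_i^2-n\bar{X}^2} \end{align}\]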
Alternative form (using deviations):
\[b = \frac{\sum x_i y_i}{\sum x_i^2}\]
where \(x_i=X_i-\bar{X}\) and \(y_i=Y_i-\bar{Y}\)
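The two forms of the slope can be compared numerically. The sketch below uses simulated data (the true intercept 1.0 and slope 0.5 are arbitrary assumptions) and checks the hand-computed estimates against NumPy's `np.polyfit`, which fits a degree-1 polynomial by least squares.

```python
import numpy as np

# Simulated data (illustrative values only)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=100)
Y = 1.0 + 0.5 * X + rng.normal(size=100)
n = len(X)

# Slope from the raw-sums formula
b_raw = (np.sum(X * Y) - n * X.mean() * Y.mean()) / (np.sum(X**2) - n * X.mean() ** 2)

# Slope from the deviation form
x = X - X.mean()
y = Y - Y.mean()
b_dev = np.sum(x * y) / np.sum(x**2)

# Intercept: a = Y-bar - b * X-bar
a = Y.mean() - b_dev * X.mean()

# Reference fit; polyfit returns coefficients from highest degree down
slope_np, intercept_np = np.polyfit(X, Y, deg=1)

print(f"b (raw sums)  = {b_raw:.6f}")
print(f"b (deviation) = {b_dev:.6f}")
print(f"a             = {a:.6f}")
print(f"polyfit       = slope {slope_np:.6f}, intercept {intercept_np:.6f}")
```

All three slope values agree to floating-point precision, as does the intercept.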
The two equations we solved:
\[\sum Y_i = na + b\sum X_i\]
\[\sum X_i Y_i = a \sum X_i + b\sum X_i^2\]
These are the normal equations for OLS
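Because the normal equations are linear in \(a\) and \(b\), they can also be solved directly as a 2×2 linear system. This is a sketch on simulated data (the generating values are assumptions for illustration), using `np.linalg.solve`.

```python
import numpy as np

# Simulated data (illustrative values only)
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=80)
Y = 4.0 - 0.3 * X + rng.normal(size=80)
n = len(X)

# Normal equations in matrix form:
#   [ n        sum(X)   ] [a]   [ sum(Y)   ]
#   [ sum(X)   sum(X^2) ] [b] = [ sum(X*Y) ]
A = np.array([[n, X.sum()],
              [X.sum(), np.sum(X**2)]])
rhs = np.array([Y.sum(), np.sum(X * Y)])

a, b = np.linalg.solve(A, rhs)
print(f"a = {a:.6f}, b = {b:.6f}")
```

The solution matches the closed-form expressions \(a = \bar{Y} - b\bar{X}\) and \(b = \sum x_i y_i / \sum x_i^2\).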
From the OLS derivation, we get:
\(\sum e_i = 0\) The sum of residuals is zero (line passes through means)
\(\sum X_i e_i = 0\) The sample covariance between X and the residuals is zero
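Both properties can be checked numerically. A minimal sketch on simulated data (the generating values are arbitrary assumptions): after fitting by OLS, the residuals sum to zero and are orthogonal to \(X\), up to floating-point error.

```python
import numpy as np

# Simulated data (illustrative values only)
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=200)
Y = 2.0 + 0.7 * X + rng.normal(size=200)

# OLS estimates from the closed-form solution
x = X - X.mean()
b = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
a = Y.mean() - b * X.mean()

# Residuals from the fitted line
e = Y - (a + b * X)

sum_e = e.sum()          # first normal equation
sum_Xe = np.sum(X * e)   # second normal equation
print(f"sum(e)     = {sum_e:.2e}")
print(f"sum(X * e) = {sum_Xe:.2e}")
```

These hold for any data set, because they are the first-order conditions that define \(a\) and \(b\).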
Critical assumption: \(Cov(X, \epsilon) = 0\)
OLS forces the sample analogue, \(\sum X_i e_i = 0\), by construction, but the population condition may not hold in reality!
The assumption \(Cov(X, \epsilon) = 0\) can be violated when:
Consequence: Estimates are biased and inconsistent
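The ice cream example can be mimicked with a small simulation. This is a sketch under stated assumptions: the variable names and coefficients are hypothetical, a confounder `Z` (summer) drives both `X` (ice cream) and `Y` (fatalities), and the true effect of `X` on `Y` is set to zero. Regressing `Y` on `X` alone still gives a clearly nonzero slope, and it does not shrink toward zero in a large sample.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
beta_true = 0.0                       # assumed true effect of X on Y

Z = rng.normal(size=n)                # confounder (hypothetical "summer" index)
X = 1.0 * Z + rng.normal(size=n)      # ice cream consumption, driven by Z
eps = 2.0 * Z + rng.normal(size=n)    # error term contains the omitted Z
Y = beta_true * X + eps               # fatalities: no direct effect of X

# Simple regression of Y on X, ignoring Z
x = X - X.mean()
b = np.sum(x * (Y - Y.mean())) / np.sum(x**2)

# The population condition Cov(X, eps) = 0 fails here
cov_X_eps = np.cov(X, eps)[0, 1]

print(f"true beta     = {beta_true}")
print(f"OLS slope b   = {b:.3f}")
print(f"Cov(X, eps)   = {cov_X_eps:.3f}")
```

In this setup the slope settles near \(Cov(X, \epsilon)/Var(X) = 2/2 = 1\) rather than the true value of zero.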
The regression line represents a conditional mean:
\[\hat{Y}_i = \widehat{E}(Y_i | X_i) = a + b X_i\]
If X and Y are independent: \(E(Y_i | X_i) = E(Y_i)\), which is estimated by \(\bar{Y}\)
We can decompose the total variation in Y (RSS here is the same quantity as the SSR minimized earlier):
\[\underbrace{\sum (Y_i-\bar{Y})^2}_{\text{TSS}} = \underbrace{\sum (Y_i-\hat{Y}_i)^2}_{\text{RSS}} + \underbrace{\sum (\hat{Y}_i-\bar{Y})^2}_{\text{RegSS}}\]
Where:
Coefficient of Determination:
\[R^2 = \frac{\text{RegSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}\]
In simple linear regression:
\[R^2 = r^2\]
where \(r\) is the Pearson correlation coefficient
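The decomposition and the identity \(R^2 = r^2\) can be verified numerically. A minimal sketch on simulated data (the generating values are arbitrary assumptions), using `np.corrcoef` for the Pearson correlation:

```python
import numpy as np

# Simulated data (illustrative values only)
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=150)
Y = 3.0 - 0.8 * X + rng.normal(scale=2.0, size=150)

# OLS fit from the closed-form solution
x = X - X.mean()
b = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
a = Y.mean() - b * X.mean()
Y_hat = a + b * X

# Sums of squares
TSS = np.sum((Y - Y.mean()) ** 2)        # total
RSS = np.sum((Y - Y_hat) ** 2)           # residual
RegSS = np.sum((Y_hat - Y.mean()) ** 2)  # regression

R2 = RegSS / TSS
r = np.corrcoef(X, Y)[0, 1]              # Pearson correlation

print(f"TSS         = {TSS:.4f}")
print(f"RSS + RegSS = {RSS + RegSS:.4f}")
print(f"R^2         = {R2:.6f}")
print(f"1 - RSS/TSS = {1 - RSS / TSS:.6f}")
print(f"r^2         = {r**2:.6f}")
```

The last three printed values coincide, and TSS equals RSS + RegSS.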
For OLS to be unbiased and efficient:
OK ✓
NOT OK ✗
(Need different estimation methods)
Remember the distinction:
Population (PRF)
\[Y_i = \alpha + \beta X_i + \epsilon_i\]
Sample (SRF)
\[Y_i = a + b X_i + e_i\]
If we repeatedly sample and estimate:
Important distinction:
Residual (\(e_i\))
Sampling Error
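Repeated sampling is easy to imitate by simulation. In the sketch below the population values \(\alpha = 1\) and \(\beta = 0.5\), the sample size, and the number of replications are all assumptions chosen for illustration. Each replication draws a new sample from the PRF and re-estimates the slope; the sampling error of each estimate is \(b - \beta\), which is distinct from the residuals \(e_i\) within any one sample.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 1.0, 0.5   # assumed population parameters
n, reps = 50, 2000       # sample size and number of repeated samples

def ols_slope(X, Y):
    """Closed-form OLS slope in deviation form."""
    x = X - X.mean()
    return np.sum(x * (Y - Y.mean())) / np.sum(x**2)

slopes = np.empty(reps)
for k in range(reps):
    # Draw a fresh sample from the population regression function
    X = rng.uniform(0, 10, size=n)
    Y = alpha + beta * X + rng.normal(size=n)
    slopes[k] = ols_slope(X, Y)

sampling_errors = slopes - beta  # one sampling error per sample

print(f"mean of b across samples = {slopes.mean():.4f}")
print(f"std. dev. of b           = {slopes.std(ddof=1):.4f}")
print(f"mean sampling error      = {sampling_errors.mean():.4f}")
```

With exogenous errors, as simulated here, the estimates scatter around the true slope, so the average sampling error is close to zero.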
Goal: Find the line that minimizes \(\sum e_i^2\)
Method: Take derivatives, set to zero
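Result: \(a = \bar{Y} - b\bar{X}\) and \(b = \frac{\sum x_i y_i}{\sum x_i^2}\)
Properties: \(\sum e_i = 0\) and \(\sum X_i e_i = 0\)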
Assumptions: Linearity, exogeneity, homoscedasticity, independence
Next: Multiple Linear Regression
POL 682 | Introduction to Linear Regression