Linear Regression Review


Regression vs Classification

Regression: $f(X)=E[Y|X]$
- conditional expectation of Y given X
Classification: $f(X)=Pr[Y=\text {label}|X]$
- conditional probability that y takes on a given label, given X
why conditional expectations?
- $E[Y|X]$ minimizes the mean squared error
- the residual $\epsilon = Y - E[Y|X]$ satisfies $E[\epsilon |X]=0$ and is uncorrelated with any function of X
- we have broken Y into a component explained by X, and another component that is orthogonal to X
linear regression goal: find the best linear approximation of $E[Y|X]$, i.e. choose $\alpha, \beta$ so that the prediction $\alpha + \beta X$ minimizes the mean squared error against the actual values of Y observed at each point; the approximation is $E[Y|X] \approx \alpha + \beta X$
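
The two claims above (the conditional expectation minimizes mean squared error, and its residual is uncorrelated with functions of X) can be checked on simulated data. This is a rough sketch: the data-generating process, sample size, and variable names are all made up for illustration.

```r
# Simulated illustration (all numbers are assumptions for this sketch).
set.seed(1)
n <- 10000
x <- rnorm(n)
eps <- rnorm(n)
y <- x^2 + eps                 # true conditional expectation: E[Y|X] = X^2

cef_pred <- x^2                # prediction from the true E[Y|X]
lin_fit  <- lm(y ~ x)          # best linear approximation, fit by OLS
lin_pred <- fitted(lin_fit)

# Mean squared error of each predictor
mse_cef <- mean((y - cef_pred)^2)
mse_lin <- mean((y - lin_pred)^2)

# Sample correlation between the CEF residual and two functions of X
cor_with_sin <- cor(y - cef_pred, sin(x))
cor_with_sq  <- cor(y - cef_pred, x^2)
```

Here the linear fit should do noticeably worse than the true conditional expectation because the relationship is quadratic, while both residual correlations should be close to zero.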

Ordinary Least Squares

estimate linear regression using OLS, which finds the values of parameters to minimize prediction errors
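- choose $\alpha, \beta$ to minimize the Residual Sum of Squares (RSS) \[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] where $\hat{y}_i=\hat{\alpha}+\hat{\beta}x_i$; the solution is \[ \hat{\beta}=\frac {cov(x,y)}{var(x)}, \quad \hat{\alpha}=\bar{y}-\hat{\beta}\bar{x} \]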
key assumption behind OLS: $E(\epsilon|X)=0$
- i.e. the part of Y not explained by X is effectively random, and everything else in the world that explains Y (aside from X) is uncorrelated with X (aka no omitted variable bias!)
key assumption behind hypothesis testing in OLS: one individual’s error cannot tell me anything about another individual’s error
- i.e. no correlation of epsilon across individuals in our sample; if this fails and we ignore it, standard errors are too small -> overestimation of the degree to which including X in your model explains the variation of Y
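
As a quick check of the closed-form OLS solution, the sketch below computes the slope as cov(x, y) / var(x) by hand and compares it with `lm()`. The simulated data and the true parameter values (2 and 3) are assumptions made up for this example.

```r
# Simulated data; true alpha = 2 and beta = 3 are arbitrary choices.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Closed-form OLS estimates
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Same model via lm()
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)

# RSS at a slightly perturbed slope, for comparison
rss_perturbed <- sum((y - alpha_hat - (beta_hat + 0.1) * x)^2)
```

The hand-computed coefficients should match `coef(fit)`, and any other choice of slope (such as the perturbed one) should give a larger RSS.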

Linear Regression in R

use “binscattering” in R to produce more readable figures when there is a lot of data
the underlying relationship stays the same, with the linear OLS estimation remaining constant across the original and binned data
easier to visualize whether the data should be modeled linearly, quadratically, etc.
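
A binned scatter can be put together in base R without extra packages. The sketch below is only one way to do it: it assumes equal-count (quantile) bins, 20 of them, and uses simulated data; none of these choices come from the course materials.

```r
# Simulated noisy linear relationship (assumed for illustration).
set.seed(3)
n <- 5000
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 3)

# Cut x into 20 equal-count bins and average x and y within each bin
n_bins <- 20
breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1))
bin <- cut(x, breaks = breaks, include.lowest = TRUE)
x_bin <- as.numeric(tapply(x, bin, mean))
y_bin <- as.numeric(tapply(y, bin, mean))

# OLS on the raw data and on the bin means
fit_raw <- lm(y ~ x)
fit_bin <- lm(y_bin ~ x_bin)

# Draw the figure only in an interactive session
if (interactive()) {
  plot(x_bin, y_bin, pch = 19, xlab = "x (bin mean)", ylab = "y (bin mean)")
  abline(fit_raw)
}
```

With equal-count bins the slope fitted to the bin means should be very close to the slope from the raw data, which is why the fitted line looks the same on both plots.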

Group Means

sample means
- regressing $Y$ on only an intercept gives the sample mean
- lm_1 <- lm(int_rt ~ 1, data = loan_data)
group means (no intercept)
- regressing outcome on factor variable (no intercept) gives group means directly
- lm_bin_0 <- lm(int_rt ~ 0 + fico_bin, data = loan_data)
group means (with intercept)
- alternative approach: include intercept, omit one group
- coefficients then represent the difference from the omitted category
- intercept represents the mean of the omitted group
- lm_bin_1 <- lm(int_rt ~ 1 + fico_bin, data = loan_data)
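
The three regressions above can be checked against directly computed means. The sketch below builds a made-up stand-in for `loan_data` (the real data set is not shown in these notes), reusing the column names `int_rt` and `fico_bin`; the three bin labels and all numbers are assumptions.

```r
# Hypothetical stand-in for loan_data; values are simulated.
set.seed(4)
loan_data <- data.frame(
  fico_bin = factor(sample(c("low", "mid", "high"), 300, replace = TRUE),
                    levels = c("low", "mid", "high"))
)
loan_data$int_rt <- 12 - 2 * as.integer(loan_data$fico_bin) + rnorm(300)

lm_1     <- lm(int_rt ~ 1, data = loan_data)             # sample mean
lm_bin_0 <- lm(int_rt ~ 0 + fico_bin, data = loan_data)  # group means
lm_bin_1 <- lm(int_rt ~ 1 + fico_bin, data = loan_data)  # differences from "low"

# Means computed directly, for comparison
overall_mean <- mean(loan_data$int_rt)
group_means  <- tapply(loan_data$int_rt, loan_data$fico_bin, mean)
```

In `lm_bin_1` the omitted category is the first factor level ("low" here), so the intercept is that group's mean and the other coefficients are differences from it.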

Multiple Linear Regression

multiple input variables, $X_1 \text { and } X_2$
Risk of overfitting, as adding too many irrelevant predictors can model noise instead of true patterns
introduces the bias-variance tradeoff
- increasing the number of predictors $\rightarrow$ reduction in bias, higher variance
- fewer predictors $\rightarrow$ more bias, less variance
- tendency to prefer reducing bias, and accepting higher variance
coefficient plots
- useful for visualizing coefficients either within the same model, or the same coefficient across different models
coefficient tests
- when considering coefficients across models, we sometimes want to test if those coefficients are the same
- use the Wald linear test of coefficients
- testing across models on the same sample requires “simultaneous” estimation of the models
- be sure to use heteroskedasticity-robust standard errors
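
A Wald test with robust standard errors can be sketched in base R. Note the limits of this example: it tests equality of two coefficients within a single model, not across models (which would need the simultaneous estimation mentioned above), and it computes an HC1-style robust covariance matrix by hand, which is one common choice rather than necessarily the one used in class. The data are simulated with heteroskedastic errors.

```r
# Simulated data with heteroskedastic errors; true coefficients are equal.
set.seed(5)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
eps <- rnorm(n) * (1 + abs(x1))
y <- 1 + 0.5 * x1 + 0.5 * x2 + eps

fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)
e <- residuals(fit)
k <- ncol(X)

# Sandwich estimator: (X'X)^-1 [sum e_i^2 x_i x_i'] (X'X)^-1, HC1 scaling
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)
V_robust  <- bread %*% meat %*% bread * n / (n - k)
robust_se <- sqrt(diag(V_robust))

# Wald test of H0: coefficient on x1 equals coefficient on x2
R <- matrix(c(0, 1, -1), nrow = 1)
b <- coef(fit)
wald_stat <- as.numeric((R %*% b)^2 / (R %*% V_robust %*% t(R)))
p_value   <- pchisq(wald_stat, df = 1, lower.tail = FALSE)
```

Under the null the statistic is compared with a chi-squared distribution with one degree of freedom, since there is a single linear restriction.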

Causality and Randomized Experiments

Correlation vs Causation

RCTs: Conceptual Framework

RCTs: Regression Implementation

Causal Inference

Oregon Health Plan Experiment

Attrition Bias

differential attrition: assignment to treatment impacts attrition
selective attrition: attrition based on some characteristic, realized ex post
randomized outreach approach: randomize the number of contact attempts made to reach each person, so that we can check whether those who are harder to reach (ie those who would otherwise attrit) differ systematically from those who are easy to reach

Scale-Up Bias

scaling up from small experiments to big policy is harder than expected
- small scale results can differ at scale because participants differ (as does the way that information travels) and because implementers differ (implementer quality may change as the policy scales up)
- related to general equilibrium effects
  - ex: giving a carpool sticker to everyone -> reduced value of treatment
  - part of the treatment in a small experiment is the fact that it is exclusive, so when it is scaled up, it doesn’t have the same effect

Statistical Power

larger sample $\rightarrow$ more precise estimate $\rightarrow$ closer to objective truth
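power: the chance that an experiment finds a significant effect when a treatment effect really does exist; the flip side is the chance that the experiment will fail to find significant effects, even if the treatment effect really does exist
- the chance of failing is large with small sample sizes, and diminishes as the sample size increases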
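
Power can be approximated by simulation: generate many experiments with a known treatment effect and count how often the effect is found significant. In the sketch below the function name `power_sim`, the effect size of 0.3, the 5% significance level, and the sample sizes are all assumptions chosen for illustration.

```r
# Share of simulated experiments that reject H0: tau = 0 at the 5% level.
set.seed(6)
power_sim <- function(n, tau = 0.3, reps = 500) {
  rejections <- replicate(reps, {
    d <- sample(rep(c(0, 1), length.out = n))   # random assignment, half treated
    y <- tau * d + rnorm(n)                     # true effect is tau
    p_value <- summary(lm(y ~ d))$coefficients["d", "Pr(>|t|)"]
    p_value < 0.05
  })
  mean(rejections)
}

power_small <- power_sim(n = 50)
power_large <- power_sim(n = 500)
```

The small experiment should miss the (real) effect most of the time, while the large one should detect it in most replications.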

Non-Experimental Data

Two views of the world when looking at natural experiments:

Selection on Observables (SOO): treated and untreated units vary in ways that we can observe
- fixing selection on observables often makes selection on unobservables worse
- “last resort design”
- randomization requires $(Y_i (1), Y_i (0)) \perp D_i$ (potential outcomes are independent of treatment)
- SOO Assumes: $(Y_i (1), Y_i (0)) \perp D_i | X_i$ (independence holds only after conditioning on $X_i$)
Selection on Unobservables: treated and untreated units differ in ways that we cannot observe

The Overlap Assumption

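once we have conditioned on X, D is as good as random (conditional independence); overlap additionally requires $0 < Pr[D_i=1|X_i] < 1$, so that for every value of X we observe both treated and untreated units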

Propensity Score Method

modern approach to “propensity score matching” is OLS with inverse probability weighting
- instead of picking a neighbor or looking for an exact match, we weight each observation by the inverse of the probability of the treatment status it actually received (more weight on units whose treatment status is unlikely given their X)
- treated units assigned weight $1/\hat{p}(X_i)$, where $\hat{p}(X_i)$ is the estimated propensity score $Pr[D_i=1|X_i]$
- untreated units assigned weight $1/(1-\hat{p}(X_i))$
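
The weighting scheme above can be sketched as follows. This is an illustration under assumptions, not the course's implementation: the propensity score is estimated with a logit via `glm()`, the data are simulated with a single confounder `x`, and the true treatment effect is set to 2.

```r
# Simulated selection on an observable x (all numbers are assumptions).
set.seed(7)
n <- 5000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.8 * x))       # treatment more likely when x is high
y <- 1 + 2 * d + 1.5 * x + rnorm(n)      # true treatment effect = 2

# Naive comparison ignores selection on x
naive <- coef(lm(y ~ d))["d"]

# Step 1: estimate the propensity score with a logit
ps_fit <- glm(d ~ x, family = binomial())
p_hat  <- fitted(ps_fit)

# Step 2: inverse probability weights
w <- ifelse(d == 1, 1 / p_hat, 1 / (1 - p_hat))

# Step 3: weighted OLS of the outcome on treatment
ipw <- coef(lm(y ~ d, weights = w))["d"]
```

In this setup the naive estimate should be biased upward because treated units have higher `x`, while the weighted estimate should land much closer to 2.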

Regression Discontinuity

approach to causal ID which relies on rules
idea: individuals on either side of the cutoff are essentially identical, but the treatment they receive differs because of some arbitrary assignment/cutoff rule
- causal effect is the difference in outcomes between those on either side of the cutoff
four “ingredients”
- causal question of interest: key element for causal data analysis
- outcome variable (Y)
- treatment variable (T; written $D_i$ in the formulas below)
- assignment variable (X, also called “running” var)
  - assignment variable is used to determine treatment
  - if X exceeds a certain threshold, treatment is given
Sharp regression discontinuity
- to get $\tau$, compare units with $D_i=0 \text{ and }D_i=1$ exactly at the cutoff
  
  \[ \tau^{RD}=E[Y_i(1)-Y_i(0)|X_i=c] \]
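- but since we cannot observe the counterfactual at c, we can estimate this by taking the limit as x approaches c from each side (from above for treated units, from below for untreated units)
  
  \[ \hat\tau^{RD} = \lim_{x \downarrow c}E[Y_i(1)|X_i=x]-\lim_{x \uparrow c}E[Y_i(0)|X_i=x] \]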

Identification Assumptions

RD comes close to mimicking random assignment, but without true randomization, we need to satisfy certain identification assumptions
- everything moves smoothly around cutoff c, barring the discontinuous jump as a result of the cutoff
- the change in $D_i$ is the only reason for discrete jumps in $Y_i$ around cutoff c
Testing the ID assumptions
- Manipulation Test
  - we are assuming that $X_i-c$ (ie how far you are from the cutoff) is as good as randomly assigned in the neighborhood of c
  - if this is true, then units cannot sort around c (ie students cannot manipulate scores, politicians cannot change voter support)
  - we test this by looking at the distribution of $X_i$
  - if we see “strange” behavior around the cutoff, we worry that there is manipulation, where people just below the cutoff can do something to raise themselves just over the cutoff to receive the treatment
- Covariates Smoothness Test
  - we assume that no other variables change discontinuously around cutoff
    - look at pre-treatment variables

Regression

goal: estimate the difference in outcomes for just-treated vs just-untreated individuals
- $\hat\tau=E[Y_i(1)-Y_i(0)|X_i=c]=\lim_{x \downarrow c}E[Y_i(1)|X_i=x]-\lim_{x \uparrow c}E[Y_i(0)|X_i=x]$
implement this by comparing the average outcomes just below and just above the cutoff
- $\hat\tau=\overline{Y}(D_i = 1;\; c \leq X_i \leq c + h) - \overline{Y}(D_i = 0;\; c - h \leq X_i < c)$ where $c-h \leq X_i \leq c+h$ is the bandwidth in which we are “close” to c
Regression-Based RD: $Y_i=\alpha+\tau D_i + \epsilon_i \text{ for } c-h \leq X_i \leq c+h \text{ where } D_i=1(X_i \geq c)$
Choosing h
- we want h to be small, since the RD estimate is causal only at c
  - if too big $\rightarrow$ bias (comparing dissimilar units)
- we want h to be big enough, since our standard errors will be huge if only using data at c
  - if too small $\rightarrow$ imprecise estimates (too few observations near c)
- example of bias-variance tradeoff
Functional Form
- a simple difference in means ignores any relationship between $Y_i$ and $X_i$
- if we know that $Y_i(X_i)$ is a linear function, we can improve on this by controlling for the underlying relationship: $Y_i=\alpha +\tau D_i+\beta(X_i-c)+\epsilon_i$
- we can also allow for different slopes on either side of the cutoff: $Y_i=\alpha + \tau D_i + \beta_1(X_i-c)+\beta_2(X_i-c)D_i+\epsilon_i$
- can also use nonlinear functional forms
  - $Y_i=\alpha +\tau D_i+f(X_i)+\epsilon_i$
  - $Y_i=\alpha + \tau D_i + f(X_i)+f(X_i)D_i+\epsilon_i$
Interpretation of RD Results
- must consider external validity; RD estimate is an example of a LATE
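
The regression-based RD specifications above can be compared on simulated data. Everything in this sketch is assumed for illustration: a cutoff at 0, a bandwidth of 0.25, a linear relationship between the outcome and the running variable, and a true jump of 0.5 at the cutoff.

```r
# Simulated sharp RD (all numbers are assumptions).
set.seed(8)
n <- 4000
cutoff <- 0
x <- runif(n, -1, 1)                     # running variable
d <- as.numeric(x >= cutoff)             # sharp assignment rule
y <- 1 + 0.8 * x + 0.5 * d + rnorm(n, sd = 0.3)   # true jump = 0.5

h  <- 0.25                               # bandwidth
in_window <- abs(x - cutoff) <= h
xc <- x - cutoff                         # centered running variable

# Difference in means within the bandwidth
diff_means <- coef(lm(y ~ d, subset = in_window))["d"]

# Control linearly for the running variable
linear <- coef(lm(y ~ d + xc, subset = in_window))["d"]

# Allow different slopes on either side of the cutoff
interacted <- coef(lm(y ~ d * xc, subset = in_window))["d"]
```

Because the outcome slopes upward in `x`, the simple difference in means should overstate the jump, while the two specifications that control for the running variable should land near 0.5.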

Fuzzy Regression Discontinuity

Sometimes, treatment status doesn’t change by 100% at the cutoff
- some units above the cutoff may not get treated, and some units below the cutoff may get treated
- crossing c changes the probability of treatment

Difference in Differences

Panel Data

Time Series Data: observation on a single unit over time
Repeated Cross-Section Data: repeated sampling of different units over time (eg. Census surveys)
Panel Data: multiple observations of the same unit over time
Overarching goal: think about how information on the time dimension helps us address the selection problem in causal inference
Causal inference with cross-sectional data is hard
- people, firms, etc. are different from one another
- can only get a clean comparison when you have a (quasi-) experimental setup, such as an experiment or RD
key insight: rather than comparing $i$ to $j$, compare $i$ in $t$ to $i$ in $t-1$

Post vs Pre Comparison

$\hat\tau=Y_{post}-Y_{pre}$ compares the unit to itself over time
consider a simple data generating process: $Y_t=\tau D_t+\beta X+ \gamma U$
- X: time invariant observable characteristics (eg race)
- U: time invariant unobservable characteristics (eg preference)
Then, with $D_{t=0}=0$ and $D_{t=1}=1$: \[ \begin{aligned} \hat{\tau} &= Y_{\text{post}} - Y_{\text{pre}} = Y_{t=1} - Y_{t=0} \\ &= \tau(D_{t=1} - D_{t=0}) + \beta(X - X) + \gamma(U - U) \\ &= \tau(D_{t=1} - D_{t=0}) \\ &= \tau \end{aligned} \]
That is: all observable and unobservable characteristics, as long as they are time invariant, get “differenced out,” leaving the causal effect $\tau$

What if you have time-varying characteristics?

consider: $Y_t=\tau D_t+\beta X+ \gamma U + \theta V_t$
- X: time invariant observables
- U: time invariant unobservables
- V: time varying unobservables (suppose you haven’t “controlled for” all observables)
In this case: \[ \begin{aligned} \hat{\tau} &= Y_{\text{post}} - Y_{\text{pre}} = Y_{t=1} - Y_{t=0} \\ &= \tau(D_{t=1} - D_{t=0}) + \beta(X - X) + \gamma(U - U) + \theta(V_{t=1} - V_{t=0}) \\ &= \tau(D_{t=1} - D_{t=0}) + \theta(V_{t=1} - V_{t=0}) \\ &= \tau + \theta(V_{t=1} - V_{t=0}) \end{aligned} \]
Any time-varying unobservables will create bias unless:
- V has no impact on the outcome
- V is not correlated with the treatment
SO: for the “post minus pre” estimator to be causal, we need to assume that there are no time varying unobservables that systematically correlate with the treatment
- again, we can never prove the ID assumption, since we cannot observe the counterfactual trend
  - if trend is totally flat $\rightarrow$ some degree of confidence that model captures all systematic determinants of Y for the pre-period
  - so, you have reason to think that the flat trends might continue in the post period in the absence of the treatment
  - but they don’t need to… so, any time-varying unobservable that changes after the treatment can entirely ruin your inference.
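
The bias from a time-varying unobservable can be seen in a small simulation of the post-minus-pre estimator. The sketch follows the data-generating process above but adds idiosyncratic noise so there is something to average over; the helper name `post_minus_pre` and all parameter values are assumptions.

```r
# Simulated treated units observed before and after treatment.
set.seed(9)
n <- 2000
tau <- 1.0; beta <- 0.5; gamma <- 2.0; theta <- 1.5
x <- rbinom(n, 1, 0.5)     # time-invariant observable
u <- rnorm(n)              # time-invariant unobservable

# delta_v is V_{t=1} - V_{t=0}; V_{t=0} is normalized to 0
post_minus_pre <- function(delta_v) {
  y_pre  <- beta * x + gamma * u + rnorm(n, sd = 0.5)
  y_post <- tau + beta * x + gamma * u + theta * delta_v + rnorm(n, sd = 0.5)
  mean(y_post - y_pre)
}

est_no_trend <- post_minus_pre(delta_v = 0)     # V does not change
est_trend    <- post_minus_pre(delta_v = 0.4)   # V changes after treatment
```

With no change in V the estimate should sit near the true effect of 1, even though `u` is never used in estimation; once V shifts, the estimate should move to about `tau + theta * 0.4`.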

Going Forward

Cross-Sectional estimator
- compare $i$ to $j$ (static)
- suffers from selection bias ($i$ and $j$ are systematically different)
Post vs pre estimator
- compare $i$ to itself over time
- Suffers from time-varying unobservables (AKA non-zero trends)
- KEY: treatment happened! probably for a reason
Difference in Difference (DD)
- uses across-unit, within-time comparisons and within-unit, across-time comparisons
- essentially compares treated to untreated units over time

Difference in Difference

Difference-in-Differences Setup

Consider treated unit i’s data-generating process:

\[ Y_{it} = \tau D_t + \beta X_i + \gamma U_i + \theta V_t \]

Untreated unit j’s data-generating process:

\[ Y_{jt} = \beta X_j + \gamma U_j + \theta V_t \]

Variable Definitions:

$X$: time-invariant observable characteristics (e.g., gender)
$U$: time-invariant unobservable characteristics (e.g., preferences)
$V_t$: time-varying unobservable characteristics
(Suppose all observables are controlled for)

Post-vs-Pre Estimate for Treated Units:

\[ Y_{i,t=1} - Y_{i,t=0} = \tau(D_{t=1} - D_{t=0}) + \beta(X_i - X_i) + \gamma(U_i - U_i) + \theta(V_{t=1} - V_{t=0}) \]

Post-vs-Pre Estimate for Untreated Units:

\[ Y_{j,t=1} - Y_{j,t=0} = \beta(X_j - X_j) + \gamma(U_j - U_j) + \theta(V_{t=1} - V_{t=0}) \]

Difference-in-Differences (DD) Estimator:

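\[ \begin{aligned} \widehat{\text{DD}}_\tau &= (Y_{i,t=1} - Y_{i,t=0}) - (Y_{j,t=1} - Y_{j,t=0}) \\ &= \left[\tau(D_{t=1} - D_{t=0}) + \theta(V_{t=1} - V_{t=0})\right] - \theta(V_{t=1} - V_{t=0}) \\ &= \tau(D_{t=1} - D_{t=0}) \\ &= \tau \end{aligned} \]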

ID Assumption: counterfactual trend = untreated trend
- in other words, the treated units’ trend would have evolved similarly to the untreated units’ trend in the absence of the treatment
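
The DD estimator can be computed either from the four group means or from a regression with a treated-by-post interaction. The sketch below simulates the setup above under assumed parameter values, drops the $\beta X$ term for brevity, adds idiosyncratic noise, and lets treated units differ in the unobservable U so that a cross-sectional comparison would be biased.

```r
# Simulated two-period panel with a common shock (all numbers assumed).
set.seed(10)
n <- 1000                          # units per group
tau <- 1.0; gamma <- 2.0; theta <- 1.5
delta_v <- 0.4                     # V_{t=1} - V_{t=0}, common to both groups

treated <- rep(c(1, 0), each = n)
u <- rnorm(2 * n) + treated        # treated units differ in unobservables
y_pre  <- gamma * u + rnorm(2 * n, sd = 0.5)
y_post <- tau * treated + gamma * u + theta * delta_v + rnorm(2 * n, sd = 0.5)

# Post-vs-pre change for each group, then the difference in differences
change <- y_post - y_pre
post_pre_treated   <- mean(change[treated == 1])
post_pre_untreated <- mean(change[treated == 0])
dd <- post_pre_treated - post_pre_untreated

# Same estimate from a regression with an interaction term
panel <- data.frame(
  y = c(y_pre, y_post),
  treated = rep(treated, times = 2),
  post = rep(c(0, 1), each = 2 * n)
)
dd_reg <- coef(lm(y ~ treated * post, data = panel))["treated:post"]
```

The treated group's own post-minus-pre change picks up the common shock, but subtracting the untreated group's change removes it, and the interaction coefficient reproduces the same number.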

EC 524 Notes

Mira Cross

Spring 2025