Linear Regression Review
Regression vs Classification
- Regression: \(f(X)=E[Y|X]\)
- conditional expectation of Y given X
- Classification: \(f(X)=Pr[Y=\text {label}|X]\)
- conditional probability that Y takes on a given label, given X
- why conditional expectations?
- \(E[Y|X]\) minimizes the mean
squared error
- the residual \(\epsilon = Y - E[Y|X]\) satisfies \(E[\epsilon |X]=0\), so it is
uncorrelated with any function of X
- we have broken Y into a component explained by X, and another
component that is orthogonal to X
- linear regression goal: find the best linear approximation of \(E[Y|X]\), i.e. the line that minimizes the mean squared error
between the predicted values of Y and the actual values of Y observed at each
point, modeled as \(E[Y|X]=\alpha + \beta X\)
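The claim that \(E[Y|X]\) minimizes the mean squared error can be checked in one step. As a sketch, take any candidate predictor \(g(X)\) (a name introduced here for illustration), write \(\epsilon = Y - E[Y|X]\), and add and subtract \(E[Y|X]\):
\[
E[(Y-g(X))^2] = E[\epsilon^2] + 2E[\epsilon\,(E[Y|X]-g(X))] + E[(E[Y|X]-g(X))^2]
\]
The middle term is zero because \(E[\epsilon|X]=0\) and \(E[Y|X]-g(X)\) is a function of X. The first term does not depend on \(g\), and the last term is nonnegative and equals zero when \(g(X)=E[Y|X]\), so no other function of X achieves a smaller mean squared error.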
Ordinary Least Squares
- estimate linear regression using OLS, which finds the values of
parameters to minimize prediction errors
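- choose \(\alpha, \beta\) to
minimize the Residual Sum of Squares (RSS) \[
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i \] which gives \[ \hat{\beta}=\frac{cov(x,y)}{var(x)}, \quad \hat{\alpha}=\bar{y}-\hat{\beta}\bar{x}
\]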
- key assumption behind OLS: \(E(\epsilon|X)=0\)
- i.e. the part of Y not explained by X is effectively random, and
everything else in the world that explains Y (aside from X) is
uncorrelated with X (aka no omitted variable bias!)
- key assumption behind hypothesis testing in OLS: one individual’s
error cannot tell us anything about another individual’s error
- i.e. no correlation of epsilon across individuals in our sample; if
errors are correlated, standard errors are typically understated ->
overestimation of the degree to which including X in your model explains
the variation of Y
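A minimal sketch of the OLS formulas above, assuming simulated data (the true values 2 and 0.5, the sample size, and all variable names are illustrative choices, not from the notes). It computes the slope as cov(x, y) / var(x), the intercept as the mean of y minus the slope times the mean of x, and compares both with `lm()`:

```r
# Simulated data; the true intercept (2) and slope (0.5) are illustrative
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)

# Closed-form OLS estimates
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)

# Residual sum of squares at the OLS solution
rss <- sum((y - fitted(fit))^2)

print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
print(coef(fit))
```

The two printed coefficient vectors should agree up to floating-point error, since `lm()` minimizes the same RSS.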
Linear Regression in R
- use “binscattering” in R to produce more readable figures when there
is a lot of data
- the underlying relationship stays the same, with the linear OLS
estimation remaining constant across the original and binned data
- easier to visualize whether the data should be modeled linearly,
quadratically, etc.
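One way to sketch a binned scatter plot in base R, assuming simulated data (the variables, the 20 equal-count bins, and the noise level are all illustrative assumptions). The fitted line is drawn from the raw data; the slope across the bin means is usually close to it, though not identical in general:

```r
set.seed(2)
n <- 5000
x <- runif(n, 0, 10)
y <- 1 + 0.8 * x + rnorm(n, sd = 3)

# Cut x into 20 equal-count bins
n_bins <- 20
breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1))
bin <- cut(x, breaks = breaks, include.lowest = TRUE)

# Mean of x and y within each bin
x_bar <- tapply(x, bin, mean)
y_bar <- tapply(y, bin, mean)

# OLS on the raw data and on the bin means
fit_raw    <- lm(y ~ x)
fit_binned <- lm(y_bar ~ x_bar)

# Plot the 20 bin means instead of 5000 raw points, with the raw OLS line
plot(x_bar, y_bar, pch = 19, xlab = "x (bin mean)", ylab = "y (bin mean)")
abline(fit_raw, col = "blue")
```

Curvature in the bin means relative to the straight line is a quick visual hint that a quadratic or other nonlinear term may be worth trying.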
Group Means
- sample means
- regressing \(Y\) on only an
intercept gives the sample mean
lm_1 <- lm(int_rt ~ 1, data = loan_data)
- group means (no intercept)
- regressing outcome on factor variable (no intercept) gives
group means directly
lm_bin_0 <- lm(int_rt ~ 0 + fico_bin, data = loan_data)
- group means (with intercept)
- alternative approach: include intercept, omit one group
- coefficients then represent the difference from the omitted
category
- intercept represents the mean of the omitted group
lm_bin_1 <- lm(int_rt ~ 1 + fico_bin, data = loan_data)
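The three regressions above can be checked against directly computed means. This is a sketch on simulated data: the names `loan_data`, `int_rt`, and `fico_bin` come from the notes, but the data values, the three bin labels, and the base rates are made up for illustration:

```r
set.seed(3)
# Hypothetical stand-in for loan_data with three FICO bins
loan_data <- data.frame(
  fico_bin = factor(sample(c("low", "mid", "high"), 300, replace = TRUE),
                    levels = c("low", "mid", "high"))
)
base_rate <- c(low = 14, mid = 11, high = 8)
loan_data$int_rt <- unname(base_rate[as.character(loan_data$fico_bin)]) + rnorm(300)

# Means computed directly
overall_mean <- mean(loan_data$int_rt)
group_means  <- tapply(loan_data$int_rt, loan_data$fico_bin, mean)

# The regressions from the notes
lm_1     <- lm(int_rt ~ 1, data = loan_data)             # sample mean
lm_bin_0 <- lm(int_rt ~ 0 + fico_bin, data = loan_data)  # group means
lm_bin_1 <- lm(int_rt ~ 1 + fico_bin, data = loan_data)  # differences from "low"

print(group_means)
print(coef(lm_bin_0))
print(coef(lm_bin_1))
```

Here "low" is the first factor level, so it is the omitted group in `lm_bin_1`: the intercept equals its mean and the other coefficients are differences from it.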
Multiple Linear Regression
- multiple input variables, \(X_1 \text{ and } X_2\)
- Risk of overfitting, as adding too many irrelevant predictors can
model noise instead of true patterns
- introduces the bias-variance tradeoff
- increasing the number of predictors \(\rightarrow\) reduction in bias, higher
variance
- fewer predictors \(\rightarrow\)
more bias, less variance
- tendency to prefer reducing bias, and accepting higher variance
- coefficient plots
- useful for visualizing coefficients either within the same model, or
the same coefficient across different models
- coefficient tests
- when considering coefficients across models, we sometimes want to
test if those coefficients are the same
- use the Wald linear test of coefficients
- testing across models on the same sample requires “simultaneous”
estimation of the models
- be sure to use heteroskedasticity-robust standard errors
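A rough sketch of a Wald test with heteroskedasticity-robust standard errors, written in base R so it has no package dependencies. It covers only the simpler case of two coefficients within a single model; comparing coefficients across models needs the joint estimation mentioned above, which this sketch does not attempt. The data, the variable names, and the choice of the HC0 (White) estimator are assumptions for illustration:

```r
set.seed(4)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
# Heteroskedastic errors: the noise scale depends on x1
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n, sd = 1 + abs(x1))

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)
e   <- resid(fit)

# HC0 sandwich covariance: (X'X)^-1 X' diag(e^2) X (X'X)^-1
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # row i of X is scaled by e_i
V_hc0 <- bread %*% meat %*% bread

# H0: coefficient on x1 equals coefficient on x2, written as R b = 0
R <- matrix(c(0, 1, -1), nrow = 1)
b <- coef(fit)
wald    <- as.numeric((R %*% b)^2 / (R %*% V_hc0 %*% t(R)))
p_value <- pchisq(wald, df = 1, lower.tail = FALSE)

print(c(wald = wald, p_value = p_value))
```

In practice, packages such as `sandwich` (for `vcovHC`) and `car` (for `linearHypothesis`) are commonly used for the same calculation.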
Causality and Randomized Experiments
Correlation vs Causation
RCTs: Conceptual Framework
RCTs: Regression Implementation
Causal Inference
Oregon Health Plan Experiment
Attrition Bias
- differential attrition: assignment to treatment
impacts attrition
- selective attrition: attrition based on some
characteristic, realized ex post
- randomized outreach approach: randomize the number of contact
attempts made to each person, to check that those who are harder to
reach (ie those causing attrition) are not systematically different in
some way from those who respond
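Differential attrition can be checked by regressing an attrition indicator on treatment assignment. The sketch below uses simulated data; the attrition rates (10% in control, 25% in treatment) and the variable names are assumptions chosen only to make the pattern visible:

```r
set.seed(8)
n <- 1000
treat <- rbinom(n, 1, 0.5)

# Assumed attrition rates: 10% in control, 25% in treatment
attrit <- rbinom(n, 1, ifelse(treat == 1, 0.25, 0.10))

# The coefficient on treat is the difference in attrition rates between arms
attr_fit  <- lm(attrit ~ treat)
diff_rate <- unname(coef(attr_fit)["treat"])

print(summary(attr_fit)$coefficients)
```

A coefficient on `treat` that is clearly different from zero suggests that assignment itself affects who drops out.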
Scale-Up Bias
- scaling up from small experiments to big policy is harder than
expected
- small scale results can differ because of differences in
participants (and the way that information travels), and implementers
may differ (the quality of implementers may vary based on the scale,
which changes as the policy scales up)
- related to general equilibrium effects
- ex: giving a carpool sticker to everyone -> reduced
value of treatment
- part of the treatment in a small experiment is the fact that it is
exclusive, so when it is scaled up, it doesn’t have the same effect
Statistical Power
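- larger sample \(\rightarrow\)
more precise estimate \(\rightarrow\)
closer to objective truth
- power: the probability that an experiment finds a significant effect
when the treatment effect really does exist
- low power means a high chance that the experiment will fail to find
significant effects, even if the treatment effect really does exist
- this chance is large with small sample sizes and diminishes as the
sample size increases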
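The link between sample size and power can be illustrated with a small simulation. This is a sketch under assumed values: a true effect of 0.3 standard deviations, a 5% significance level, 500 replications, and a helper named `simulate_power` that is introduced here and is not from the notes:

```r
set.seed(5)

# Share of simulated experiments that reject H0 at the 5% level
simulate_power <- function(n_per_arm, effect = 0.3, sd = 1, reps = 500) {
  rejections <- replicate(reps, {
    control <- rnorm(n_per_arm, mean = 0, sd = sd)
    treated <- rnorm(n_per_arm, mean = effect, sd = sd)
    t.test(treated, control)$p.value < 0.05
  })
  mean(rejections)
}

power_small <- simulate_power(25)    # small experiment
power_large <- simulate_power(400)   # large experiment

# Analytic benchmark for the large experiment from base R
analytic_large <- power.t.test(n = 400, delta = 0.3, sd = 1,
                               sig.level = 0.05)$power

print(c(small = power_small, large = power_large, analytic = analytic_large))
```

With the same true effect, the small experiment misses it most of the time while the large one almost always detects it.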
Non-Experimental Data
Two views of the world when looking at natural experiments:
- Selection on Observables (SOO): treated and untreated units vary in
ways that we can observe
fixing selection on observables often makes selection on
unobservables worse
“last resort design”
randomization implies \((Y_i(1), Y_i(0)) \perp D_i\)
SOO assumes: \((Y_i(1), Y_i(0)) \perp D_i | X_i\)
- Selection on Unobservables: treated and untreated units differ in
ways that we cannot observe
The Overlap Assumption
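- once we have conditioned on X, D is as good as random
- overlap: at every value of X there are both treated and untreated
units, \(0 < Pr[D_i=1|X_i] < 1\)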
Regression Discontinuity
approach to causal ID which relies on rules
idea: individuals on either side of the cutoff are essentially
identical, but the treatment they receive differs because of some
arbitrary assignment/cutoff rule
- causal effect is the difference in outcomes between those on either
side of the cutoff
four “ingredients”
- causal question of interest: key element for causal data
analysis
- outcome variable (Y)
- treatment variable (T, written \(D_i\) below)
- assignment variable (X, also called “running” var)
- assignment variable is used to determine treatment
- if X exceeds a certain threshold, treatment is given
Sharp regression discontinuity
to get \(\tau\), compare units
with \(D_i=0 \text{ and }D_i=1\)
exactly at the cutoff
\[
\tau^{RD}=E[Y_i(1)-Y_i(0)|X_i=c]
\]
but since we cannot observe the counterfactual, we estimate
this by taking the limits as x approaches c from above and from below
\[
\hat\tau^{RD} = \lim_{x \downarrow c}E[Y_i(1)|X_i=x]-\lim_{x \uparrow c}E[Y_i(0)|X_i=x]
\]
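A minimal sketch of a sharp RD estimate, assuming simulated data with a known jump. The cutoff of 0, the true effect of 2, the bandwidth of 0.25, and the local linear specification are illustrative choices, not the method prescribed by the notes. Units with X at or above the cutoff are treated, and a separate line is fit on each side within the bandwidth:

```r
set.seed(6)
n      <- 2000
cutoff <- 0    # c
tau    <- 2    # true jump at the cutoff (assumed)

x <- runif(n, -1, 1)              # running variable
d <- as.numeric(x >= cutoff)      # sharp assignment rule
y <- 1 + 0.8 * x + tau * d + rnorm(n, sd = 0.5)

h      <- 0.25                    # bandwidth (arbitrary for this sketch)
xc     <- x - cutoff              # running variable centered at the cutoff
window <- abs(xc) <= h

# Local linear fit with separate slopes on each side of the cutoff;
# the coefficient on d is the estimated jump at x = c
rd_fit  <- lm(y ~ d * xc, subset = window)
tau_hat <- unname(coef(rd_fit)["d"])

print(tau_hat)
```

Because `xc` is centered at the cutoff, the coefficient on `d` is the difference between the two fitted lines exactly at c, which mirrors the difference of one-sided limits.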
Identification Assumptions
RD comes close to mimicking random assignment, but without true
randomization, we need to satisfy certain identification
assumptions
- everything moves smoothly around cutoff c, barring the
discontinuous jump as a result of the cutoff
- the change in \(D_i\) is the only
reason for discrete jumps in \(Y_i\)
around cutoff c
Testing the ID assumptions
Manipulation Test
we are assuming that \(X_i-c\)
(ie how far you are from the cutoff) is as good as randomly assigned in
the neighborhood of c
if this is true, then units cannot sort around c (ie students
cannot manipulate scores, politicians cannot change voter
support)
we test this by looking at the distribution of \(X_i\)
if we see “strange” behavior around the cutoff, we worry that
there is manipulation, where people just below the cutoff can do
something to raise themselves just over the cutoff to receive the
treatment
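A crude way to look for sorting is to compare how many units fall just below and just above the cutoff. The sketch below is not a formal density test; it assumes simulated data, a window of 0.1 on each side, and a density that is symmetric around the cutoff, so that a unit in the window is equally likely to land on either side when nobody manipulates. The helper `side_counts` and the 60% share of movers are hypothetical:

```r
set.seed(7)
cutoff <- 0
h      <- 0.1   # window on each side of the cutoff (assumed)

# Count units just below and just above the cutoff
side_counts <- function(x) {
  c(below = sum(x >= cutoff - h & x < cutoff),
    above = sum(x >= cutoff & x < cutoff + h))
}

# Running variable with no sorting
x <- rnorm(3000)
counts_clean <- side_counts(x)

# Add sorting: 60% of units just below move to just above the cutoff
x_manip    <- x
just_below <- which(x >= cutoff - h & x < cutoff)
movers     <- just_below[seq_len(floor(0.6 * length(just_below)))]
x_manip[movers] <- cutoff + abs(x[movers] - cutoff)
counts_manip <- side_counts(x_manip)

# Binomial test of "equal chance of landing on either side"
p_clean <- binom.test(counts_clean[["above"]], sum(counts_clean), p = 0.5)$p.value
p_manip <- binom.test(counts_manip[["above"]], sum(counts_manip), p = 0.5)$p.value

print(rbind(clean = counts_clean, manipulated = counts_manip))
hist(x_manip, breaks = 60, main = "Running variable with sorting", xlab = "x")
abline(v = cutoff, lty = 2)
```

In the manipulated sample the histogram shows a dip just below the cutoff and a spike just above it, which is the kind of “strange” behavior the manipulation test looks for.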