Linear Regression Review

Regression vs Classification

  • Regression: \(f(X)=E[Y|X]\)
    • conditional expectation of Y given X
  • Classification: \(f(X)=Pr[Y=\text {label}|X]\)
    • conditional probability that y takes on a given label, given X
  • why conditional expectations?
    • \(E[Y|X]\) minimizes the mean squared error
    • \(E[\epsilon|X]=0\), so the residual \(\epsilon\) is uncorrelated with any function of X
    • we have broken Y into a component explained by X, and another component that is orthogonal to X
  • linear regression goal: find the best linear approximation to \(E[Y|X]\), i.e. the line \(\alpha + \beta X\) that minimizes the mean squared error between the predicted and observed values of Y
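
Written out, the decomposition above follows from the law of iterated expectations (note \(E[\epsilon]=E[E[\epsilon|X]]=0\)):

\[ Y = \underbrace{E[Y|X]}_{\text{explained by } X} + \epsilon, \qquad E[\epsilon|X]=0 \]

\[ \text{Cov}\big(g(X),\epsilon\big) = E\big[g(X)\,\epsilon\big] = E\big[g(X)\,E[\epsilon|X]\big] = 0 \quad \text{for any function } g(X) \]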

Ordinary Least Squares

  • estimate linear regression using OLS, which finds the values of parameters to minimize prediction errors
    • choose \(\alpha, \beta\) to minimize the Residual Sum of Squares (RSS) \[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] where \[ \hat{\beta}=\frac{\text{Cov}(X,Y)}{\text{Var}(X)}, \qquad \hat{\alpha}=\bar{y}-\hat{\beta}\bar{x} \] (see the sketch at the end of this list)
  • key assumption behind OLS: \(E(\epsilon|X)=0\)
    • re: the difference between X and Y is effectively random, and everything else in the world that explains Y (aside from X) is uncorrelated with X (aka no omitted variable bias!)
  • key assumption behind hypothesis testing in OLS: an individual’s error variance cannot tell me anything about another individual’s error variance
    • the assumption is that there is no correlation of \(\epsilon\) across individuals in our sample; if errors are in fact correlated, we overestimate the degree to which including X in the model explains the variation in Y
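
A minimal sketch of these closed-form estimates in R, using simulated data (the variable names here are illustrative, not the course dataset):

    # simulate data where the true relationship is y = 2 + 3x + noise
    set.seed(1)
    x <- rnorm(200)
    y <- 2 + 3 * x + rnorm(200)

    # closed-form OLS estimates
    beta_hat  <- cov(x, y) / var(x)
    alpha_hat <- mean(y) - beta_hat * mean(x)

    # these should match the coefficients reported by lm()
    c(alpha_hat, beta_hat)
    coef(lm(y ~ x))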

Linear Regression in R

  • use “binscattering” in R to produce more readable figures when there is a lot of data
  • the underlying relationship stays the same, with the linear OLS estimation remaining constant across the original and binned data
  • easier to visualize whether the data should be modeled linearly, quadratically, etc.
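
A minimal manual binscatter sketch (assuming dplyr and ggplot2 are available; loan_data and int_rt are from the course, while a continuous fico column is an illustrative assumption):

    library(dplyr)
    library(ggplot2)

    # split the x variable into 20 equal-sized bins, then plot the bin means of y
    # against the bin means of x; the OLS line is fit on the full (unbinned) data,
    # so it is unchanged by the binning
    binned <- loan_data %>%
      mutate(bin = ntile(fico, 20)) %>%
      group_by(bin) %>%
      summarise(mean_x = mean(fico), mean_y = mean(int_rt))

    ggplot(binned, aes(mean_x, mean_y)) +
      geom_point() +
      geom_smooth(data = loan_data, aes(fico, int_rt), method = "lm", se = FALSE)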

Group Means

  • sample means
    • regressing \(Y\) on only an intercept gives the sample mean
    • lm_1 <- lm(int_rt ~ 1, data = loan_data)
  • group means (no intercept)
    • regressing outcome on factor variable (no intercept) gives group means directly
    • lm_bin_0 <- lm(int_rt ~ 0 + fico_bin, data = loan_data)
  • group means (with intercept)
    • alternative approach: include intercept, omit one group
    • coefficients then represent the difference from the omitted category
    • intercept represents the mean of the omitted group
    • lm_bin_1 <- lm(int_rt ~ 1 + fico_bin, data = loan_data)
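
A quick check that the two parameterizations above agree (assuming lm_bin_0 and lm_bin_1 have been fit as shown):

    # group means estimated directly (no intercept)
    coef(lm_bin_0)

    # with an intercept, the intercept is the omitted group's mean and the remaining
    # coefficients are differences from it, so adding them back recovers the group means
    c(coef(lm_bin_1)[1], coef(lm_bin_1)[1] + coef(lm_bin_1)[-1])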

Multiple Linear Regression

  • multiple input variables, \(X_1 \text { and } X_2\)
  • Risk of overfitting, as adding too many irrelevant predictors can model noise instead of true patterns
  • introduces the bias-variance tradeoff
    • increasing the number of predictors \(\rightarrow\) reduction in bias, higher variance
    • fewer predictors \(\rightarrow\) more bias, less variance
    • tendency to prefer reducing bias, and accepting higher variance
  • coefficient plots
    • useful for visualizing coefficients either within the same model, or the same coefficient across different models
  • coefficient tests
    • when considering coefficients across models, we sometimes want to test if those coefficients are the same
    • use the Wald linear test of coefficients
    • testing across models on the same sample requires “simultaneous” estimation of the models
    • be sure to use heteroskedasticity-robust standard errors (see the sketch below)
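
A minimal sketch of a Wald test that two coefficients are equal within a single model, with heteroskedasticity-robust standard errors (packages car and sandwich; the data here are simulated and the names illustrative). Testing across separately estimated models requires stacking/“simultaneous” estimation, which is not shown:

    library(car)       # linearHypothesis()
    library(sandwich)  # vcovHC()

    # simulated data with two predictors
    set.seed(1)
    n  <- 500
    x1 <- rnorm(n)
    x2 <- rnorm(n)
    y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)

    m <- lm(y ~ x1 + x2)

    # Wald test of H0: coefficient on x1 equals coefficient on x2,
    # using heteroskedasticity-robust (HC1) variance estimates
    linearHypothesis(m, "x1 = x2", vcov. = vcovHC(m, type = "HC1"))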

Causality and Randomized Experiments

Correlation vs Causation

RCTs: Conceptual Framework

RCTs: Regression Implementation

Causal Inference

Oregon Health Plan Experiment

Attrition Bias

  • differential attrition: assignment to treatment impacts attrition
  • selective attrition: attrition based on some characteristic, realized ex post
  • randomizing the number of times people are contacted lets us check whether those who are harder to reach (i.e. those causing attrition) are systematically different in some way; this is called the randomized outreach approach

Scale-Up Bias

  • scaling up from small experiments to big policy is harder than expected
    • small-scale results can differ from results at scale because participants differ (and the way information travels among them changes), and because the quality of implementers may change as the policy scales up
    • related to general equilibrium effects
      • ex: giving a carpool sticker to everyone -> reduced value of the treatment
      • part of the treatment in a small experiment is the fact that it is exclusive, so when it is scaled up, it doesn’t have the same effect

Statistical Power

  • larger sample \(\rightarrow\) more precise estimate \(\rightarrow\) closer to objective truth

  • there is a chance that an experiment will fail to find significant effects even when a treatment effect really does exist (a Type II error); statistical power is the probability of detecting a true effect

    • this chance of missing a true effect is large with small sample sizes and diminishes as the sample size increases (see the sketch below)
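
A minimal power calculation sketch with base R’s power.t.test(); the effect size and standard deviation below are illustrative assumptions:

    # sample size per arm needed to detect a 0.2 s.d. effect with 80% power at the 5% level
    power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)

    # conversely, the power achieved with 100 units per arm for the same effect size
    power.t.test(n = 100, delta = 0.2, sd = 1, sig.level = 0.05)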

Non-Experimental Data

Two views of the world when looking at natural experiments:

  1. Selection on Observables (SOO): treated and untreated units vary in ways that we can observe
    • fixing selection on observables often makes selection on unobservables worse

    • “last resort design”

    • randomization requires \(Y_i(1), Y_i(0) \perp D_i\): potential outcomes are independent of treatment assignment

    • SOO assumes: \(Y_i(1), Y_i(0) \perp D_i \mid X_i\), i.e. independence holds only after conditioning on \(X_i\)

  2. Selection on Unobservables: treated and untreated units differ in ways that we cannot observe

The Overlap Assumption

  • unconfoundedness: once we have conditioned on X, D is as good as random
  • overlap: for every value of X there must be both treated and untreated units, i.e. \(0 < \Pr[D_i=1|X_i] < 1\) (see the sketch below)
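
A quick visual check of overlap, sketched with an assumed data frame df containing a treatment indicator d and covariates x1, x2 (all names illustrative): estimate the propensity score and compare its distribution across treated and untreated units.

    # estimate the propensity score with a logit model
    ps_model  <- glm(d ~ x1 + x2, data = df, family = binomial())
    df$pscore <- predict(ps_model, type = "response")

    # overlapping histograms by treatment status; overlap fails wherever one group
    # has essentially no observations at a given propensity score
    library(ggplot2)
    ggplot(df, aes(pscore, fill = factor(d))) +
      geom_histogram(alpha = 0.5, position = "identity", bins = 30)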

Propensity Score Method

  • the modern approach to “propensity score matching” is OLS with inverse probability weighting

    • instead of picking a neighbor or looking for an exact match, we weight each observation by the inverse of the probability of the treatment status it actually received (more weight on observations whose treatment status was unlikely given their \(X_i\)); see the sketch below

    • treated units are assigned weight \(\frac{1}{\hat{p}(X_i)}\), where \(\hat{p}(X_i)=\Pr[D_i=1|X_i]\) is the estimated propensity score

    • untreated units are assigned weight \(\frac{1}{1-\hat{p}(X_i)}\)
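
A minimal inverse-probability-weighted regression sketch, continuing with the propensity score estimated in the overlap sketch above (df, d, y, and pscore are illustrative names):

    # inverse probability of the treatment status actually received
    df$ipw <- ifelse(df$d == 1, 1 / df$pscore, 1 / (1 - df$pscore))

    # weighted OLS of the outcome on the treatment indicator;
    # the coefficient on d is the IPW estimate of the treatment effect
    ipw_fit <- lm(y ~ d, data = df, weights = ipw)
    summary(ipw_fit)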

Regression Discontinuity

  • approach to causal ID which relies on rules

  • idea: individuals on either side of the cutoff are essentially identical, but the treatment they receive differs because of some arbitrary assignment/cutoff rule

    • causal effect is the difference in outcomes between those on either side of the cutoff
  • four “ingredients”

    • causal question of interest: key element for causal data analysis

    • outcome variable (Y)

    • treatment variable (T)

    • assignment variable (X, also called “running” var)

      • assignment variable is used to determine treatment

      • if X exceeds a certain threshold, treatment is given

  • Sharp regression discontinuity

    • to get \(\tau\), compare units with \(D_i=0 \text{ and }D_i=1\) exactly at the cutoff

      \[ \tau^{RD}=E[Y_i(1)-Y_i(0)|X_i=c] \]

    • but since we cannot observe the counterfactual at the cutoff, we estimate this by taking one-sided limits as \(x\) approaches \(c\) from above and below (see the estimation sketch below)

      \[ \hat\tau^{RD} = \lim_{x \downarrow c}E[Y_i(1)|X_i=x]-\lim_{x \uparrow c}E[Y_i(0)|X_i=x] \]
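
A minimal sharp-RD estimation sketch: a local linear regression with separate slopes on each side of the cutoff, within a bandwidth. The data frame rd_df, running variable x, outcome y, cutoff, and bandwidth h are illustrative assumptions (dedicated packages such as rdrobust choose the bandwidth in a data-driven way):

    cutoff <- 0    # illustrative cutoff c
    h      <- 5    # illustrative bandwidth

    # keep observations within the bandwidth and center the running variable at the cutoff
    rd_sample     <- subset(rd_df, abs(x - cutoff) <= h)
    rd_sample$x_c <- rd_sample$x - cutoff
    rd_sample$d   <- as.numeric(rd_sample$x >= cutoff)  # sharp RD: treatment switches at the cutoff

    # the coefficient on d is the estimated jump in Y at the cutoff (tau hat)
    rd_fit <- lm(y ~ d + x_c + d:x_c, data = rd_sample)
    summary(rd_fit)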

Identification Assumptions

  • RD comes close to mimicking random assignment, but without true randomization, we need to satisfy certain identification assumptions

    • everything moves smoothly around cutoff c, barring the discontinuous jump as a result of the cutoff
    • the change in \(D_i\) is the only reason for discrete jumps in \(Y_i\) around cutoff c
  • Testing the ID assumptions

    • Manipulation Test

      • we are assuming that \(X_i-c\) (ie how far you are from the cutoff) is as good as randomly assigned in the neighborhood of c

      • if this is true, then units cannot sort around c (ie students cannot manipulate scores, politicians cannot change voter support)

      • we test this by looking at the distribution of \(X_i\)

      • if we see “strange” behavior around the cutoff (e.g. bunching of observations just above it), we worry that there is manipulation, where people just below the cutoff can do something to raise themselves just over the cutoff to receive the treatment
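
A minimal manipulation-check sketch: plot the distribution of the running variable near the cutoff and look for bunching on one side (rd_df, x, cutoff, and h are the same illustrative names as in the RD sketch above; packages such as rddensity implement formal density tests):

    library(ggplot2)

    # histogram of the running variable in a window around the cutoff;
    # a spike just above (or a gap just below) the cutoff suggests sorting
    ggplot(subset(rd_df, abs(x - cutoff) <= 2 * h), aes(x)) +
      geom_histogram(bins = 40) +
      geom_vline(xintercept = cutoff, linetype = "dashed")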