class: center, middle

# Potential Outcomes

### Dr. Francisco J. Cabrera-Hernández

#### Econometrics

#### Master's in Economics, Spring 2024

#####CIDE Santa Fe, Ciudad de México.

---

##Introduction

- **Questions about questions**
- Selection Bias
- Random Assignment into treatment
- Regression analysis of experiments
- Bad Controls

---

## What is the causal relationship of interest?

- A coherent, interesting, and doable research agenda is the solid foundation on which useful statistical analyses are built.
- In the beginning, we should ask: **What is the causal relationship of interest?**
- **Descriptive** research has a place in the world, but the most interesting questions are causal.
- A causal relationship tells us what would happen in alternative (or "counterfactual") worlds.

---

## Experiments

- The description of an ideal experiment helps you formulate causal questions precisely.
- The mechanics of an ideal experiment highlight the forces you'd like to manipulate: "Find your source of variation".
- In the case of schooling and wages, for example, we can imagine offering students a reward for staying at school.
- In the case of political institutions, we might like to go back in time and randomly assign different government structures to former colonies!
- Most experiments are hypothetical, but they are worth contemplating because they help us pick fruitful research topics.

---

## Fundamentally Unidentified Questions (FUQ'd)

- Questions about the causal effect of race or gender are hard to answer because these things are hard to isolate.
- "Imagine your chromosomes were switched at birth": this is not possible, but studies use fake job applicants and resumes.

<img src="data:image/png;base64,#cvs.jpg" width="40%" style="display: block; margin: auto;" />

---

## Really FUQ'd!

- Do children do better in primary school if they start school earlier?
- A 7-year-old brain is better prepared. If some start at age 6 and others at age 7, we cannot compare them!
- If we wait until they both are 7, some will be in second grade.
- The effect of start age on elementary school test scores is FUQ'd.

---

## What is your identification strategy?

- The manner in which a researcher uses observational data (i.e., data not generated by a randomized trial) to approximate a real experiment.
- Quarter of birth is related to years of education! Angrist & Krueger (1991)
- Credible identification strategies are **emblematic** of modern empirical work.

<img src="data:image/png;base64,#variation.png" width="75%" style="display: block; margin: auto;" />

---

## What is your mode of statistical inference?

- The population to be studied.
- The sample to be used.
- The estimator (OLS, LAD, MLE, non-parametric options).
- The assumptions made when constructing standard errors (clustered or grouped).

---

class: center, middle

#The experimental ideal

<img src="data:image/png;base64,#earth.png" width="80%" style="display: block; margin: auto;" />

---

#Overview

- Questions about questions
- **Selection Bias**
- Random Assignment into treatment
- Regression analysis of experiments
- Bad Controls

---

## Selection Bias

- The most credible research designs use random assignment.
- Why are they so powerful? Imagine we want to know whether hospital visits in the last year make people healthier.

`$$Health_{it} = \alpha + D_1 Hospital_t + ... + U_{it}$$`

What is the problem?

<img src="data:image/png;base64,#health.png" width="115%" style="display: block; margin: auto;" />

- Taken at face value, this suggests that going to the hospital makes people sicker (**self-selection problem**).

---

## Formally

- Hospital: `\(D_i \in \{0,1\}\)`
- Potential outcome:

`$$Y_{1i} \text{ if } D_i = 1$$`

`$$Y_{0i} \text{ if } D_i = 0$$`

<img src="data:image/png;base64,#ATE1.png" width="50%" style="display: block; margin: auto;" />

- `\(Y_{1i} - Y_{0i}\)` is the causal effect of hospitalization for an individual.
- But we never see both potential outcomes for any one person.
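---

## Selection Bias: a simulation sketch

The hospital example can be illustrated with a small simulation. This is a minimal sketch under assumed numbers, not part of the original analysis: the function `naive_vs_true`, the constant effect of 0.5, and the rule that people with worse baseline health are more likely to visit a hospital are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def naive_vs_true(n=20_000, effect=0.5, randomize=False, seed=0):
    """Simulate hospital visits; return (naive difference in means, effect on the treated).

    All numbers are illustrative assumptions, not estimates from data.
    """
    rng = random.Random(seed)
    treated_obs, control_obs, treated_gain = [], [], []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)   # Y_0i: health without hospitalization
        y1 = y0 + effect           # Y_1i: health with hospitalization
        if randomize:
            d = rng.random() < 0.5               # D_i set by a coin flip
        else:
            d = y0 + rng.gauss(0.0, 1.0) < 0.0   # worse baseline health: more likely to go
        if d:
            treated_obs.append(y1)         # we observe Y_1i only when D_i = 1
            treated_gain.append(y1 - y0)   # never observable in real data
        else:
            control_obs.append(y0)         # we observe Y_0i only when D_i = 0
    naive = mean(treated_obs) - mean(control_obs)
    return naive, mean(treated_gain)


naive_selected, att_selected = naive_vs_true(randomize=False)
naive_random, att_random = naive_vs_true(randomize=True)
print(f"self-selection:    naive = {naive_selected:.2f}, effect on treated = {att_selected:.2f}")
print(f"random assignment: naive = {naive_random:.2f}, effect on treated = {att_random:.2f}")
```

Under these assumptions, the self-selected comparison should come out negative even though every individual gains from treatment, while the coin-flip assignment should land close to the assumed effect.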
---

##Naive Comparison

A naive comparison of averages by hospitalization:

<img src="data:image/png;base64,#ATE2.png" width="125%" style="display: block; margin: auto;" />

Where: `\(E[Y_{1i}|D_i=1]-E[Y_{0i}|D_i=1]=E[Y_{1i}-Y_{0i}|D_i = 1]\)`

---

##Naive Comparison

<img src="data:image/png;base64,#ATE2.png" width="125%" style="display: block; margin: auto;" />

- The average causal effect on the hospitalized compares their health after hospitalization, `\(E[Y_{1i}|D_i=1]\)`, with what it would have been had they NOT been hospitalized, `\(E[Y_{0i}|D_i=1]\)`.
- *Selection bias* is the difference in average `\(Y_{0i}\)` between those who were and were NOT hospitalized.
- Those who were not hospitalized have better health `\(E[Y_{0i}|D_i=0]\)`, making selection bias negative.
- The selection bias may be so large (in absolute value) that it completely masks a positive treatment effect.

---

#Overview

- Questions about questions
- Selection Bias
- **Random Assignment into treatment**
- Regression analysis of experiments
- Bad Controls

---

##St. Random Assignment into treatment

- Makes `\(D_i\)` independent of potential outcomes:

<img src="data:image/png;base64,#ATE3.png" width="100%" style="display: block; margin: auto;" />

- Independence of `\(Y_{0i}\)` and `\(D_i\)` allows us to write `\(E[Y_{0i} | D_i = 1]\)` instead of `\(E[Y_{0i}|D_i = 0]\)` in the second line. The expression then simplifies to:

<img src="data:image/png;base64,#ATE4.png" width="85%" style="display: block; margin: auto;" />

This is the **Average Treatment Effect of hospitalization on randomly chosen patients**.

---

## More on selection bias

- An iconic example from labor economics is the evaluation of government-subsidized training programs.
- The idea is to increase employment and earnings. Yet **non-experimental** evidence shows that the trainees earn less than plausible comparison groups.
- Another important question has been "What is the effect of class size on students' achievement?" What is the identification problem?
- How would you solve this experimentally?
How would you do it non-experimentally?

---

##Internal Validity vs. External Validity

- Experiments have strong *internal validity*, yet they are often impractical.
- They tend to involve high costs, long durations, and low *external validity* (limited generalizability).
- There are other problems such as the "Hawthorne effect".

<img src="data:image/png;base64,#howth.png" width="40%" style="display: block; margin: auto;" />

- Nevertheless, a notional randomized trial is our benchmark. Not all researchers share this view, but many do.

---

## Distribution of an RCT

- Tutors in secondary school

<img src="data:image/png;base64,#tutores.jpg" width="80%" style="display: block; margin: auto;" />

---

#Distribution of an RCT

- Tutors in secondary school

<img src="data:image/png;base64,#notas1.png" width="100%" style="display: block; margin: auto;" />

---

#Distribution of an RCT

- Tutors in secondary school

<img src="data:image/png;base64,#notas2.png" width="70%" style="display: block; margin: auto;" />

---

#Distribution of an RCT (with small n)

- Tutors in secondary school

<img src="data:image/png;base64,#notas3.png" width="70%" style="display: block; margin: auto;" />

---

#Overview

- Questions about questions
- Selection Bias
- Random Assignment into treatment
- **Regression analysis of experiments**
- Bad Controls

---

##Regression Analysis of Experiments

- Suppose that the treatment effect is the same for everyone: `\(Y_{1i} - Y_{0i} = \rho\)`

<img src="data:image/png;base64,#reg1.png" width="90%" style="display: block; margin: auto;" />

- Where `\(\eta_i\)` is the random part of `\(Y_{0i}\)`.

`$$E[Y_{i}|D_i =1] = \alpha + \rho + E[\eta_i|D_i=1]$$`

`$$E[Y_{i}|D_i =0] = \alpha + E[\eta_i|D_i=0]$$`

hence:

`$$E[Y_{i}|D_i =1]-E[Y_{i}|D_i =0] = \rho + E[\eta_i|D_i=1] - E[\eta_i|D_i=0]$$`

---

##Regression Analysis of Experiments

- Where: `\(E[\eta_i|D_i=1] - E[\eta_i|D_i=0]\)` is the selection bias. It reflects the correlation between the regression error `\(\eta_i\)` and the regressor `\(D_i\)`.
- If `\(D_i\)` is randomly assigned, the selection term disappears and `\(\rho\)` is the causal effect. We can also include covariates `\(X_i\)`:

`$$Y_i = \alpha + \rho D_i + X'_i\sigma + \eta_i$$`

- Adding covariates should not change `\(\rho\)`, as these are "balanced" between treated and untreated `\(i\)`.
- But they reduce the residual variance, which makes the estimate of `\(\rho\)` more precise.

---

##The STAR Experiment

<img src="data:image/png;base64,#star.png" width="85%" style="display: block; margin: auto;" />

---

##The STAR Experiment

<img src="data:image/png;base64,#star1.png" width="75%" style="display: block; margin: auto;" />

---

##Regression and Causality

- When can we think of a regression coefficient as approximating the causal effect that might be revealed in an experiment?
- A regression is causal when the conditional expectation function (CEF) it approximates is causal.

<img src="data:image/png;base64,#cef.png" width="55%" style="display: block; margin: auto;" />

- The CEF is causal when it describes differences in average potential outcomes for a fixed reference population.

---

##Regression and Causality

- The causal connection between schooling and earnings can be defined as the functional relationship that describes what a given individual `\(i\)` would earn if she obtained different levels of education.
- In empirical work, the causal relationship between schooling and earnings tells us what people would earn *on average* if we could change their schooling while holding everything else fixed.
- **Or change their schooling randomly so that those with different levels of schooling would be comparable**.
- This leads to the *conditional independence assumption (CIA)*, which provides the justification for the causal interpretation of regression.
- This assumption is sometimes called *selection-on-observables* because the covariates to be held fixed are observed.

---

##Regression and Causality

- The causal relationship between college attendance and future earnings can be described using the potential-outcomes notation.
- In this case, `\(Y_{0i}\)` is earnings with no college, while `\(Y_{1i}\)` is earnings with college.

<img src="data:image/png;base64,#potentialcollege.png" width="80%" style="display: block; margin: auto;" />

---

## Regression and Causality

- We get to see one of `\(Y_{0i}\)` or `\(Y_{1i}\)`, but never both.
- We therefore hope to measure the average of `\(Y_{1i} - Y_{0i}\)`, yet the naive comparison is biased.
- It seems likely that those who go to college would have earned more anyway.
- If so, selection bias is positive, and the naive comparison exaggerates the "college premium".

---

##Regression and Causality

- The **CIA** states that: `\([Y_{1i} , Y_{0i}] \perp C_i | X_i\)`
- Or: `\(E[Y_i | X_i, C_i = 1] - E[Y_i|X_i, C_i=0] = E[Y_{1i}-Y_{0i}|X_i]\)`
- That is, the potential outcomes (wages) are independent of going to college once we control for `\(X_i\)`.
- This `\(X_i\)` is "the door" that fully explains why someone "is treated". This is the basis of some estimators such as Propensity Score Matching (PSM) or Heckman Probit.
- Such estimators are no longer used (in modern econometrics). **CIA is highly implausible**.

---

#Overview

- Questions about questions
- Selection Bias
- Random Assignment into treatment
- Regression analysis of experiments
- **Bad Controls**

---

##Bad Controls: More is better?

- Controlling for covariates can make the CIA more plausible. But more is not always better.
- Bad controls are variables that are themselves *outcome variables* in the notional experiment at hand.
- Good controls are variables that we can think of as having been fixed at the time the regressor of interest was determined.
- The essence of the bad control problem is a version of **selection bias**.

---

##Bad Controls

- For example, a college degree opens the door to higher-paying white-collar jobs.
- Should occupation therefore be seen as an omitted variable in a regression of wages on schooling?
- If college affects occupation, comparisons of wages by college degree status within an occupation are no longer apples-to-apples.
- ***This holds even if college degree completion is randomly assigned***.

---

## Bad Controls Formally

- Let `\(W_i\)` indicate white-collar workers and `\(Y_i\)` denote the wage (outcome). Assume college `\(C_i\)` is randomly assigned.

<img src="data:image/png;base64,#bad1.png" width="70%" style="display: block; margin: auto;" />

- We might estimate these average treatment effects by simply regressing `\(Y_i\)` and `\(W_i\)` on `\(C_i\)`.
- Bad control: a comparison of earnings conditional on `\(W_i\)` does not have a causal interpretation.
- Consider the difference in mean earnings between college graduates and others conditional on working at a white-collar job...

---

## Bad Controls Formally

- The estimand for people with a white-collar job is the difference in means with `\(C_i\)` switched on and off, conditional on `\(W_i = 1\)`:

<img src="data:image/png;base64,#bad2.png" width="120%" style="display: block; margin: auto;" />

By the joint independence of `\(\{Y_{1i}, W_{1i}, Y_{0i}, W_{0i}\}\)` and `\(C_i\)` we have:

<img src="data:image/png;base64,#bad3.png" width="120%" style="display: block; margin: auto;" />

- This expression denotes an apples-to-oranges comparison:

<img src="data:image/png;base64,#bad4.png" width="100%" style="display: block; margin: auto;" />

- Someone who gets a white-collar job without the benefit of a college degree (i.e., `\(W_{0i} = 1\)`) is probably special, i.e., has a better than average `\(Y_{0i}\)`.

---

## Bad Controls

- In Column (5) we no longer know if the reduction in the coefficient is due to omitted variable bias or due to (self-)selection into occupations affected by college.

<img src="data:image/png;base64,#returns.png" width="80%" style="display: block; margin: auto;" />

---

## Bad Controls: Proxy Variable.

- A second version of the bad control scenario involves proxy controls that are affected by the variable of interest.
- It is OK to control for IQ in a regression of wages on education if IQ is measured before schooling. Otherwise:

<img src="data:image/png;base64,#badproxy.png" width="60%" style="display: block; margin: auto;" />

- Where `\(a_{i}\)` is innate ability and `\(a_{li}\)` is ability measured later, after school. So when substituting `\(a_{i}\)` with `\(a_{li}\)` we attenuate the effect `\(\rho\)`, unless `\(\pi_1 = 0\)`.
- Clear reasoning about causal channels requires explicit assumptions about what happened first, or the assertion that none of the control variables are themselves caused by the regressor of interest.

---

class: center, middle

#THE END
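---

## Bad Controls: a simulation sketch

The white-collar example can be illustrated with a small simulation. This is a hedged sketch, not the analysis behind the slides: the function `bad_control_demo`, the unobserved `ability` variable, the job threshold, and the college effect of 1.0 are hypothetical. Wages are assumed not to depend on occupation directly, so any gap between the two comparisons comes from selection into white-collar jobs alone.

```python
import random
from statistics import mean


def bad_control_demo(n=50_000, effect=1.0, seed=1):
    """Return (difference in mean wages by college, same difference within white-collar jobs).

    All parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)                    # unobserved by the researcher
        college = rng.random() < 0.5                     # C_i: randomly assigned
        boost = 1.0 if college else 0.0
        white_collar = ability + boost > 0.5             # W_i: an outcome of college
        wage = effect * boost + ability + rng.gauss(0.0, 1.0)  # Y_i
        rows.append((college, white_collar, wage))

    def diff_in_means(sample):
        with_college = [y for c, _, y in sample if c]
        without_college = [y for c, _, y in sample if not c]
        return mean(with_college) - mean(without_college)

    unconditional = diff_in_means(rows)
    within_white_collar = diff_in_means([r for r in rows if r[1]])
    return unconditional, within_white_collar


unconditional, within_white_collar = bad_control_demo()
print(f"college gap, everyone:          {unconditional:.2f}")
print(f"college gap, white-collar only: {within_white_collar:.2f}")
```

Under these assumptions the unconditional gap should sit near the assumed effect, while the white-collar-only gap should be noticeably smaller: people who reach a white-collar job without college need higher ability, so conditioning on `\(W_i = 1\)` compares unlike groups even though college was randomized.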