class: center, middle

# "Identification and Experimentation"
### Econometrics 1
### Dr. Francisco Cabrera
#### Centro de Investigación y Docencia Económicas A.C. (CIDE)
November, 2023

---
#Overview

- **Questions about questions**
- Selection Bias
- Random Assignment into treatment
- Regression analysis of experiments
- Bad Controls

---
## What is the causal relationship of interest?

- A coherent, interesting, and doable research agenda is the solid foundation on which useful statistical analyses are built.
- In the beginning, we should ask: **What is the causal relationship of interest?**
- Descriptive research has a place in the world, but the most interesting questions are causal.
- A causal relationship tells us what would happen in alternative (or "counterfactual") worlds.

---
## Experiments

- The description of an ideal experiment helps you formulate causal questions precisely.
- The mechanics of an ideal experiment highlight the forces you'd like to manipulate: "Find your source of variation".
- In the case of schooling and wages, for example, we can imagine offering potential dropouts a reward for finishing school.
- In the case of political institutions, we might like to go back in time and randomly assign different government structures to former colonies!
- Most experiments are hypothetical, but they are worth contemplating because they help us pick fruitful research topics.

---
## Fundamentally Unidentified Questions (FUQ'd)

- Questions about the causal effect of race or gender seem like good candidates for FUQs, because these attributes are hard to manipulate in isolation.
- "Imagine your chromosomes were switched at birth": this is not possible, but some studies use fake job applicants and resumes.

<img src="data:image/png;base64,#cvs.jpg" width="40%" style="display: block; margin: auto;" />

---
## Really FUQ'd!

- Do children do better in primary school if they start school earlier?
- A 7-year-old's brain is better prepared, so if some children start at age 6 and others at age 7, we cannot compare them!
- If we wait until both are 7, some will already be in second grade.
- The effect of start age on elementary school test scores is FUQ'd.

---
## What is your identification strategy?

- An identification strategy is the manner in which a researcher uses observational data (i.e., data not generated by a randomized trial) to approximate a real experiment.
- Quarter of birth is related to years of education! Angrist & Krueger (1991)
- Credible identification strategies are emblematic of modern empirical work.

<img src="data:image/png;base64,#variation.png" width="50%" style="display: block; margin: auto;" />

---
## What is your mode of statistical inference?

- The population to be studied,
- The sample to be used,
- The assumptions made when constructing standard errors (clustered or grouped).

---
class: center, middle

#The experimental ideal

<img src="data:image/png;base64,#earth.png" width="80%" style="display: block; margin: auto;" />

---
#Overview

- Questions about questions
- **Selection Bias**
- Random Assignment into treatment
- Regression analysis of experiments
- Bad Controls

---
## "Selection Bias"

- The most credible and influential research designs use random assignment.
- E.g., the Perry Project in 1962, which led to the Head Start Program.
- Why are they so powerful? Imagine we want to know whether hospital visits in the last year make people healthier.

`$$Health_{it} = \alpha + D_1 Hospital_{it} + ... + U_{it}$$`

What is the problem?

<img src="data:image/png;base64,#health.png" width="100%" style="display: block; margin: auto;" />

- Taken at face value, this comparison suggests that going to the hospital makes people sicker (**self-selection problem**).

---
## "Formally"

- Hospital: `\(D_i \in \{0,1\}\)`
- Potential outcome:

`$$Y_{1i} \text{ if } D_i = 1$$`
`$$Y_{0i} \text{ if } D_i = 0$$`

<img src="data:image/png;base64,#ATE1.png" width="40%" style="display: block; margin: auto;" />

- `\(Y_{1i} - Y_{0i}\)` is the causal effect of hospitalization for an individual.
- But we never see both potential outcomes for any one person.

---
##Naive Comparison

A naive comparison of averages by hospitalization:

<img src="data:image/png;base64,#ATE2.png" width="100%" style="display: block; margin: auto;" />

Where:

`$$E[Y_{1i}|D_i=1]-E[Y_{0i}|D_i=1]=E[Y_{1i}-Y_{0i}|D_i = 1]$$`

is the average causal effect of hospitalization on those who were hospitalized: the gap between their average health, `\(E[Y_{1i}|D_i=1]\)`, and what it would have been had they NOT been hospitalized, `\(E[Y_{0i}|D_i=1]\)`.

- *Selection bias* is the difference in average `\(Y_{0i}\)` between those who were and were NOT hospitalized.
- Those who were hospitalized have worse `\(Y_{0i}\)`, making selection bias negative.
- The selection bias may be so large (in absolute value) that it completely masks a positive treatment effect.

---
#Overview

- Questions about questions
- Selection Bias
- **Random Assignment into treatment**
- Regression analysis of experiments
- Bad Controls

---
##St. Random Assignment into treatment

- Random assignment makes `\(D_i\)` independent of potential outcomes:

<img src="data:image/png;base64,#ATE3.png" width="70%" style="display: block; margin: auto;" />

- Where the independence of `\(Y_{0i}\)` and `\(D_i\)` allows us to swap `\(E[Y_{0i} | D_i = 1]\)` for `\(E[Y_{0i}|D_i = 0]\)` in the second line.

This simplifies to:

<img src="data:image/png;base64,#ATE4.png" width="70%" style="display: block; margin: auto;" />

This is the **Average Treatment Effect of hospitalization on randomly chosen patients**.

---
# More on selection bias

- An iconic example from labor economics is the evaluation of government-subsidized training programs.
- The idea is to increase employment and earnings. Yet **non-experimental** evidence shows that the trainees earn less than plausible comparison groups.
- Why is that?
- Another important question has been "What is the effect of class size on students' achievement?" What is the identification problem?
- How would you solve this experimentally? How would you do it non-experimentally?

---
##Internal Validity vs.
External Validity

- Experiments have strong *internal validity*, yet they are often impractical.
- They involve high costs, long durations, and low *external validity* (limited generalizability).
- There are other problems, such as the "Hawthorne effect".

<img src="data:image/png;base64,#howth.png" width="40%" style="display: block; margin: auto;" />

- Nevertheless, a notional randomized trial is our benchmark. Not all researchers share this view, but many do.

---
#Distribution of an RCT

- Tutors in secondary school

<img src="data:image/png;base64,#tutores.jpg" width="80%" style="display: block; margin: auto;" />

---
#Distribution of an RCT

- Tutors in secondary school

<img src="data:image/png;base64,#notas1.png" width="100%" style="display: block; margin: auto;" />

---
#Distribution of an RCT

- Tutors in secondary school

<img src="data:image/png;base64,#notas2.png" width="70%" style="display: block; margin: auto;" />

---
#Distribution of an RCT (with small n)

- Tutors in secondary school

<img src="data:image/png;base64,#notas3.png" width="70%" style="display: block; margin: auto;" />

---
#Overview

- Questions about questions
- Selection Bias
- Random Assignment into treatment
- **Regression analysis of experiments**
- Bad Controls

---
##Regression Analysis of Experiments

- Suppose that the treatment effect is the same for everyone: `\(Y_{1i} - Y_{0i} = \rho\)`

<img src="data:image/png;base64,#reg1.png" width="80%" style="display: block; margin: auto;" />

- Where `\(\eta_i\)` is the random part of `\(Y_{0i}\)`

`$$E[Y_{i}|D_i =1] = \alpha + \rho + E[\eta_i|D_i=1]$$`
`$$E[Y_{i}|D_i =0] = \alpha + E[\eta_i|D_i=0]$$`

hence:

`$$E[Y_{i}|D_i =1]-E[Y_{i}|D_i =0] = \rho + E[\eta_i|D_i=1] - E[\eta_i|D_i=0]$$`

---
##Regression Analysis of Experiments

- Where `\(E[\eta_i|D_i=1] - E[\eta_i|D_i=0]\)` is the selection bias. It reflects the correlation between the regression error, `\(\eta_i\)`, and the regressor, `\(D_i\)`.
- If `\(D_i\)` is randomly assigned, the selection term disappears and `\(\rho\)` is the causal effect.
We can also add covariates `\(X_i\)`:

`$$Y_i = \alpha + \rho D_i + X'_i\sigma + \eta_i$$`

- Including covariates should not change `\(\rho\)`, as these are "balanced" between treated and untreated `\(i\)`.
- But they reduce the residual variance, which yields more precise estimates of `\(\rho\)`.

---
##The STAR Experiment

<img src="data:image/png;base64,#star.png" width="85%" style="display: block; margin: auto;" />

---
##The STAR Experiment

<img src="data:image/png;base64,#star1.png" width="75%" style="display: block; margin: auto;" />

---
##Regression and Causality

- When can we think of a regression coefficient as approximating the causal effect that might be revealed in an experiment?
- A regression is causal when the conditional expectation function (CEF) it approximates is causal.

<img src="data:image/png;base64,#cef.png" width="55%" style="display: block; margin: auto;" />

- The CEF is causal when it describes differences in average potential outcomes for a fixed reference population.

---
##Regression and Causality

- The causal connection between schooling and earnings can be defined as the functional relationship that describes what a given individual `\(i\)` would earn if he or she obtained different levels of education.
- In empirical work, the causal relationship between schooling and earnings tells us what people would earn *on average* if we could either change their schooling in a perfectly controlled environment,
- **Or change their schooling randomly so that those with different levels of schooling would be comparable**.
- This leads to the *conditional independence assumption (CIA)*, a core assumption that provides the justification for the causal interpretation of regression.
- This assumption is sometimes called *selection-on-observables* because the covariates to be held fixed are observed.
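---
##Regression and Causality: a simulated sketch

The role of selection bias, random assignment, and the CIA can be illustrated with simulated data. The sketch below is not part of the original lecture material: the data-generating process, the parameter values (`rho = 2.0`, the logistic selection rule, the coefficient on `x`), and the helper names `simulate`, `naive_diff`, and `ols_rho` are all assumptions chosen for illustration. It compares a naive difference in means with a regression that controls for the observed covariate, and with a difference in means under random assignment.

```python
import numpy as np


def simulate(n, rho, randomized, rng):
    """Draw (y, d, x) with a constant treatment effect rho (hypothetical DGP).

    x is an observed covariate that raises the untreated outcome y0.
    If randomized is False, units with high x are more likely to be treated,
    so d is independent of potential outcomes only after conditioning on x.
    """
    x = rng.normal(size=n)
    eta = rng.normal(size=n)              # random part of y0, unrelated to d given x
    y0 = 1.0 + 1.5 * x + eta              # potential outcome without treatment
    y1 = y0 + rho                         # potential outcome with treatment
    if randomized:
        p = np.full(n, 0.5)               # coin-flip assignment
    else:
        p = 1.0 / (1.0 + np.exp(-2.0 * x))  # selection on the observable x
    d = (rng.uniform(size=n) < p).astype(float)
    y = y0 + d * (y1 - y0)                # only one potential outcome is observed
    return y, d, x


def naive_diff(y, d):
    """Difference in mean outcomes between treated and untreated units."""
    return y[d == 1].mean() - y[d == 0].mean()


def ols_rho(y, d, x):
    """Coefficient on d from an OLS regression of y on a constant, d, and x."""
    design = np.column_stack([np.ones_like(d), d, x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]


rng = np.random.default_rng(0)
rho = 2.0

# Observational data: treatment depends on x, so the naive comparison is biased.
y, d, x = simulate(100_000, rho, randomized=False, rng=rng)
naive_obs = naive_diff(y, d)
ols_obs = ols_rho(y, d, x)

# Experimental data: random assignment removes the selection term.
y, d, x = simulate(100_000, rho, randomized=True, rng=rng)
naive_rct = naive_diff(y, d)

print(f"true rho:                     {rho:.3f}")
print(f"naive difference (selection): {naive_obs:.3f}")
print(f"OLS controlling for x:        {ols_obs:.3f}")
print(f"naive difference (RCT):       {naive_rct:.3f}")
```

In this assumed setup the selection bias is positive, because units with high `x` both select into treatment and have higher untreated outcomes. Controlling for `x` works here only because `x` is the sole driver of selection and enters linearly, which is exactly what the CIA plus a correct functional form would require.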
---
##Regression and Causality

- The causal relationship between college attendance and a future outcome like earnings can be described using the potential-outcomes notation.
- In this case, `\(Y_{0i}\)` is individual `\(i\)`'s earnings without college, while `\(Y_{1i}\)` is her earnings if she goes.

<img src="data:image/png;base64,#potentialcollege.png" width="70%" style="display: block; margin: auto;" />

- We get to see one of `\(Y_{0i}\)` or `\(Y_{1i}\)`, but never both.
- We therefore hope to measure the average of `\(Y_{1i} - Y_{0i}\)`, or the average for some group, such as those who went to college. This is `\(E[Y_{1i} - Y_{0i}| C_i = 1]\)`, yet the naive comparison of earnings by college status is biased.

---
##Regression and Causality

- It seems likely that those who go to college would have earned more anyway. If so, selection bias is positive, and the naive comparison exaggerates the "college premium".
- The CIA states that: `\(\{Y_{0i}, Y_{1i}\} \perp C_i \mid X_i\)`
- Which implies: `\(E[Y_i | X_i, C_i = 1] - E[Y_i|X_i, C_i=0] = E[Y_{1i}-Y_{0i}|X_i]\)`
- In other words, once we control for `\(X_i\)`, potential outcomes (wages) are independent of whether a person went to college.

---
#Overview

- Questions about questions
- Selection Bias
- Random Assignment into treatment
- Regression analysis of experiments
- **Bad Controls**

---
##Bad Controls: More is better?

- Controlling for covariates can make the CIA more plausible, but more is not always better.
- Bad controls are variables that are themselves *outcome variables* in the notional experiment at hand.
- Good controls are variables that we can think of as having been fixed at the time the regressor of interest was determined.
- The essence of the bad control problem is a version of selection bias.
- A college degree opens the door to higher-paying white-collar jobs. Should occupation therefore be seen as an omitted variable in a regression of wages on schooling?
- If college affects occupation, comparisons of wages by college degree status within an occupation are no longer apples-to-apples, ***even if college degree completion is randomly assigned***.

---
## Bad Controls Formally

- Let `\(W_i\)` be a dummy for white-collar workers and `\(Y_i\)` the wage (outcome). Assume college `\(C_i\)` is randomly assigned.

<img src="data:image/png;base64,#bad1.png" width="70%" style="display: block; margin: auto;" />

- We might estimate the average treatment effects of college on each outcome by simply regressing `\(Y_i\)` and `\(W_i\)` on `\(C_i\)`.
- Bad control: a comparison of earnings conditional on `\(W_i\)` does not have a causal interpretation.
- Consider the difference in mean earnings between college graduates and others conditional on working at a white-collar job...

---
## Bad Controls Formally

- The estimand for people with white-collar jobs is the difference in means with `\(C_i\)` switched on and off, conditional on `\(W_i = 1\)`:

<img src="data:image/png;base64,#bad2.png" width="100%" style="display: block; margin: auto;" />

By the joint independence of `\(\{Y_{1i}, W_{1i}, Y_{0i}, W_{0i}\}\)` and `\(C_i\)` we have:

<img src="data:image/png;base64,#bad3.png" width="100%" style="display: block; margin: auto;" />

- This expression denotes an apples-to-oranges comparison:

<img src="data:image/png;base64,#bad4.png" width="80%" style="display: block; margin: auto;" />

- Someone who gets a white-collar job without the benefit of a college degree (i.e., `\(W_{0i} = 1\)`) is probably special, i.e., has a better than average `\(Y_{0i}\)`.

---
## Bad Controls

- In Column (5) we no longer know whether the reduction in the coefficient is due to omitted variable bias or due to (self-)selection into occupations affected by college.

<img src="data:image/png;base64,#returns.png" width="80%" style="display: block; margin: auto;" />

---
## Bad Controls: Proxy Variable.

- A second version of the bad control scenario involves proxy control, that is, the inclusion of variables that might partially control for omitted factors, but are themselves affected by the variable of interest.
- It is OK to control for IQ in a regression of wages on education if IQ is measured before schooling. Otherwise:

<img src="data:image/png;base64,#badproxy.png" width="60%" style="display: block; margin: auto;" />

- Where `\(a_{i}\)` is innate ability and `\(a_{li}\)` is ability measured later, after school. So when substituting `\(a_{li}\)` for `\(a_{i}\)` we attenuate the estimated effect `\(\rho\)`, unless `\(\pi_1 = 0\)`.
- Clear reasoning about causal channels requires explicit assumptions about what happened first, or the assertion that none of the control variables are themselves caused by the regressor of interest.

---
class: center, middle

#THE END