Karim Naguib (Boston University)
10/27/2013
Two components of internal validity:
1. The estimator of the causal effect should be unbiased and consistent.
2. Hypothesis tests should have the desired significance level, and confidence intervals should have the desired confidence level.
If the omitted variable is observed or there are adequate control variables, we can simply add them to the regression. Steps to decide on whether to include a variable or not
Suppose there was a mix-up and we ended up using test score data from 10th graders instead of the 5th graders we are interested in. While the two sets of scores might be correlated, they are not the same, and this leads to errors-in-variables bias.
Suppose we are interested in regressing \( Y_i \) on the variable \( X_i \) (e.g. actual earnings), but we only observe an imprecisely measured \( \tilde{X}_i \) (e.g. reported earnings). The population regression model is \[ Y_i = \beta_0 + \beta_1 X_i + u_i \]
But with the imprecisely measured \( \tilde{X}_i \) we have \[ \begin{align*} Y_i &= \beta_0 + \beta_1 \tilde{X}_i + [\beta_1(X_i - \tilde{X}_i) + u_i] \\ &= \beta_0 + \beta_1 \tilde{X}_i + v_i \end{align*} \]
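If \( \tilde{X}_i \) is correlated with \( (X_i - \tilde{X}_i) \), then the regressor is correlated with the error term \( v_i \), and \( \hat{\beta}_1 \) will be biased and inconsistent, just as with omitted variable bias (OVB).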
Under classical measurement error, \[ \tilde{X}_i = X_i + w_i \]
where \( w_i \) is a purely random error with mean zero and variance \( \sigma_w^2 \), and \[ Corr(w_i, X_i) = 0, \quad Corr(w_i, u_i) = 0. \] In this case, \( \hat{\beta}_1 \) has the probability limit \[ \hat{\beta}_1 \overset{p}{\longrightarrow} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\beta_1 \]
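A short sketch of where this limit comes from, under the additional assumption (standard, but not stated above) that \( X_i \) is uncorrelated with \( u_i \). Substituting \( \tilde{X}_i = X_i + w_i \) into the error term gives \( v_i = \beta_1(X_i - \tilde{X}_i) + u_i = -\beta_1 w_i + u_i \), so \[ \begin{align*} Cov(\tilde{X}_i, v_i) &= Cov(X_i + w_i, \; -\beta_1 w_i + u_i) = -\beta_1 \sigma_w^2 \\ Var(\tilde{X}_i) &= \sigma_X^2 + \sigma_w^2 \\ \hat{\beta}_1 &\overset{p}{\longrightarrow} \beta_1 + \frac{Cov(\tilde{X}_i, v_i)}{Var(\tilde{X}_i)} = \beta_1 - \frac{\sigma_w^2}{\sigma_X^2 + \sigma_w^2}\beta_1 = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\beta_1 \end{align*} \]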
Because \( \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2} \) is less than 1 whenever \( \sigma_w^2 > 0 \), \( \hat{\beta}_1 \) will be biased towards zero, even in large samples (attenuation bias).
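The shrinkage towards zero can be checked with a small simulation. The sketch below is illustrative only: the parameter values (\( \beta_1 = 2 \), \( \sigma_X = 1 \), \( \sigma_w = 0.5 \)), the normal distributions, and the helper `ols_slope` are assumptions made for the example, not part of the original notes.

```python
import numpy as np


def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    x_c = x - x.mean()
    return float(np.dot(x_c, y - y.mean()) / np.dot(x_c, x_c))


rng = np.random.default_rng(0)
n = 200_000
beta0, beta1 = 1.0, 2.0        # assumed population coefficients
sigma_x, sigma_w = 1.0, 0.5    # assumed std. dev. of X and of the measurement error

x = rng.normal(0.0, sigma_x, n)   # true regressor
u = rng.normal(0.0, 1.0, n)       # regression error
w = rng.normal(0.0, sigma_w, n)   # classical measurement error, independent of x and u
y = beta0 + beta1 * x + u
x_tilde = x + w                   # what we actually observe

slope_true = ols_slope(x, y)          # uses the correctly measured X
slope_noisy = ols_slope(x_tilde, y)   # uses the mismeasured X
predicted = sigma_x**2 / (sigma_x**2 + sigma_w**2) * beta1  # probability limit from the formula

print(f"slope with true X:      {slope_true:.3f}")
print(f"slope with noisy X:     {slope_noisy:.3f}")
print(f"limit predicted by formula: {predicted:.3f}")
```

With these assumed values the attenuation factor is \( 1/1.25 = 0.8 \), so the slope on the mismeasured regressor should settle near 1.6 rather than 2.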
Suppose \( Y \) suffers from classical measurement error \[ \tilde{Y}_i = Y_i + w_i \]
The regression model would be \[ \tilde{Y}_i = \beta_0 + \beta_1 X_i + v_i \] where \( v_i = u_i + w_i \).
If \( w_i \) is truly random, then \( E[w_i|X_i] = 0 \) and therefore \( E[v_i|X_i] = 0 \), so \( \hat{\beta}_1 \) is unbiased.
But because \( Var(v_i) > Var(u_i) \), our estimates will be less precise.
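A small Monte Carlo sketch can illustrate both points: the slope stays centered on \( \beta_1 \), but its sampling spread grows. The sample size, number of replications, error variances, and the helper `ols_slope` are all assumptions chosen for illustration.

```python
import numpy as np


def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    x_c = x - x.mean()
    return float(np.dot(x_c, y - y.mean()) / np.dot(x_c, x_c))


rng = np.random.default_rng(1)
n, reps = 200, 2000            # assumed sample size and number of simulated samples
beta0, beta1 = 1.0, 2.0        # assumed population coefficients
sigma_w = 2.0                  # assumed std. dev. of the measurement error in Y

slopes_clean = np.empty(reps)
slopes_noisy = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    u = rng.normal(size=n)
    w = rng.normal(0.0, sigma_w, n)   # measurement error, independent of x and u
    y = beta0 + beta1 * x + u
    y_tilde = y + w                   # mismeasured dependent variable
    slopes_clean[r] = ols_slope(x, y)
    slopes_noisy[r] = ols_slope(x, y_tilde)

mean_clean, sd_clean = slopes_clean.mean(), slopes_clean.std(ddof=1)
mean_noisy, sd_noisy = slopes_noisy.mean(), slopes_noisy.std(ddof=1)

print(f"Y measured exactly:  mean slope {mean_clean:.3f}, std. dev. {sd_clean:.3f}")
print(f"Y measured with error: mean slope {mean_noisy:.3f}, std. dev. {sd_noisy:.3f}")
```

Both averages should sit close to the assumed \( \beta_1 = 2 \), while the standard deviation of the slope is noticeably larger when \( Y \) is measured with error.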
Types of missing data
If the data is missing for reasons unrelated to \( X \) and \( Y \), this simply reduces our sample size but does not introduce bias or inconsistency.
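As a rough check, the sketch below drops half of a simulated sample completely at random and re-estimates the slope. For contrast it also drops observations based on the value of \( Y \), a case that goes beyond the statement above and is included only as an assumed illustration of when missing data does matter. All parameter values and the helper `ols_slope` are hypothetical.

```python
import numpy as np


def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    x_c = x - x.mean()
    return float(np.dot(x_c, y - y.mean()) / np.dot(x_c, x_c))


rng = np.random.default_rng(2)
n = 200_000
beta0, beta1 = 1.0, 2.0        # assumed population coefficients

x = rng.normal(size=n)
u = rng.normal(size=n)
y = beta0 + beta1 * x + u

# Case 1: observations go missing for reasons unrelated to X and Y.
keep_random = rng.random(n) < 0.5
slope_random = ols_slope(x[keep_random], y[keep_random])

# Case 2 (assumed contrast): an observation is kept only if Y is above its mean.
keep_selected = y > beta0
slope_selected = ols_slope(x[keep_selected], y[keep_selected])

print(f"full sample:            {ols_slope(x, y):.3f}")
print(f"missing at random:      {slope_random:.3f} (n = {keep_random.sum()})")
print(f"missing based on Y:     {slope_selected:.3f} (n = {keep_selected.sum()})")
```

In the random case the slope should stay near the assumed value of 2 with a smaller sample; in the selected case it moves well away from 2.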
Even if our estimators are unbiased and consistent, inconsistent standard errors make reliable inference difficult. We will not be able to conduct hypothesis tests at the desired significance level, and our confidence intervals will not have the correct confidence level. There are two main causes for this:
What we found in the California study
Threats to consider
Different nonlinear functional forms were considered but none were found to be significant. Further forms could be considered but the analysis so far suggests that this regression is not sensitive to nonlinear specifications.
The surveys' data covers all public schools in their respective states, so it is unlikely that there is a problem with selection.
We would observe this problem if there were some reverse causality between test scores and \( STR \), possibly due to some policy. No such policy was in place at the time of the studies (some court cases in California have since led to such a mechanism).