class: center, middle

# Applied topics: Measurement error and Missing data

### Dr. Francisco J. Cabrera-Hernández

#### Econometrics

#### Master's in Economics, Spring 2025

##### CIDE Santa Fe, Ciudad de México.

---

## Introduction

We now investigate common empirical issues.

- Proxy variables
- Measurement error
- Missing data

---

## Proxy Variables

Used to stand in for important unobserved explanatory variables, for example:

`$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 ability + e$$`

General approach:

`$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3^* + e$$`

Where `\(X_3^*\)` is unobserved.

`$$X_3^* = \delta_0 + \delta_3 X_3 + e_3$$`

This is the (unobserved) relationship of `\(X_3^*\)` with its proxy `\(X_3\)`. Hence, `\(\delta_0 +\delta_3X_3\)` is what we include in our estimations.

---

## Proxy Variables

For this to work, the proxy must not belong to the population regression: `\(E[X_3e]=0\)`

Once we condition on the proxy, the other explanatory variables must not help predict the omitted variable:

`$$E[X_3^*|X_1, X_2, X_3] = E[X_3^*| X_3] = \delta_0 + \delta_3 X_3$$`

Under these assumptions:

`$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3^* + e$$`

`$$X_3^* = \delta_0 + \delta_3 X_3 + e_3$$`

`$$Y = (\beta_0 + \beta_3\delta_0) + \beta_1 X_1 + \beta_2 X_2 + (\beta_3\delta_3)X_3 + (e+\beta_3e_3)$$`

**The coefficient on the proxy variable is a multiple of the coefficient on the omitted variable.**

---

## Proxy Variables

What if we want to use IQ to proxy "skill" in a Mincer equation?

`$$wage = \hat\beta_0 + \hat\beta_1 education + \hat\beta_2 experience + \hat\beta_3 IQ +\mu$$`

Assumption 1: *should hold*, as the IQ score is not a direct determinant of wage. What matters is how well the person can convert it into a higher wage at work.

Assumption 2: **most of the variation in ability should be explained by variation in IQ, leaving only a small remainder to education and experience.**

While IQ only imperfectly absorbs the variation in ability, including it at least reduces the bias in the estimated return to education.

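---

## Proxy Variables

A small simulation can make the proxy result concrete. The sketch below is only illustrative: the parameter values, the variable names, and the simplification to a single observed regressor `x1` (dropping `\(X_2\)`) are assumptions, not part of the lecture. It builds data that satisfy both proxy assumptions and compares the regression that omits `\(X_3^*\)` with the one that includes the proxy `\(X_3\)`.

```python
import random

def cov(a, b):
    """Sample covariance (divides by n; slope ratios are unaffected)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def slope_one(x, y):
    """OLS slope of y on a single regressor x (intercept included)."""
    return cov(x, y) / cov(x, x)

def slopes_two(x1, x2, y):
    """OLS slopes of y on two regressors x1 and x2 (intercept included)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

random.seed(0)
n = 20_000
beta1, beta3, delta3 = 1.0, 2.0, 0.7  # assumed values, for illustration only

# Proxy X3 (think of IQ).
x3 = [random.gauss(0, 1) for _ in range(n)]
# Unobserved X3* = delta3 * X3 + e3, with e3 unrelated to X1 and X3.
x3_star = [delta3 * v + random.gauss(0, 0.5) for v in x3]
# X1 is related to X3* only through the proxy X3.
x1 = [0.8 * v + random.gauss(0, 1) for v in x3]
# Outcome generated from the true model (intercept set to zero).
y = [beta1 * a + beta3 * s + random.gauss(0, 1) for a, s in zip(x1, x3_star)]

b1_short = slope_one(x1, y)                  # omits X3*: biased upward here
b1_proxy, b3_proxy = slopes_two(x1, x3, y)   # includes the proxy X3

print(f"omitting X3*: b1 = {b1_short:.3f}")
print(f"with proxy:   b1 = {b1_proxy:.3f}, coefficient on X3 = {b3_proxy:.3f}")
```

In this setup the coefficient on `x1` should move close to `beta1` once the proxy is included, and the coefficient on the proxy should be close to `beta3 * delta3` rather than `beta3`.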
---

## Proxy Variables

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#iqbias.png" alt=" " width="75%" />
<p class="caption"> </p>
</div>

---

## Proxy Variables

Using lagged dependent variables as proxy variables.

Crime rates example at the city level:

`$$Crime = \beta_0 + \beta_1 unemployment + \beta_2 expenditure + \beta_3 crime_{t-1} + e$$`

Including past crime controls for omitted factors that determine the crime rate each year.

It compares cities with the same crime rate last year.

**Not the best practice, as there could be reversion to the mean.**

---

## Proxy Variables

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#crimebias.png" alt=" " width="75%" />
<p class="caption"> </p>
</div>

---

## Measurement Error

**In the dependent variable `\(Y^*\)`**

The true model satisfies the Gauss-Markov assumptions:

`$$Y^*= \beta_0 + \beta_1X_1 + \dots + \beta_k X_k + e$$`

We observe `\(Y\)` instead of `\(Y^*\)` and define the measurement error in the population as:

`$$e_0 = Y - Y^*$$`

`$$Y^* = Y - e_0$$`

`$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + e + e_0$$`

---

## Measurement Error

`$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + e + e_0$$`

If the measurement error `\(e_0\)` is independent of each `\(X_k\)`, OLS is unbiased and consistent.

If `\(e_0\)` does not have a zero mean, this only changes the intercept (which we usually do not care about).

If `\(e\)` and `\(e_0\)` are uncorrelated:

`$$Var(e+e_0) = \sigma^2 + \sigma^2_0 > \sigma^2$$`

Hence, a larger error variance and less efficiency.

---

## Measurement Error

Measurement error in an explanatory variable:

`$$Y = \beta_0 + \beta_1X_1^* + e \qquad (1)$$`

`$$X_1^* = X_1-e_1 \qquad (2)$$`

`$$X_1 = X_1^*+e_1 \qquad (3)$$`

We assume that the observed `\(X_1\)` adds no information about `\(Y\)` once the true `\(X_1^*\)` is accounted for:

`$$E[Y|X_1,X_1^*] = E[Y|X_1^*]$$`

---

## Measurement Error

**Possibility 1:**

`$$Cov(X_1,e_1)=0$$`

Substituting (2) in (1): `\(Y = \beta_0 + \beta_1X_1 + (e - \beta_1e_1)\)`

With:

`$$Var(e - \beta_1e_1) = \sigma_e^2 + \beta_1^2 \sigma_{e_1}^2$$`

As `\(Cov(X_1,e_1)=0\)`, OLS is unbiased but less efficient.

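---

## Measurement Error

The two benign cases above can be checked numerically. The following is a minimal sketch with assumed parameter values and hypothetical variable names: case (a) adds an independent error `e0` to the dependent variable, and case (b) builds a mismeasured regressor whose error `e1` is uncorrelated with the observed `\(X_1\)` (Possibility 1). In both cases the slope should stay close to `beta1`, while the spread of the residuals grows.

```python
import random

def cov(a, b):
    """Sample covariance (divides by n; slope ratios are unaffected)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def ols(x, y):
    """Return (intercept, slope, residual variance) of y on x."""
    b = cov(x, y) / cov(x, x)
    a = sum(y) / len(y) - b * sum(x) / len(x)
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    return a, b, cov(resid, resid)

random.seed(1)
n = 20_000
beta0, beta1 = 1.0, 2.0  # assumed values, for illustration only

# (a) Measurement error in the dependent variable: Y = Y* + e0.
x = [random.gauss(0, 1) for _ in range(n)]
y_star = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]
y_obs = [ys + random.gauss(0, 1) for ys in y_star]  # e0 independent of x

_, slope_clean, var_clean = ols(x, y_star)
_, slope_noisy, var_noisy = ols(x, y_obs)

# (b) Possibility 1: the error e1 is uncorrelated with the OBSERVED X1,
# so the true regressor is X1* = X1 - e1.
x1 = [random.gauss(0, 1) for _ in range(n)]
e1 = [random.gauss(0, 0.5) for _ in range(n)]
x1_star = [xo - err for xo, err in zip(x1, e1)]
y_b = [beta0 + beta1 * xs + random.gauss(0, 1) for xs in x1_star]

_, slope_p1, var_p1 = ols(x1, y_b)

print(f"(a) slope with Y*: {slope_clean:.3f}, with noisy Y: {slope_noisy:.3f}")
print(f"(a) residual variance: {var_clean:.3f} vs {var_noisy:.3f}")
print(f"(b) slope on observed X1: {slope_p1:.3f}, residual variance: {var_p1:.3f}")
```

With these assumed values, the residual variance in case (b) should be near `1 + beta1**2 * 0.25`, matching the expression for the variance of the composite error.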
---

## Measurement Error

**Possibility 2:**

Classical errors-in-variables (CEV) assumption: `\(E(X_1^*e_1)=0\)`.

The measurement error is uncorrelated with the true value. Yet, from (3):

`$$Cov(X_1, e_1) = E(X_1e_1) = E(X_1^*e_1) + E(e_1^2) = 0 + \sigma_{e_1}^2$$`

This is different from zero and equals the variance of the measurement error. Hence:

`$$Y = \beta_0 + \beta_1 X_1 + (e - \beta_1e_1)$$`

`$$Cov(X_1, e-\beta_1e_1) = -\beta_1Cov(X_1,e_1)= \color{green}{-\beta_1\sigma^2_{e_1}}$$`

Under CEV, OLS of `\(Y\)` on `\(X_1\)` is biased and inconsistent.

---

## Measurement Error

Asymptotically:

`$$plim(\hat\beta_1) = \beta_1 + \frac{Cov(X_1, e -\beta_1 e_1)}{Var(X_1)}$$`

`$$= \beta_1 - \frac{\color{green}{\beta_1\sigma^2_{e_1}}}{\sigma^{2}_{X_1^*} + \sigma^{2}_{e_1}}$$`

`$$= \beta_1 \left( 1 - \frac{\sigma^2_{e_1}}{\sigma^{2}_{X_1^*} + \sigma^{2}_{e_1}} \right)$$`

`$$= \beta_1 \left( \frac{\sigma^{2}_{X_1^*}}{\sigma^{2}_{X_1^*} + \sigma^{2}_{e_1}} \right)$$`

This is the **attenuation bias**, which is smaller when `\(\sigma^{2}_{X_1^*}\)` is large relative to `\(\sigma^{2}_{e_1}\)`.

---

## Measurement Error

When we add more explanatory variables:

`$$Y = \beta_0 + \beta_1X_1^* + \beta_2X_2 + \beta_3X_3 + e$$`

We generally assume `\(e_1\)` to be uncorrelated with `\(X_2\)` and `\(X_3\)`: `\(Cov(X_k,e_1)=0\)` for `\(k = 2, 3\)`.

`$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + e - \beta_1e_1$$`

Under CEV, `\(Cov(X_1, e_1) \ne 0\)`. Writing the linear projection of `\(X_1^*\)` on the other regressors:

`$$X_1^* = \alpha_0 + \alpha_1X_2 + \alpha_2 X_3 + r_1^*$$`

`$$plim (\hat\beta_1)= \beta_1 \left( \frac{\sigma^{2}_{r_1^*}}{\sigma^{2}_{r_1^*} + \sigma^{2}_{e_1}} \right)$$`

Measurement error in a single variable generally makes all the OLS estimators inconsistent if `\(X_1^*\)` is correlated with `\(X_2\)` and `\(X_3\)`.

---

## Missing data

This is a case of **sample selection**.

Sample selection is not a problem if it is random (uncorrelated with the error).

If sample selection is based on the variables included in the regression, it is not a problem, as we condition on these.

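
The claim about selection on included regressors can be illustrated with a short simulation. This is a sketch under assumed parameter values and hypothetical selection rules (keep `x > 0`, or keep `y < 1`): selecting on the regressor should leave the OLS slope close to `beta1`, while selecting on the outcome should pull it toward zero.

```python
import random

def cov(a, b):
    """Sample covariance (divides by n; slope ratios are unaffected)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def slope(pairs):
    """OLS slope of y on x for a list of (x, y) pairs (intercept included)."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return cov(xs, ys) / cov(xs, xs)

random.seed(2)
n = 20_000
beta0, beta1 = 1.0, 1.0  # assumed values, for illustration only

data = []
for _ in range(n):
    x = random.gauss(0, 1)
    y = beta0 + beta1 * x + random.gauss(0, 1)
    data.append((x, y))

full = slope(data)
# Selection on the included regressor: keep only observations with x > 0.
on_x = slope([(x, y) for x, y in data if x > 0])
# Selection on the outcome: keep only observations with y < 1.
on_y = slope([(x, y) for x, y in data if y < 1.0])

print(f"full sample:    {full:.3f}")
print(f"selected on x:  {on_x:.3f}")
print(f"selected on y:  {on_y:.3f}")
```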
Sample selection is a problem if it is correlated with the dependent variable or with unobserved factors in the error **(i.e. endogenous sample selection)**.

---

## Missing data

**Missing at random (MAR):**

`$$savings = \beta_0 + \beta_1 income + \beta_2 age + \beta_3hhsize + e$$`

Suppose the sample is non-random because certain age or income groups were over- or under-sampled.

This is not a problem, because the regression examines savings for subgroups defined by age and income.

**Missing completely at random (MCAR):** missingness is completely unrelated to `\(X_k\)` and `\(e\)`.

---

## Missing data

**Endogenous sample selection:**

`$$wealth = \beta_0 + \beta_1 educ + \beta_2 experience + \beta_3age + e$$`

Suppose the sample is non-random because individuals refuse to answer when their wealth is particularly high or low.

This biases the regression results because these individuals may be systematically different.

Here the selection is related to unobservables.

---

## Outliers (Leverage)

If they are part of the data-generating process (DGP) and do not lose importance asymptotically, we cannot ignore them.

A solution is to estimate by Least Absolute Deviations (LAD).

It minimizes the sum of absolute deviations. These are not squared.

`$$\min_{b_0, \dots, b_k} \sum_{i=1}^n |y_i - b_0 - b_1 x_{i1} - \dots - b_k x_{ik}|$$`

It does not estimate the conditional mean, but the conditional median.

This is a special case of quantile regression.

---

<style>
.centered-word {
  position: absolute;
  top: 50%;
  left: 50%;
  transform: translate(-50%, -50%);
}
</style>

<div class="centered-word">
<h2>The End</h2>
</div>