Chapter 9: Assessing Studies Based on Multiple Regression

Karim Naguib (Boston University)
10/27/2013

Internal and External Validity

Populations

The population studied is the population of entities from which a sample is drawn.

The population of interest is the population of entities to which the causal inferences from the study are to be applied.

Internal and External Validity

A statistical analysis is said to be internally valid if its inferences about causal effects are valid for the population studied. The analysis is said to be externally valid if its inferences and conclusions can be generalized from the population and setting studied to other populations and settings.

Threats to Internal Validity

Two components of internal validity:
1. The estimator of the causal effect should be unbiased and consistent.
2. Hypothesis tests should have the desired significance level, and confidence intervals should have the desired confidence level.

Threats to External Validity

Differences in populations: Characteristics of the population studied, or the manner in which it is selected, might make the causal effects different from those in the population of interest.
Differences in settings: Even if the populations are the same, differences in settings (laws, the institutional environment, physical/geographic, etc.) might threaten the external validity of a study.

External Validity of the California Test Scores and STR Study

Suppose that this study is internally valid and that we found that STR has a negative effect on test scores.
Can we generalize these findings to other populations and settings?
How do we assess the external validity of a study?
- We need to use specific knowledge of the populations and settings studied and those of interest.
- If other studies are available, we can compare their results (consider the Massachusetts elementary school study).

Threats to Internal Validity of Multiple Regression Analysis

Possible Causes of Biasedness of OLS Estimators

Omitted variable bias
Misspecification of the functional form of the regression function
Imprecise measurement of the independent variables (errors-in-variables)
Sample selection
Simultaneous causality

Solutions To Omitted Variable Bias (1)

If the omitted variable is observed or there are adequate control variables, we can simply add them to the regression. Steps to decide whether to include a variable:

Identify key coefficients of interest

Decide a priori on the variables that are most likely to cause OVB if omitted (this gives our base specification)

Identify “questionable” variables that might reduce OVB, test whether they are significant when added to the base specification, and exclude them if they are not

Report a summary of all the regression specifications attempted to address possible objections to the study
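
As a rough illustration of why adding an observed control helps, the sketch below simulates data in which a control x2 is correlated with the regressor of interest x1. All names and parameter values are hypothetical, chosen only for the example. The short regression omits x2; the long regression controls for it by first partialling x2 out of x1 (the Frisch-Waugh approach), which gives the same coefficient on x1 as a multiple regression.

```python
import random

def ols(x, y):
    """Return (intercept, slope) from regressing y on x by OLS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(1)
n = 50_000
beta1, beta2 = -1.0, 2.0  # assumed true effects of x1 (of interest) and x2 (control)

x2 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [0.5 * z + rng.gauss(0, 1) for z in x2]  # x1 is correlated with x2
y = [beta1 * a + beta2 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

# Short regression: x2 is omitted, so part of its effect loads onto x1.
_, slope_short = ols(x1, y)

# Long regression: partial x2 out of x1, then regress y on what is left.
a0, a1 = ols(x2, x1)
x1_resid = [a - (a0 + a1 * b) for a, b in zip(x1, x2)]
_, slope_long = ols(x1_resid, y)

# Population value of the short slope: beta1 + beta2 * Cov(x1, x2) / Var(x1)
expected_short = beta1 + beta2 * 0.5 / (0.25 + 1.0)

print("short regression slope:", round(slope_short, 3))
print("long regression slope: ", round(slope_long, 3))
print("OVB formula prediction:", round(expected_short, 3))
```

In this setup the short regression should land near the OVB formula's prediction rather than near the assumed true effect, while the long regression should recover the true effect up to sampling noise.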

Solutions To Omitted Variable Bias (2)

If the omitted variables are not observed and no adequate controls are available, we can use panel data, instrumental variables, or a randomized controlled experiment.

Misspecification of the Functional Form of the Regression Function

If the population regression function is nonlinear and yet we use a linear regression function in our estimation, we will generally introduce bias. This could be, for example, because we omitted a quadratic or cubic term of a regressor.
If the dependent variable is continuous we can examine the scatterplot of the data and determine whether we should include nonlinear terms in the regression.
If the dependent variable is discrete, things are more complicated and will be discussed in Chapter 11.
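
A small simulation can make the functional form problem concrete. The sketch below assumes a hypothetical quadratic population regression function (all coefficients are made up for illustration) and fits a straight line to it. The single linear slope cannot match the true marginal effect, which here is 1 + x and so varies across the range of x.

```python
import random

def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

rng = random.Random(2)
n = 20_000
x = [rng.uniform(0, 4) for _ in range(n)]

# Assumed true model: Y = 1 + X + 0.5 X^2 + u, so the marginal effect is 1 + X.
y = [1.0 + xi + 0.5 * xi ** 2 + rng.gauss(0, 1) for xi in x]

slope_linear = ols_slope(x, y)

def true_effect(xi):
    """Marginal effect of X on Y under the assumed quadratic model."""
    return 1.0 + xi

print("slope from the linear fit:", round(slope_linear, 2))
print("true effect at x = 0.5:   ", true_effect(0.5))
print("true effect at x = 3.5:   ", true_effect(3.5))
```

The linear fit reports one average slope, which overstates the effect at low values of x and understates it at high values.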

Measurement Error and Errors-in-Variables Bias (1)

Suppose there was a mix-up and we ended up using test score data from 10th graders instead of the 5th graders we are interested in. While the two scores might be correlated, they are not exactly the same, and this leads to errors-in-variables bias.

Measurement Error and Errors-in-Variables Bias (2)

Suppose we are interested in regressing \( Y_i \) on the variable \( X_i \) (e.g. actual earnings), but we only observe an imprecisely measured \( \tilde{X}_i \) (e.g. reported earnings). The population regression model is \[ Y_i = \beta_0 + \beta_1 X_i + u_i \]

But with the imprecisely measured \( \tilde{X}_i \) we have \[ \begin{align*} Y_i &= \beta_0 + \beta_1 \tilde{X}_i + [\beta_1(X_i - \tilde{X}_i) + u_i] \\ &= \beta_0 + \beta_1 \tilde{X}_i + v_i \end{align*} \]

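If \( \tilde{X}_i \) is correlated with \( (X_i - \tilde{X}_i) \), then it is correlated with the error term \( v_i \), and \( \hat{\beta}_1 \) will be biased and inconsistent, just as with OVB.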

Classical Measurement Error Model

\[ \tilde{X}_i = X_i + w_i \]

where \( w_i \) is a purely random error with mean zero and variance \( \sigma_w^2 \), and \[ Corr(w_i, X_i) = 0, Corr(w_i, u_i) = 0. \] In this case, \( \hat{\beta}_1 \) has the probability limit \[ \hat{\beta}_1 \overset{p}{\longrightarrow} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\beta_1 \]
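
The probability limit can be traced in a few steps. This is a sketch under the classical assumptions above, and it additionally assumes that \( X_i \) is uncorrelated with \( u_i \). It uses the error term \( v_i = \beta_1(X_i - \tilde{X}_i) + u_i \) from the regression on \( \tilde{X}_i \).

\[ \begin{align*} v_i &= \beta_1(X_i - \tilde{X}_i) + u_i = -\beta_1 w_i + u_i \\ Cov(\tilde{X}_i, v_i) &= Cov(X_i + w_i, -\beta_1 w_i + u_i) = -\beta_1 \sigma_w^2 \\ Var(\tilde{X}_i) &= \sigma_X^2 + \sigma_w^2 \\ \hat{\beta}_1 &\overset{p}{\longrightarrow} \beta_1 + \frac{Cov(\tilde{X}_i, v_i)}{Var(\tilde{X}_i)} = \beta_1 - \frac{\sigma_w^2}{\sigma_X^2 + \sigma_w^2}\beta_1 = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\beta_1 \end{align*} \]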

Because \( \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2} \) is less than 1 whenever \( \sigma_w^2 > 0 \), \( \hat{\beta}_1 \) will be biased towards zero (attenuation bias).
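
The attenuation factor can be checked with a short simulation. The sketch below uses made-up parameter values and normally distributed variables (both are assumptions for illustration only) and compares the slope obtained with the true regressor to the slope obtained with the noisy one.

```python
import random

def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

rng = random.Random(0)
n = 50_000
beta0, beta1 = 1.0, 2.0      # assumed true coefficients
sigma_x, sigma_w = 1.0, 1.0  # assumed standard deviations of X and of the error w

x = [rng.gauss(0, sigma_x) for _ in range(n)]
y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]
x_tilde = [xi + rng.gauss(0, sigma_w) for xi in x]  # classical measurement error

slope_true = ols_slope(x, y)         # regression on the true X
slope_noisy = ols_slope(x_tilde, y)  # regression on the mismeasured X
attenuation = sigma_x ** 2 / (sigma_x ** 2 + sigma_w ** 2)

print("slope using true X: ", round(slope_true, 3))
print("slope using noisy X:", round(slope_noisy, 3))
print("predicted limit:    ", attenuation * beta1)
```

With equal variances for \( X_i \) and \( w_i \), the formula predicts that the estimated slope shrinks to about half of the true coefficient.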

"Best Guess" Measurement Error Model

In the “best guess” model we assume that the respondent does not know the true value of \( X_i \) but reports their best guess given the information they have: they report the mean of \( X_i \) conditional on the available information.
In this case \( \tilde{X}_i \) is not correlated with the measurement error \( (X_i - \tilde{X}_i) \).
If the respondent's information is uncorrelated with \( u_i \), then \( \tilde{X}_i \) is uncorrelated with \( v_i \).
Therefore, \( \hat{\beta}_1 \) is consistent and unbiased.
But, because \( Var(v_i) > Var(u_i) \), our estimates will be less precise.
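
For contrast with the classical model, the sketch below simulates one hypothetical version of the “best guess” setup: the true \( X_i \) is the sum of a part the respondent knows (called s here) and a part they do not, and they report s, which is the conditional mean of \( X_i \) given s. The variable names and parameter values are assumptions made for the example.

```python
import random

def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

rng = random.Random(3)
n = 50_000
beta0, beta1 = 1.0, 2.0  # assumed true coefficients

s = [rng.gauss(0, 1) for _ in range(n)]  # information the respondent has
x = [si + rng.gauss(0, 1) for si in s]   # true X = known part + unknown part
x_tilde = s                              # best guess: E[X | s] = s
y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]

slope_best_guess = ols_slope(x_tilde, y)
print("slope using best-guess X:", round(slope_best_guess, 3))
```

Here the reported value is uncorrelated with the reporting error, so the slope should stay close to the assumed true coefficient even though the regressor is measured with error.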

Measurement Error in Y

Suppose \( Y \) suffers from classical measurement error \[ \tilde{Y}_i = Y_i + w_i \]
The regression model would be \[ \tilde{Y}_i = \beta_0 + \beta_1 X_i + v_i \] where \( v_i = u_i + w_i \).
If \( w_i \) is truly random, then \( E[w_i|X_i] = 0 \) and therefore \( E[v_i|X_i] = 0 \), so \( \hat{\beta}_1 \) is unbiased.
But, because \( Var(v_i) > Var(u_i) \), our estimates will be less precise.
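
Both claims (no bias, less precision) can be seen in a repeated-sampling sketch. The setup below is hypothetical: it draws many small samples, adds independent noise to \( Y_i \), and compares the distribution of the estimated slope with and without the noise.

```python
import random
import statistics

def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

rng = random.Random(4)
beta0, beta1 = 1.0, 2.0  # assumed true coefficients
n, reps = 100, 500       # sample size and number of simulated samples

slopes_clean, slopes_noisy = [], []
for _ in range(reps):
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]
    y_tilde = [yi + rng.gauss(0, 2) for yi in y]  # w_i is independent of x
    slopes_clean.append(ols_slope(x, y))
    slopes_noisy.append(ols_slope(x, y_tilde))

mean_clean, mean_noisy = statistics.mean(slopes_clean), statistics.mean(slopes_noisy)
sd_clean, sd_noisy = statistics.stdev(slopes_clean), statistics.stdev(slopes_noisy)

print("mean slope, Y measured exactly:   ", round(mean_clean, 3))
print("mean slope, Y measured with error:", round(mean_noisy, 3))
print("spread of slope, exact Y:", round(sd_clean, 3))
print("spread of slope, noisy Y:", round(sd_noisy, 3))
```

Both averages should sit near the assumed true slope, while the spread across samples is visibly larger when \( Y_i \) is mismeasured.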

Solutions to Errors-in-Variables Bias

Naturally, the best way to fix this problem is to measure \( X_i \) accurately.
Instrumental variables (discussed in Chapter 12)
Using specialized knowledge of the data, we can model the ratio \( \sigma_w^2/\sigma_X^2 \) and use it to adjust the estimate.

Missing Data and Sample Selection

Types of missing data

Data missing at random
Data missing based on \( X \)
Data missing based on \( Y \), beyond depending on \( X \)

Data Missing at Random

If the data is missing for reasons unrelated to \( X \) and \( Y \), this simply reduces our sample size but does not introduce bias or inconsistency.

Data Missing Based on X

Similarly, if the data is missing for reasons related to \( X \), we end up with a reduction in sample size but no bias or inconsistency.
Depending on what values of \( X \) are missing, this situation would limit our ability to make inferences about the relationship between \( X \) and \( Y \) at the missing values of \( X \).

Data Missing Based on Y

This form of missing data can introduce correlation between the regressors and the error term (as with OVB).
The resulting bias is called sample selection bias.
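
The three missing-data cases can be compared side by side in a simulation. The sketch below is illustrative only: the data-generating process, the 50% random deletion, and the cutoffs at zero are all assumptions chosen for the example.

```python
import random

def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

def slope_of(sample):
    """Slope of y on x for a list of (x, y) pairs."""
    return ols_slope([p[0] for p in sample], [p[1] for p in sample])

rng = random.Random(5)
n = 100_000
beta0, beta1 = 0.0, 1.0  # assumed true coefficients

data = []
for _ in range(n):
    xi = rng.gauss(0, 1)
    data.append((xi, beta0 + beta1 * xi + rng.gauss(0, 1)))

slope_full = slope_of(data)
slope_random = slope_of([p for p in data if rng.random() < 0.5])  # missing at random
slope_on_x = slope_of([p for p in data if p[0] > 0])              # missing based on X
slope_on_y = slope_of([p for p in data if p[1] > 0])              # missing based on Y

print("full sample:       ", round(slope_full, 3))
print("missing at random: ", round(slope_random, 3))
print("missing based on X:", round(slope_on_x, 3))
print("missing based on Y:", round(slope_on_y, 3))
```

Only the last case should move the slope away from the assumed true value. Keeping observations with high \( Y \) means that, at low values of \( X \), only observations with large errors survive, which is the correlation between regressor and error described above.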

Simultaneous Causality (1)

Simultaneous causality occurs when, in addition to \( X \) having a causal effect on \( Y \), the causality also runs backward from \( Y \) to \( X \).
For example, consider a policy to subsidize hiring teachers (reducing STR) for schools that have low test scores. In this case, causality can run in both directions between STR and test scores.

Simultaneous Causality (2)

Simultaneous causality can introduce correlation between the regressors and the error term \[ \begin{align*} Y_i &= \beta_0 + \beta_1 X_i + u_i \\ X_i &= \gamma_0 + \gamma_1 Y_i + v_i \end{align*} \] Unobservable effects in \( u_i \) would affect \( Y_i \), which in turn would affect \( X_i \). If \( \gamma_1 > 0 \), \( X_i \) and \( u_i \) would be positively correlated. (This is sometimes also referred to as simultaneous equations bias.)
Two possible ways to mitigate this problem are instrumental variables and randomized controlled experiments.
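
To see the size of the problem in a concrete case, the sketch below simulates the two-equation system with hypothetical parameter values and independent standard normal errors (all assumptions for the example). It solves the two equations jointly for \( X_i \), generates \( Y_i \), and compares the OLS slope with the value implied by \( \beta_1 + Cov(X_i, u_i)/Var(X_i) \).

```python
import random

def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

rng = random.Random(6)
n = 50_000
beta0, beta1 = 0.0, -1.0   # assumed causal effect of X on Y
gamma0, gamma1 = 0.0, 0.5  # assumed feedback from Y to X

x, y = [], []
for _ in range(n):
    u = rng.gauss(0, 1)
    v = rng.gauss(0, 1)
    # Substitute the Y equation into the X equation and solve for X.
    xi = (gamma0 + gamma1 * beta0 + gamma1 * u + v) / (1 - gamma1 * beta1)
    x.append(xi)
    y.append(beta0 + beta1 * xi + u)

slope = ols_slope(x, y)

# With unit error variances: Cov(X, u) = gamma1 / (1 - gamma1 * beta1) and
# Var(X) = (gamma1^2 + 1) / (1 - gamma1 * beta1)^2.
predicted = beta1 + gamma1 * (1 - gamma1 * beta1) / (gamma1 ** 2 + 1)

print("OLS slope:      ", round(slope, 3))
print("predicted limit:", round(predicted, 3))
print("true beta1:     ", beta1)
```

Under these assumed values the OLS slope settles well away from the true effect, no matter how large the sample is.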

Sources of Inconsistency of OLS Standard Errors

Even if our estimators are unbiased and consistent, if our standard errors are inconsistent it will be difficult to conduct any kind of reasonable inference. We will not be able to conduct hypothesis tests at the desired significance level, and our confidence intervals will not have the correct confidence level. There are two main causes for this:

Heteroskedasticity, which we covered before and explained how the standard errors we use are robust to this problem.
Correlation of the error term across observations

Correlation of the Error Term Across Observations (1)

This problem arises when the error term is correlated across observations.
This would not normally happen if our sample is randomly selected from a population.
However, many times our sampling is only partially random: we may be observing the same entity repeatedly over time, and hence “serial” correlation is induced in the regression.

Correlation of the Error Term Across Observations (2)

Another situation in which this happens is when sampling is based on a geographic unit.
To fix this problem we will discuss how to compute standard errors that are robust to both heteroskedasticity and serial correlation when we discuss panel data.
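
The sketch below illustrates the problem only, not the robust standard errors mentioned above. It assumes a hypothetical panel in which each entity is observed over several periods, the regressor is constant within an entity, and the error contains an entity-level component shared across periods. It then compares the actual sampling spread of the OLS slope with the average conventional (homoskedasticity-only) standard error.

```python
import math
import random
import statistics

def slope_and_conventional_se(x, y):
    """OLS slope and its homoskedasticity-only standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ssr = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return slope, math.sqrt(ssr / (n - 2) / sxx)

rng = random.Random(7)
entities, periods, reps = 50, 10, 300  # assumed panel dimensions and replications
beta1 = 1.0                            # assumed true slope

slopes, ses = [], []
for _ in range(reps):
    x, y = [], []
    for g in range(entities):
        x_g = rng.gauss(0, 1)  # regressor, constant within the entity
        a_g = rng.gauss(0, 1)  # error component shared by all periods of the entity
        for t in range(periods):
            x.append(x_g)
            y.append(beta1 * x_g + a_g + rng.gauss(0, 1))
    b, se = slope_and_conventional_se(x, y)
    slopes.append(b)
    ses.append(se)

actual_sd = statistics.stdev(slopes)
mean_conventional_se = statistics.mean(ses)

print("actual spread of the slope across samples:", round(actual_sd, 3))
print("average conventional standard error:      ", round(mean_conventional_se, 3))
```

In this setup the conventional standard error treats the repeated observations as independent and so understates the true sampling variability, which would make confidence intervals too narrow.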

Example: Test Scores and Class Size

External Validity

To assess whether a study is externally valid (can be generalized to other populations), it is useful to have multiple studies from different populations.
To assess the external validity of the California school districts study we can compare it to another study conducted in Massachusetts, based on standardized test results for fourth graders in 220 public school districts.
Since the populations and the settings are broadly similar, finding similar results in the Massachusetts study would provide evidence of external validity.

Comparing Summary Statistics

Comparison of Mass. and California Results (1)

What we found in the California study

Controlling for student background characteristics reduced the coefficient on \( STR \) from -2.28 to -0.73.
The coefficient on \( STR \) was found to be significant at the 1% level after controlling for student background and school district economic characteristics.
The effect of \( STR \) did not depend in an important way on the percentage of English learners.
There is evidence that the relationship between \( STR \) and \( TestScore \) is nonlinear.

Comparison of Mass. and California Results (2)

In the Massachusetts study, a similar dramatic reduction in the coefficient on \( STR \) is observed on controlling for background characteristics.
The coefficient on \( STR \) is significant at the 5% level, rather than the 1% level found in California. This could be because the sample size is larger in the California study.
No significant difference is found on interacting with the percentage of English learners.
We are not able to reject the hypothesis that the relationship between \( STR \) and \( TestScore \) is linear.

Internal Validity

Threats to consider

Omitted variable bias
Functional form
Errors-in-variables
Selection
Simultaneous causality
Heteroskedasticity and correlation of the error term

Omitted Variable Bias

Study controls for student and district background characteristics.
Possible OVB can still exist if some unobserved factor is correlated with \( STR \) (e.g. teacher quality).
An experiment would be one possible way to remove all doubt about OVB.

Functional Form

Different nonlinear functional forms were considered but none were found to be significant. Further forms could be considered but the analysis so far suggests that this regression is not sensitive to nonlinear specifications.

Errors-in-Variables

\( STR \) is possibly inaccurate since students move between districts, so it may not measure the actual \( STR \) at the time the standardized tests were taken.
Another variable with potential measurement error is average district income. It was obtained from the 1990 census data, but all other variables are from 1998 (Mass.) or 1999 (California). Economic composition of the districts could have changed substantially over this time.

Selection

The data sets cover all public schools in their respective states, so it is unlikely that there is a problem with selection.

Simultaneous Causality

We would observe this problem if there were some reverse causality between test scores and \( STR \), possibly due to some policy. No such policy was in place at the time of the studies (some court cases in California have since led to such a mechanism).

Heteroskedasticity and Correlation of Error Terms

Heteroskedasticity robust standard errors are used, so heteroskedasticity should not pose a problem.
Since all observations are from the same state, there may be some state-specific effect that is correlated across observations.