In an effort to add theoretical and practical clarity to the class, a number of days will include brief simulations demonstrating important concepts or tools. Today we will cover omitted variables and the conditions under which fixed or random effects are useful for addressing the issue. As we will see, modeling structure within the data – whether through fixed or random effects – can sometimes help and sometimes hurt. If this is completely unfamiliar, see this article for comments on fixed effects in particular.
Advice in the literature regarding which covariates to include in a regression analysis is contradictory. One recommendation, usually in older work, was to condition on as many variables as possible. Others have recommended including all covariates related to the “treatment” of interest, while still others recommend including all covariates related to the outcome.
As noted by Clarke (2005), scholars are often led towards bloated “garbage can” regressions out of the fear of omitted variables. Such advice can easily, however, lead to what is known as bad control and bias amplification – the inclusion of additional control variables may increase or decrease bias and, in general, we do not know which is the case in any particular situation. See, for example, Thoemmes (2015) and the citations therein.
More recently these arguments have been applied to the inclusion of fixed or random effects as well. After all, random effects are regularized fixed effects and fixed effects are control variables. In short, while textbook treatments of fixed/random effects usually advertise them as a way of accounting for omitted time-invariant variables, in practice we never know whether bias has increased or decreased outside of two situations – (1) when we simulate the data ourselves and so know the truth and (2) when valid alternative identification strategies, such as instruments, are available.
The key message I’d like to impress upon you is to be aware of the uncertainty surrounding identification-by-control strategies. One active literature worth reading centers on the work of Judea Pearl, which makes lucidly clear that the “correct” conditioning set depends on the structural model one has in mind – and we rarely know whether that model is “true.” In general this is how traditional statistical models work – standard statistical inference is only valid if the only unknowns are the parameters to be estimated. If anything else is unknown – such as the underlying structural model or the correct specification – then things become more difficult and tenuous, if not impossible. There is no panacea.
Omitted variable bias is best understood as a particular type of regressor-error dependency. Before we jump into multilevel models, let’s look at some examples of good or bad control. We will begin by simulating some correlated data.
set.seed(1234)
library(MASS)
library(texreg)
# Generate three correlated predictors
n_obs <- 1000
cor_mat <- matrix(c( 1.0, -0.5,  0.5,
                    -0.5,  1.0, -0.5,
                     0.5, -0.5,  1.0), nrow = 3, ncol = 3)
X <- mvrnorm(n_obs, rep(0, 3), cor_mat, empirical = TRUE)
# Make X4 a "bad control": a collider caused by u1 (which also
# enters X1) and u2 (which also enters the outcome)
u1 <- rnorm(n_obs)
u2 <- rnorm(n_obs)
X4 <- u1 + u2 + rnorm(n_obs)
X[,1] <- X[,1] + u1
# Generate the outcome; all true slopes equal 1
betas <- c(1, 1, 1)
y <- 3 + X %*% betas + u2 + rnorm(n_obs)
d <- data.frame(y, X, X4 = X4)
# Estimate pooled models with different conditioning sets
m1 <- lm(y ~ X1 + X2 + X3, data = d)
m2 <- lm(y ~ X1 + X3, data = d)
m3 <- lm(y ~ X1 + X2, data = d)
m4 <- lm(y ~ X1, data = d)
m5 <- lm(y ~ X1 + X2 + X3 + X4, data = d)
htmlreg(l = list(m1, m2, m3, m4, m5),
        doctype = FALSE,
        caption = "")
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| (Intercept) | 2.96*** | 2.95*** | 2.96*** | 2.96*** | 2.95*** |
| | (0.04) | (0.05) | (0.05) | (0.05) | (0.04) |
| X1 | 1.05*** | 0.90*** | 1.19*** | 1.03*** | 0.81*** |
| | (0.03) | (0.04) | (0.04) | (0.04) | (0.03) |
| X2 | 1.03*** | | 0.61*** | | 0.94*** |
| | (0.05) | | (0.05) | | (0.05) |
| X3 | 0.98*** | 0.54*** | | | 1.06*** |
| | (0.05) | (0.05) | | | (0.05) |
| X4 | | | | | 0.37*** |
| | | | | | (0.03) |
| R2 | 0.63 | 0.48 | 0.49 | 0.43 | 0.70 |
| Adj. R2 | 0.63 | 0.48 | 0.49 | 0.43 | 0.69 |
| Num. obs. | 1000 | 1000 | 1000 | 1000 | 1000 |
| RMSE | 1.35 | 1.61 | 1.59 | 1.69 | 1.23 |

***p < 0.001; **p < 0.01; *p < 0.05
The first model is correctly specified and gives the correct coefficient estimate for X1. Models 2 and 3 each omit one of the control variables, yielding biased estimates of the effect of X1. Model 4 shows how, in this case, the biases from omitting X2 and X3 happen to offset one another when both are excluded, yielding a roughly correct coefficient estimate. Finally, Model 5 demonstrates “bad control”: X4 is a collider – it is caused by u1 (which also enters X1) and by u2 (which also enters y) – so conditioning on it induces a spurious dependency between X1 and the error, biasing the coefficient estimates.
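The offsetting in Model 4 is not magic. The omitted-variable-bias formula says the short-regression coefficient on X1 equals the long-regression coefficient plus each omitted coefficient times the slope from regressing that omitted variable on X1 – an exact in-sample identity for OLS. A quick check using the objects created above (the auxiliary-regression names are my own):

# Auxiliary regressions: how each omitted variable moves with X1
aux2 <- coef(lm(X2 ~ X1, data = d))["X1"]
aux3 <- coef(lm(X3 ~ X1, data = d))["X1"]
# Implied coefficient on X1 when X2 and X3 are both dropped;
# aux2 and aux3 are roughly -0.25 and 0.25, so the two biases cancel
coef(m1)["X1"] + coef(m1)["X2"] * aux2 + coef(m1)["X3"] * aux3
coef(m4)["X1"]  # matches the implied value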
Additional details are a bit outside the scope of what we want to cover here, but note that these concerns apply to any covariate adjustment technique, including propensity scores. For related discussions see Pearl (2009), Elwert and Winship (2014), and Steiner and Kim (2016).
So when can the inclusion of fixed/random effects help? Simply put, when the model is correct except for factors which vary only across, but not within, units. We saw a great example of this with Simpson’s Paradox. Let’s take a closer look at that example.
library(ggplot2)
set.seed(1234)
nobs <- 1000
n_groups <- 4
grp <- rep_len(1:n_groups, nobs)  # cycle observations through the groups
labels <- letters[grp]
# Group means of x decrease as the group index increases...
grp_means <- -seq(-n_groups, n_groups, length.out = n_groups)
x <- rnorm(nobs, grp_means[grp])
err <- rnorm(nobs)
# ...while the group effect on y increases: the recipe for Simpson's Paradox
y <- x + 8*grp + err
simp <- data.frame(y, x, grp, grp_means[grp], labels)
ggplot(simp, aes(x = x, y = y, color = labels)) +
  geom_point()
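A quick numeric check of what the plot shows (a sketch using the objects just created): the pooled slope and the within-group slope do not even share a sign.

# Pooled regression ignores the groups entirely; the slope comes out negative
coef(lm(y ~ x, data = simp))["x"]
# Conditioning on group membership recovers the within-group slope near 1
coef(lm(y ~ x + factor(grp), data = simp))["x"]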
The last time we saw this example it was simply to illustrate what fixed and random effects do and how they relate to each other. What we didn’t talk about in detail was how the data were generated or alternative approaches to estimating the model. In particular, I’d like to draw your attention to the dataset and the DGP.
y <- x + 8*grp + err
simp <- data.frame(y,x,grp,grp_means[grp],labels)
head(simp)
## y x grp grp_means.grp. labels
## 1 9.587601 2.7929343 1 4.000000 a
## 2 17.912229 1.6107626 2 1.333333 b
## 3 22.211963 -0.2488922 3 -1.333333 c
## 4 26.289673 -6.3456977 4 -4.000000 d
## 5 13.132076 4.4291247 1 4.000000 a
## 6 15.933506 1.8393892 2 1.333333 b
Effectively what we have done is create the grouping structure, reflected by our labels, by omitting the covariate grp from the analysis. Look what happens if we instead include it as a variable.
m1 <- lm(y ~ x + grp, data = simp)
m2 <- lm(y ~ x + labels, data = simp)
htmlreg(l=list(m1,m2),
doctype = FALSE,
caption = "")
| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | -0.43 | 7.76*** |
| | (0.22) | (0.14) |
| x | 1.06*** | 1.06*** |
| | (0.03) | (0.03) |
| grp | 8.18*** | |
| | (0.09) | |
| labelsb | | 8.12*** |
| | | (0.12) |
| labelsc | | 16.41*** |
| | | (0.19) |
| labelsd | | 24.50*** |
| | | (0.26) |
| R2 | 0.97 | 0.97 |
| Adj. R2 | 0.97 | 0.97 |
| Num. obs. | 1000 | 1000 |
| RMSE | 0.98 | 0.98 |

***p < 0.001; **p < 0.01; *p < 0.05
The coefficient estimate for x, our variable of interest, is the same in either case and a good estimate of the true effect. This is because the fixed effects are exactly what is needed to model the omitted variable.
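As an aside, the dummy-variable approach above is numerically equivalent to the classic “within” estimator: demean y and x inside each group and regress the residuals on each other. A minimal sketch (the demeaned column names are my own):

# Demean y and x within each group
simp$y_dm <- with(simp, ave(y, grp, FUN = function(v) v - mean(v)))
simp$x_dm <- with(simp, ave(x, grp, FUN = function(v) v - mean(v)))
# By the Frisch-Waugh-Lovell theorem this slope matches the coefficient
# on x from the dummy-variable regressions above
coef(lm(y_dm ~ x_dm, data = simp))["x_dm"]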
The idea extends naturally to multiple omitted variables that have the same structure and more complex grouping. We can illustrate this with a two-way fixed effects model and a more complicated structure.
set.seed(1234)
nobs <- 1000
n_g1 <- 4
n_g2 <- 8
# Two crossed grouping factors, assigned independently at random
grp1 <- sample(1:n_g1, nobs, replace = TRUE)
grp2 <- sample(1:n_g2, nobs, replace = TRUE)
labels1 <- letters[grp1]
labels2 <- letters[26 - grp2]  # label the second factor from the end of the alphabet
# Each factor shifts the mean of x...
grp_means1 <- rnorm(n_g1, 0, 2)
grp_means2 <- rnorm(n_g2, 0, 2)
x <- rnorm(nobs, grp_means1[grp1] + grp_means2[grp2])
err <- rnorm(nobs)
# ...and also enters the outcome directly
y <- x + grp1 - grp2 + err
data <- data.frame(y, x,
                   grp1, grp2,
                   grpm1 = grp_means1[grp1],
                   grpm2 = grp_means2[grp2],
                   labels1, labels2)
ggplot(data, aes(x = x, y = y, color = paste(labels1, "-", labels2))) +
  geom_point()
Here we now have two variables that will be omitted: grp1 and grp2. Both are constant within their respective groups. Group assignments were sampled independently, so the two grouping factors are crossed and each observation belongs to one group from each factor. Let’s estimate a number of models to see what happens.
m1 <- lm(y ~ x, data = data)
m2 <- lm(y ~ x + grp1 + grp2, data = data)
m3 <- lm(y ~ x + labels1 + labels2, data = data)
m4 <- lm(y ~ x + labels1, data = data)
m5 <- lm(y ~ x + labels2, data = data)
htmlreg(l = list(m1, m2, m3, m4, m5), caption = "", doctype = FALSE)
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| (Intercept) | -2.12*** | 0.14 | -6.89*** | -4.21*** | -5.77*** |
| | (0.08) | (0.11) | (0.11) | (0.16) | (0.13) |
| x | 0.79*** | 0.99*** | 0.97*** | 0.54*** | 1.15*** |
| | (0.03) | (0.01) | (0.03) | (0.03) | (0.02) |
| grp1 | | 0.96*** | | | |
| | | (0.03) | | | |
| grp2 | | -1.00*** | | | |
| | | (0.02) | | | |
| labels1b | | | 1.01*** | 2.24*** | |
| | | | (0.13) | (0.23) | |
| labels1c | | | 1.87*** | 1.60*** | |
| | | | (0.09) | (0.20) | |
| labels1d | | | 2.95*** | 4.24*** | |
| | | | (0.15) | (0.24) | |
| labels2s | | | 1.05*** | | 1.01*** |
| | | | (0.13) | | (0.18) |
| labels2t | | | 1.99*** | | 2.45*** |
| | | | (0.15) | | (0.18) |
| labels2u | | | 2.91*** | | 3.80*** |
| | | | (0.19) | | (0.20) |
| labels2v | | | 3.83*** | | 4.31*** |
| | | | (0.17) | | (0.19) |
| labels2w | | | 5.14*** | | 5.33*** |
| | | | (0.13) | | (0.17) |
| labels2x | | | 5.89*** | | 7.00*** |
| | | | (0.26) | | (0.23) |
| labels2y | | | 6.89*** | | 7.12*** |
| | | | (0.14) | | (0.18) |
| R2 | 0.45 | 0.92 | 0.92 | 0.58 | 0.84 |
| Adj. R2 | 0.45 | 0.92 | 0.92 | 0.58 | 0.84 |
| Num. obs. | 1000 | 1000 | 1000 | 1000 | 1000 |
| RMSE | 2.59 | 1.01 | 1.01 | 2.26 | 1.40 |

***p < 0.001; **p < 0.01; *p < 0.05
Models 2 and 3 are the correct specifications. In the former we include the omitted variables themselves as covariates; in the latter we include fixed effects for each group, which are able to “suck up” this form of endogeneity. In all other models we are faced with a specification problem. Of particular note are Models 4 and 5, which include only one set of fixed effects. Model 4 is an example of bias amplification – compared to Model 1, which included no fixed effects at all, its estimate of x is even more biased despite being “closer” to an adequate specification. Model 5, by contrast, reverses the sign of the bias.
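A compact way to see the pattern (a convenience sketch, not part of the original script) is to pull the x coefficients from all five models side by side:

# Extract the coefficient on x from each fitted model
sapply(list(m1 = m1, m2 = m2, m3 = m3, m4 = m4, m5 = m5),
       function(m) coef(m)["x"])
# Roughly 0.79, 0.99, 0.97, 0.54, 1.15 -- only Models 2 and 3
# come close to the true value of 1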
It is simple to specify this within a random effects framework as well.
library(lme4)
m6 <- lmer(y ~ x + (1|labels1) + (1|labels2), data= data)
summary(m6)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ x + (1 | labels1) + (1 | labels2)
## Data: data
##
## REML criterion at convergence: 2931.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.06757 -0.68444 0.02147 0.66298 2.94074
##
## Random effects:
## Groups Name Variance Std.Dev.
## labels2 (Intercept) 5.837 2.416
## labels1 (Intercept) 1.577 1.256
## Residual 1.022 1.011
## Number of obs: 1000, groups: labels2, 8; labels1, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -1.97283 1.06065 -1.86
## x 0.96548 0.03246 29.74
##
## Correlation of Fixed Effects:
## (Intr)
## x 0.011
You’ll note that this is pretty close to what was estimated with the two-way fixed effects model.
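To make the comparison concrete, we can line up the x estimates and inspect the estimated group intercepts, which lme4 shrinks toward zero (a quick sketch using the models fit above):

# Fixed-effects vs. random-effects estimate of the coefficient on x
c(FE = unname(coef(m3)["x"]), RE = unname(fixef(m6)["x"]))
# The predicted group intercepts (BLUPs); compare to the dummy
# coefficients from the fixed effects model above
ranef(m6)$labels1
ranef(m6)$labels2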
As a final example, we will look at the case of omitted variables which do not satisfy the “no variation within group” requirement by adapting the above.
set.seed(1234)
nobs <- 1000
n_g1 <- 4
n_g2 <- 8
grp1 <- sample(1:n_g1, nobs, replace = TRUE)
grp2 <- sample(1:n_g2, nobs, replace = TRUE)
labels1 <- letters[grp1]
labels2 <- letters[26 - grp2]
grp_means1 <- rnorm(n_g1, 0, 2)
grp_means2 <- rnorm(n_g2, 0, 2)
x1 <- rnorm(nobs, grp_means1[grp1] + grp_means2[grp2])
x2 <- x1 + rnorm(nobs)  # an omitted variable that varies *within* groups
err <- rnorm(nobs)
y <- x1 + x2 + grp1 - grp2 + err
data <- data.frame(y, x1, x2,  # note x1 and x2, not the stale x from before
                   grp1, grp2,
                   grpm1 = grp_means1[grp1],
                   grpm2 = grp_means2[grp2],
                   labels1, labels2)
m7 <- lm(y ~ x1 + x2 + labels1 + labels2, data = data)
m8 <- lm(y ~ x1 + labels1 + labels2, data = data)
htmlreg(list(m7, m8), doctype = FALSE, caption = "")
| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | -6.96*** | -6.85*** |
| | (0.10) | (0.15) |
| x1 | 1.00*** | 1.96*** |
| | (0.04) | (0.05) |
| x2 | 0.99*** | |
| | (0.03) | |
| labels1b | 1.02*** | 1.03*** |
| | (0.13) | (0.19) |
| labels1c | 2.03*** | 1.90*** |
| | (0.09) | (0.13) |
| labels1d | 3.05*** | 3.00*** |
| | (0.15) | (0.21) |
| labels2s | 0.90*** | 0.95*** |
| | (0.13) | (0.18) |
| labels2t | 1.91*** | 1.90*** |
| | (0.14) | (0.21) |
| labels2u | 2.84*** | 2.76*** |
| | (0.19) | (0.27) |
| labels2v | 3.91*** | 3.74*** |
| | (0.17) | (0.24) |
| labels2w | 5.02*** | 5.15*** |
| | (0.13) | (0.18) |
| labels2x | 5.89*** | 5.78*** |
| | (0.26) | (0.36) |
| labels2y | 6.92*** | 6.82*** |
| | (0.14) | (0.19) |
| R2 | 0.97 | 0.94 |
| Adj. R2 | 0.97 | 0.94 |
| Num. obs. | 1000 | 1000 |
| RMSE | 1.00 | 1.42 |

***p < 0.001; **p < 0.01; *p < 0.05
Here x2 is an omitted variable that varies within groups. When it is included in the specification everything is kosher; when it is removed, the fixed effects do nothing to remedy the problem. Indeed, because x2 = x1 + noise, the coefficient on x1 roughly doubles, as x1 absorbs the effect of its omitted twin.
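A simple diagnostic makes the failure transparent (a sketch): fixed effects can only absorb variation that is constant within groups, and x2 varies freely inside every group.

# If x2 were group-constant these within-group variances would all be ~0;
# instead they are large, so group dummies cannot stand in for x2
tapply(data$x2, data$labels1, var)
tapply(data$x2, data$labels2, var)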
There are other specific contexts in which fixed/random effects will fail to provide unbiased estimates of treatment effects. See Imai and Kim (2019) for a discussion regarding how this limits applications to longitudinal data in particular.
In the above we saw that, insofar as omitted factors are invariant within groups, they may be captured using knowledge of group membership. But what if we don’t know that either? While somewhat outside the scope of the class, this touches on a closely related family of latent variable models. In particular, there is a clean bridge to mixture models, in which it is suspected or known that subpopulations exist within the sample but membership in those subpopulations is unknown. There is a cool illustration available here.
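For a taste of what that looks like, here is a minimal sketch using the mclust package (my choice – any finite-mixture tool would do) on the Simpson’s Paradox data from earlier, pretending we never observed the group labels:

library(mclust)
# Fit a four-component Gaussian mixture to (x, y) without using grp
fit <- Mclust(simp[, c("x", "y")], G = 4)
# Cross-tabulate the recovered clusters against the true (hidden) groups
table(fit$classification, simp$grp)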
The best way to learn R is to use R, and the best way to really understand methods is to poke around at them, either with the mathematical theory or with simulations. I suggest considering the following questions: