Fixed effects are a very popular method in education policy. In my opinion, discussions of the method are often over-complicated, because in reality, the way fixed effects work is not that different from things people already understand. The purpose of this page is to give an overview of fixed effects and their use in data analysis in the education policy world. You don’t have to worry about understanding the R code, especially if you are not using R, but pay attention to the output.
Before we begin, let’s consider the following regression model.
$$weeksal = \beta_0 + \beta_1(female) + \varepsilon$$
Here, we are looking at an ordinary least squares regression of weekly salary on a binary indicator (0/1) for whether or not someone is female.
Before we even run a model, let’s check out the data. Just in terms of the data in front of us, is there a difference between males and females on weekly salary? Let’s check it out (with data that I made up, in a dataframe called d).
# Show the average salary for women
mean(d[d$female==1, "weeksal"])
[1] 1666.991
# Show the average salary for men
mean(d[d$female==0, "weeksal"])
[1] 1718.737
Is the difference statistically significant? Sounds like a two-sample independent t-test to me.
# Check statistical significance
t.test(d[d$female==1, "weeksal"], d[d$female==0, "weeksal"])
Welch Two Sample t-test
data: d[d$female == 1, "weeksal"] and d[d$female == 0, "weeksal"]
t = -9.6156, df = 994.92, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-62.30632 -41.18572
sample estimates:
mean of x mean of y
1666.991 1718.737
Lo and behold, there is a statistically significant difference between the weekly salary for female faculty and the weekly salary for male faculty. The difference is 51.75. Keep this in mind.
Let’s try this out using the regression we presented above.
# Test using linear regression
model <- lm(weeksal ~ female, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1718.73730 | 3.89366 | 441.41998 | 0 |
female | -51.74602 | 5.38917 | -9.60185 | 0 |
Based on the results, we might construct a predictive model that looks like this.
$$\widehat{weeksal} = 1718.74 - 51.75(female)$$
Do the values of the intercept and the β1 coefficient look familiar? They should. They are exactly the mean salary of males as shown before, and the difference between males and females as we discussed. This should make sense: the interpretation of the intercept of a regression model is the predicted value of the outcome variable when the independent variable (female) is equal to 0, so the intercept is basically the predicted value for males. To get the predicted value for females, we would enter a value of 1 for the female variable: 1718.74 - 51.75(1) = 1666.99. This is exactly the mean for females as we saw before. Fun, right? The regression coefficient on the female variable is exactly the systematic difference between males and females in terms of weekly salary.
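If you want R to do this arithmetic for you, predict() on a small data frame of covariate values returns the same two means (a minimal sketch, reusing the model object fit above):
# Predicted weekly salary for males (female=0) and females (female=1)
predict(model, newdata=data.frame(female=c(0, 1)))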
Let’s extend this further. Let’s say we actually wanted to know the relationship between hours worked and weekly salary, but we wanted to control for sex. In other words, we were interested in β2 in the following equation.
$$weeksal = \beta_0 + \beta_1(female) + \beta_2(hours) + \varepsilon$$
Let’s do it!
# Run updated model
model <- lm(weeksal ~ female + hours, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 969.74886 | 51.16251 | 18.95429 | 0 |
female | -144.84075 | 8.00968 | -18.08322 | 0 |
hours | 18.74068 | 1.27710 | 14.67442 | 0 |
It looks like for every one-hour increase in hours worked, there is an expected 18.74 increase in weekly salary, controlling for sex. Why is this control important? Well, let’s check the relationship without controlling for sex.
# Run updated model
model <- lm(weeksal ~ hours, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1672.60880 | 38.32027 | 43.64815 | 0.00000 |
hours | 0.44919 | 0.89797 | 0.50023 | 0.61703 |
Now, we find no significant relationship between hours and weekly salary. Why might this be? A graph might help.
# Create scatter plot
ggplot(d, aes(x=hours, y=weeksal)) +
geom_point() +
xlab("Hours worked") +
ylab("Weekly salary")
Doesn’t look like there’s a relationship, does there? Let’s color the dots by sex, though.
# Create scatter plot
ggplot(d, aes(x=hours, y=weeksal, colour=factor(female))) +
geom_point() +
xlab("Hours worked") +
ylab("Weekly salary") +
scale_colour_discrete(name="Female")
Hm, now we might have something. It looks like, generally, if we were to draw an independent regression line for each sex, we would see a positive relationship between hours worked and weekly salary. This is basically what controlling for the female variable does: if females and males were equal on hours worked and equal on weekly salary, this is what the relationship between the two variables would look like. So imagine taking those two big colored clusters of data and dragging them so their centroids were on top of each other. In fact, we can do that by subtracting the mean of each variable for each sex from the data. This process is called “de-meaning.”
$$new\_weeksal = weeksal - \overline{weeksal}_{sex} \qquad new\_hours = hours - \overline{hours}_{sex}$$
# Get means
f_weeksal_mean <- mean(d[d$female==1, "weeksal"])
m_weeksal_mean <- mean(d[d$female==0, "weeksal"])
f_hours_mean <- mean(d[d$female==1, "hours"])
m_hours_mean <- mean(d[d$female==0, "hours"])
# Copy variables
d$new_weeksal <- d$weeksal
d$new_hours <- d$hours
# Subtract means
d$new_weeksal[d$female==1] <- d$new_weeksal[d$female==1] - f_weeksal_mean
d$new_weeksal[d$female==0] <- d$new_weeksal[d$female==0] - m_weeksal_mean
d$new_hours[d$female==1] <- d$new_hours[d$female==1] - f_hours_mean
d$new_hours[d$female==0] <- d$new_hours[d$female==0] - m_hours_mean
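As an aside, base R’s ave() returns the group mean for each row, so the de-meaning above can be written more compactly (an equivalent sketch):
# Subtract the sex-specific mean from each variable in one step
d$new_weeksal <- d$weeksal - ave(d$weeksal, d$female)
d$new_hours <- d$hours - ave(d$hours, d$female)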
Aaaand let’s get that graph again.
# Create scatter plot
ggplot(d, aes(x=new_hours, y=new_weeksal, colour=factor(female))) +
geom_point() +
xlab("Hours worked (de-meaned)") +
ylab("Weekly salary (de-meaned)") +
scale_colour_discrete(name="Female")
Looks like we’ve just moved those clusters on top of each other. Why don’t we try the regression of weekly salary on hours again?
$$new\_weeksal = \beta_0 + \beta_2(new\_hours) + \varepsilon$$
# Run de-meaned model
model <- lm(new_weeksal ~ new_hours, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.00000 | 2.44122 | 0.00000 | 1 |
new_hours | 18.74068 | 1.27646 | 14.68178 | 0 |
See that we now have a β2 value of 18.74. Notice that this is exactly the same value we got for β2 in our original model, where we had controlled for female.
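If you want to verify the equivalence directly in R, compare the coefficients from the two approaches (a minimal check):
# The hours coefficient is identical across the two approaches
full <- lm(weeksal ~ female + hours, data=d)
demeaned <- lm(new_weeksal ~ new_hours, data=d)
coef(full)["hours"]
coef(demeaned)["new_hours"]
(One small thing you can spot in the output tables: the standard errors differ slightly, 1.27710 versus 1.27646, because the de-meaned regression doesn’t know that the group means were estimated and so uses too many degrees of freedom. Dedicated fixed effects routines adjust for this.)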
What are the implications here? There are several equivalent ways you could think about the coefficient on hours:

1. It is the relationship between hours worked and weekly salary, controlling for sex.
2. It is the relationship between the two variables after subtracting the sex-specific mean from each of them (de-meaning).
3. It is the “within-group” relationship: the association between hours worked and weekly salary within each sex.

So what’s the summary here? The reason that we found it important to control for sex in the equation predicting weekly salary from hours worked is that there was a relationship between sex and hours worked, and between sex and weekly salary. We can get rid of these relationships by, for each variable, subtracting out the mean for each sex. Alternatively, we can include a binary variable for sex in the model, which produces the exact same point estimate.
I should note, on the side, that this basic interpretation can be extended to controlling for continuous variables as well.
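To see what that extension looks like, here is a minimal sketch on the same made-up data, using hours as a continuous control and female as the variable of interest. This residualize-then-regress trick is the Frisch-Waugh-Lovell result: partial the control out of both the outcome and the predictor, then regress the residuals on each other.
# Partial the continuous control (hours) out of both variables
r_weeksal <- resid(lm(weeksal ~ hours, data=d))
r_female <- resid(lm(female ~ hours, data=d))
# The slope should match the female coefficient from
# lm(weeksal ~ female + hours, data=d), roughly -144.84
coef(lm(r_weeksal ~ r_female))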
Now that we understand categorical variables, here’s a secret: Fixed effects (in the context of this page) are just a fancy extension of the idea of controlling for a categorical variable.
(Important note: The term “fixed effects” means something completely different in the multilevel modeling framework. I didn’t make the rules, sorry.)
Let’s suppose the different faculty in our study are in different institutions. We might be concerned that there are systematic differences in hours worked across institutions, and there might also be systematic differences in weekly salary. (I know faculty aren’t paid on a weekly salary, by the way. Just use your imagination.) Let’s look at that graph of hours versus salary again, but this time colored by institution (of which there are 10).
# Create scatter plot
ggplot(d, aes(x=hours, y=weeksal, colour=factor(instid))) +
geom_point() +
xlab("Hours worked") +
ylab("Weekly salary") +
scale_colour_discrete(name="Institution ID")
Hmm, looks like we may indeed have some significant differences in salary and hours worked across institutions. How would we take care of this? Same way we took care of sex in the previous models. We can either (a) subtract out the institution-level means for the variables, or (b) control for the institution in the model. Remember that controlling for a multi-category variable involves including a dummy variable for each category and leaving one out. The model would look something like this.
$$weeksal_{ij} = \beta_0 + \beta_1(hours_{ij}) + \beta_2(female_{ij}) + \gamma_j + \varepsilon_{ij}$$
Note that we’ve now added some indices for clarity, where i is the faculty member and j is the institution. So what is γj? We could have written out every dummy variable for each institution (except one). However, often, there are too many clusters for this to be practical. We write it as γj, which we call the institution fixed effect. It is shorthand for indicating that we either put in dummy variables for the institutions, OR we de-meaned the data by institution. Let’s run the model.
# Create dummy variables
d$instid_1 <- ifelse(d$instid==1, 1, 0)
d$instid_2 <- ifelse(d$instid==2, 1, 0)
d$instid_3 <- ifelse(d$instid==3, 1, 0)
d$instid_4 <- ifelse(d$instid==4, 1, 0)
d$instid_5 <- ifelse(d$instid==5, 1, 0)
d$instid_6 <- ifelse(d$instid==6, 1, 0)
d$instid_7 <- ifelse(d$instid==7, 1, 0)
d$instid_8 <- ifelse(d$instid==8, 1, 0)
d$instid_9 <- ifelse(d$instid==9, 1, 0)
d$instid_10 <- ifelse(d$instid==10, 1, 0)
# Running the model
model <- lm(weeksal ~ hours + female + instid_2 + instid_3 + instid_4 +
instid_5 + instid_6 + instid_7 + instid_8 +
instid_9 + instid_10, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 949.71224 | 51.62828 | 18.39520 | 0.00000 |
hours | 18.76000 | 1.27247 | 14.74297 | 0.00000 |
female | -166.94949 | 9.83248 | -16.97939 | 0.00000 |
instid_2 | 14.32314 | 10.85697 | 1.31926 | 0.18739 |
instid_3 | 20.52511 | 10.86608 | 1.88891 | 0.05920 |
instid_4 | 36.36197 | 11.02204 | 3.29902 | 0.00100 |
instid_5 | 18.51329 | 11.30493 | 1.63763 | 0.10182 |
instid_6 | 33.91396 | 11.86406 | 2.85855 | 0.00435 |
instid_7 | 39.39121 | 12.55303 | 3.13799 | 0.00175 |
instid_8 | 40.28037 | 12.99792 | 3.09899 | 0.00200 |
instid_9 | 48.82702 | 13.03837 | 3.74487 | 0.00019 |
instid_10 | 55.41673 | 13.20897 | 4.19539 | 0.00003 |
Focus on the coefficient on the hours variable. Remember those three interpretations I gave of the inclusion of the female variable above? Those same three interpretations apply here.
So essentially, you can view a fixed effects analysis as one that controls for systematic differences across clusters, just like controlling for a categorical variable. There are some conventions in how we treat fixed effects that are worth noting; they apply to categorical variables as well, but we just don’t talk about them as much.
As a presentation note, people often do not report the actual fixed effects estimates in their results. They simply note that fixed effects were included, because the individual coefficients are not typically interpreted.
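One practical note before moving on: you don’t have to build the dummies by hand in R. Wrapping the cluster variable in factor() inside the formula should reproduce the estimates above (a minimal sketch):
# Let lm() create the institution dummies automatically
model <- lm(weeksal ~ hours + female + factor(instid), data=d)
summary(model)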
Fixed effects are a particularly powerful tool when analyzing panel data, where you have the same individuals or observations across time. For this example, I will use data from Griffiths, Lim, and Hill (2008). These data include an lwage variable, which is presumably the log of wages; an hours variable, which is hours worked per week; an id variable, which is a unique ID for the individual faculty member; and a year variable, which includes the year of the data. There are five years of data spanning 1982 to 1988, and there are 716 unique people.
Let’s say we wanted to again look at wage and hours worked. (We took female out of the equation because these data don’t include it.) The model looks like this.
$$\log(wage_{iy}) = \beta_0 + \beta_1(hours_{iy}) + \varepsilon_{iy}$$
Here, the i refers to the person, and the y refers to the year. No fixed effects yet - let’s graph it.
# Create scatter plot
ggplot(d, aes(x=hours, y=lwage)) +
geom_point() +
xlab("Hours worked per week") +
ylab("Log of weekly salary")
If you were just to run the above model, what you would be getting is the relationship between hours and wage, as shown below.
# Run model
model <- lm(lwage ~ hours, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.78731 | 0.03780 | 47.28495 | 0.00000 |
hours | 0.00344 | 0.00097 | 3.53902 | 0.00041 |
The problem with this model is that there might be systematic differences across people, such that people who work longer hours also have other characteristics that contribute to a higher wage. Perhaps it is not the hours people work that matter for wages, but rather the kind of work they do, which leads them both to work longer hours and to earn higher wages. What you would ideally want is some model that estimates how an individual’s changes in hours are associated with changes in wage. In a similar vein, you might also be concerned that as people work longer, their hours might go up, but so might their wage, simply because of yearly raises. As such, you’d ideally want to account for the year.
Enter: fixed effects!
Consider the following fixed effects model.
$$\log(wage_{iy}) = \beta_0 + \beta_1(hours_{iy}) + \delta_i + \eta_y + \varepsilon_{iy}$$
Here, we have included an “individual fixed effect,” δi, and a year fixed effect, ηy. Because fixed effects admit the “within-group” interpretation discussed above, including δi means you can interpret β1 as the association between within-individual changes in hours and changes in wage.
Importantly, we have also included ηy, which is a set of dummies for each year. To be fair, this is not the most important part of the model, but this takes care of the yearly changes in wages and hours. Why didn’t we include years as a continuous variable? Well, including it as a fixed effect allows us to account for non-linearities. Let’s say there was a sudden jump in higher education funding in one year. You’d expect wages to go up a lot that year.
Let’s run it! Wait, but how? Remember, we have two options: including all of the dummies, or de-meaning the data. How exactly you run this model depends on the particular software used. Below, it is run in R using the plm package. Most packages, including this one, perform the analysis via de-meaning.
# Load the package
library(plm)
Loading required package: Formula
# Set the data up as a panel
d_panel <- pdata.frame(d, index=c("id", "year"))
# Run the model
model <- plm(lwage ~ hours, data=d_panel, model="within")
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
hours | -0.00438 | 0.00079 | -5.53462 | 0 |
Because of the use of the fixed effect, we can now interpret the coefficient on hours as the expected change in log wages for a one-unit change in hours within an individual over time.
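One caveat about the code above: plm’s model="within" defaults to individual (“one-way”) fixed effects, so the call absorbs the individual effect δi but not the year effect ηy from our equation. To include both, you can use the effect argument (a sketch; equivalently, you could add factor(year) to the formula):
# Two-way fixed effects: de-mean by both individual and year
model_twoway <- plm(lwage ~ hours, data=d_panel,
                    model="within", effect="twoways")
summary(model_twoway)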
What is nice about this is that you essentially capture any and all systematic differences across people that are time-invariant, meaning that they do not change over time and thus show no variation within individuals in our data. Say, for the sake of argument, that none of our faculty changed genders within this time frame. Then we do not have to worry about the confounding role of gender in wages, because it is taken care of by the individual fixed effect. In fact, we do not have to worry about any time-invariant characteristics of individuals. (You should still, in your models, control for any time-variant characteristics of individuals that would predict both weekly hours and wages.)
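For example, if the data had a time-varying covariate, say a hypothetical tenure variable measuring years in the current job (it is not in these data), you would simply add it to the formula:
# Hypothetical time-varying control; tenure is not actually in these data
model <- plm(lwage ~ hours + tenure, data=d_panel, model="within")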
Why is this good? For many people, this is a big, huge step towards being able to claim causality. While there are still issues in terms of being able to claim the directionality of the relationship, and possible time-variant confounders remain, the elimination of time-invariant characteristics is often a really big deal for us as social scientists. So, be careful in your use of fixed effects for causality, but also recognize the great benefits these models offer. Finally, most importantly, remember to call your mother because she would like to hear from you.