Fixed effects are a very popular method in education policy. In my opinion, discussions of the method are often over-complicated, because in reality, the way fixed effects work is not that different from things people already understand. The purpose of this page is to give an overview of fixed effects and their use in data analysis in the education policy world. You don’t have to worry about understanding the R code, especially if you are not using R, but pay attention to the output.
Before we begin, let’s consider the following regression model.
$$weeksal = \beta_0 + \beta_1(female) + \varepsilon$$
Here, we are looking at an ordinary least squares regression of weekly salary on a binary indicator (0/1) for whether or not someone is female.
Before we even run a model, let’s check out the data. Just in terms of the data in front of us, is there a difference between males and females on weekly salary? Let’s check it out (with data that I made up, in a dataframe called d).
# Show the average salary for women
mean(d[d$female==1, "weeksal"])
[1] 1666.991
# Show the average salary for men
mean(d[d$female==0, "weeksal"])
[1] 1718.737
Is the difference statistically significant? Sounds like a two-sample independent t-test to me.
# Check statistical significance
t.test(d[d$female==1, "weeksal"], d[d$female==0, "weeksal"])
Welch Two Sample t-test
data: d[d$female == 1, "weeksal"] and d[d$female == 0, "weeksal"]
t = -9.6156, df = 994.92, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-62.30632 -41.18572
sample estimates:
mean of x mean of y
1666.991 1718.737
Lo and behold, there is a statistically significant difference between the weekly salary for female faculty and the weekly salary for male faculty. The difference is 51.75. Keep this in mind.
Let’s try this out using the regression we presented above.
# Test using linear regression
model <- lm(weeksal ~ female, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1718.73730 | 3.89366 | 441.41998 | 0 |
female | -51.74602 | 5.38917 | -9.60185 | 0 |
Based on the results, we might construct a predictive model that looks like this.
$$\widehat{weeksal} = 1718.74 - 51.75(female)$$
Do the values of the intercept and the β1 coefficient look familiar? They should. They are exactly the mean salary of males as shown before, and the difference between males and females as we discussed. This should make sense: the interpretation of the intercept of a regression model is the predicted value of the outcome variable when the independent variable (female) is equal to 0, so the intercept is basically the predicted value for males. To get the predicted value for females, we would enter a value of 1 for the female variable: 1718.74 - 51.75(1) = 1666.99. This is exactly the mean for females as we saw before. Fun, right? The regression coefficient on the female variable is exactly the systematic difference between males and females in terms of weekly salary.
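If you want R to do this arithmetic for you, predict() on a small data frame of covariate values returns the same two means (a minimal sketch, reusing the model object fit above):
# Predicted weekly salary for males (female=0) and females (female=1)
predict(model, newdata=data.frame(female=c(0, 1)))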
Let’s extend this further. Let’s say we actually wanted to know the relationship between hours worked and weekly salary, but we wanted to control for sex. In other words, we were interested in β2 in the following equation.
$$weeksal = \beta_0 + \beta_1(female) + \beta_2(hours) + \varepsilon$$
Let’s do it!
# Run updated model
model <- lm(weeksal ~ female + hours, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 969.74886 | 51.16251 | 18.95429 | 0 |
female | -144.84075 | 8.00968 | -18.08322 | 0 |
hours | 18.74068 | 1.27710 | 14.67442 | 0 |
It looks like for every one-hour increase in hours worked, there is an expected 18.74 increase in weekly salary, controlling for sex. Why is this control important? Well, let’s check the relationship without controlling for sex.
# Run updated model
model <- lm(weeksal ~ hours, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1672.60880 | 38.32027 | 43.64815 | 0.00000 |
hours | 0.44919 | 0.89797 | 0.50023 | 0.61703 |
Now, we find no significant relationship between hours and weekly salary. Why might this be? A graph might help.
# Create scatter plot
ggplot(d, aes(x=hours, y=weeksal)) +
geom_point() +
xlab("Hours worked") +
ylab("Weekly salary")
Doesn’t look like there’s a relationship, does there? Let’s color the dots by sex, though.
# Create scatter plot
ggplot(d, aes(x=hours, y=weeksal, colour=factor(female))) +
geom_point() +
xlab("Hours worked") +
ylab("Weekly salary") +
scale_colour_discrete(name="Female")
Hm, now we might have something. It looks like, generally, if we were to draw an independent regression line for each sex, we would see a positive relationship between hours worked and weekly salary. This is basically what controlling for the female variable does: if females and males were equal on hours worked and equal on weekly salary, this is what the relationship between the two variables would look like. So imagine taking those two big colored clusters of data and dragging them so their centroids were on top of each other. In fact, we can do that by subtracting the mean of each variable for each sex from the data. This process is called “de-meaning.”
$$new\_weeksal = weeksal - \overline{weeksal}_{sex} \qquad new\_hours = hours - \overline{hours}_{sex}$$
# Get means
f_weeksal_mean <- mean(d[d$female==1, "weeksal"])
m_weeksal_mean <- mean(d[d$female==0, "weeksal"])
f_hours_mean <- mean(d[d$female==1, "hours"])
m_hours_mean <- mean(d[d$female==0, "hours"])
# Copy variables
d$new_weeksal <- d$weeksal
d$new_hours <- d$hours
# Subtract means
d$new_weeksal[d$female==1] <- d$new_weeksal[d$female==1] - f_weeksal_mean
d$new_weeksal[d$female==0] <- d$new_weeksal[d$female==0] - m_weeksal_mean
d$new_hours[d$female==1] <- d$new_hours[d$female==1] - f_hours_mean
d$new_hours[d$female==0] <- d$new_hours[d$female==0] - m_hours_mean
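As an aside, base R’s ave() returns the group mean for each row, so the de-meaning above can be written more compactly (an equivalent sketch):
# Subtract the sex-specific mean from each variable in one step
d$new_weeksal <- d$weeksal - ave(d$weeksal, d$female)
d$new_hours <- d$hours - ave(d$hours, d$female)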
Aaaand let’s get that graph again.
# Create scatter plot
ggplot(d, aes(x=new_hours, y=new_weeksal, colour=factor(female))) +
geom_point() +
xlab("Hours worked (de-meaned)") +
ylab("Weekly salary (de-meaned)") +
scale_colour_discrete(name="Female")
Looks like we’ve just moved those clusters on top of each other. Why don’t we try the regression of weekly salary on hours again?
$$new\_weeksal = \beta_0 + \beta_2(new\_hours) + \varepsilon$$
# Run de-meaned model
model <- lm(new_weeksal ~ new_hours, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.00000 | 2.44122 | 0.00000 | 1 |
new_hours | 18.74068 | 1.27646 | 14.68178 | 0 |
See that we now have a β2 value of 18.74. Notice that this is exactly the same value we got for β2 in our original model, where we had controlled for female.
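If you want to verify the equivalence directly in R, compare the coefficients from the two approaches (a minimal check):
# The hours coefficient is identical across the two approaches
full <- lm(weeksal ~ female + hours, data=d)
demeaned <- lm(new_weeksal ~ new_hours, data=d)
coef(full)["hours"]
coef(demeaned)["new_hours"]
(One small thing you can spot in the output tables: the standard errors differ slightly, 1.27710 versus 1.27646, because the de-meaned regression doesn’t know that the group means were estimated and so uses too many degrees of freedom. Dedicated fixed effects routines adjust for this.)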
What are the implications here? There are several equivalent ways you could think about the coefficient on hours:

1. It is the relationship between hours worked and weekly salary, controlling for sex.
2. It is the relationship between the two variables after subtracting the sex-specific mean from each of them (de-meaning).
3. It is the “within-group” relationship: the association between hours worked and weekly salary within each sex.

So what’s the summary here? The reason that we found it important to control for sex in the equation predicting weekly salary from hours worked is that there was a relationship between sex and hours worked, and between sex and weekly salary. We can get rid of these relationships by, for each variable, subtracting out the mean for each sex. Alternatively, we can include a binary variable for sex in the model, which produces the exact same point estimate.
I should note, on the side, that this basic interpretation can be extended to controlling for continuous variables as well.
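To see what that extension looks like, here is a minimal sketch on the same made-up data, using hours as a continuous control and female as the variable of interest. This residualize-then-regress trick is the Frisch-Waugh-Lovell result: partial the control out of both the outcome and the predictor, then regress the residuals on each other.
# Partial the continuous control (hours) out of both variables
r_weeksal <- resid(lm(weeksal ~ hours, data=d))
r_female <- resid(lm(female ~ hours, data=d))
# The slope should match the female coefficient from
# lm(weeksal ~ female + hours, data=d), roughly -144.84
coef(lm(r_weeksal ~ r_female))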
Now that we understand categorical variables, here’s a secret: Fixed effects (in the context of this page) are just a fancy extension of the idea of controlling for a categorical variable.
(Important note: The term “fixed effects” means something completely different in the multilevel modeling framework. I didn’t make the rules, sorry.)
Let’s suppose the different faculty in our study are in different institutions. We might be concerned that there are systematic differences in hours worked across institutions, and there might also be systematic differences in weekly salary. (I know faculty aren’t paid on a weekly salary, by the way. Just use your imagination.) Let’s look at that graph of hours versus salary again, but this time colored by institution (of which there are 10).
# Create scatter plot
ggplot(d, aes(x=hours, y=weeksal, colour=factor(instid))) +
geom_point() +
xlab("Hours worked") +
ylab("Weekly salary") +
scale_colour_discrete(name="Institution ID")
Hmm, looks like we may indeed have some significant differences in salary and hours worked across institutions. How would we take care of this? Same way we took care of sex in the previous models. We can either (a) subtract out the institution-level means for the variables, or (b) control for the institution in the model. Remember that controlling for a multi-category variable involves including a dummy variable for each category and leaving one out. The model would look something like this.
$$weeksal_{ij} = \beta_0 + \beta_1(hours_{ij}) + \beta_2(female_{ij}) + \gamma_j + \varepsilon_{ij}$$
Note that we’ve now added some indices for clarity, where i is the faculty member and j is the institution. So what is γj? We could have written out every dummy variable for each institution (except one). However, often, there are too many clusters for this to be practical. We write it as γj, which we call the institution fixed effect. It is shorthand for indicating that we either put in dummy variables for the institutions, OR we de-meaned the data by institution. Let’s run the model.
# Create dummy variables
d$instid_1 <- ifelse(d$instid==1, 1, 0)
d$instid_2 <- ifelse(d$instid==2, 1, 0)
d$instid_3 <- ifelse(d$instid==3, 1, 0)
d$instid_4 <- ifelse(d$instid==4, 1, 0)
d$instid_5 <- ifelse(d$instid==5, 1, 0)
d$instid_6 <- ifelse(d$instid==6, 1, 0)
d$instid_7 <- ifelse(d$instid==7, 1, 0)
d$instid_8 <- ifelse(d$instid==8, 1, 0)
d$instid_9 <- ifelse(d$instid==9, 1, 0)
d$instid_10 <- ifelse(d$instid==10, 1, 0)
# Running the model
model <- lm(weeksal ~ hours + female + instid_2 + instid_3 + instid_4 +
instid_5 + instid_6 + instid_7 + instid_8 +
instid_9 + instid_10, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 949.71224 | 51.62828 | 18.39520 | 0.00000 |
hours | 18.76000 | 1.27247 | 14.74297 | 0.00000 |
female | -166.94949 | 9.83248 | -16.97939 | 0.00000 |
instid_2 | 14.32314 | 10.85697 | 1.31926 | 0.18739 |
instid_3 | 20.52511 | 10.86608 | 1.88891 | 0.05920 |
instid_4 | 36.36197 | 11.02204 | 3.29902 | 0.00100 |
instid_5 | 18.51329 | 11.30493 | 1.63763 | 0.10182 |
instid_6 | 33.91396 | 11.86406 | 2.85855 | 0.00435 |
instid_7 | 39.39121 | 12.55303 | 3.13799 | 0.00175 |
instid_8 | 40.28037 | 12.99792 | 3.09899 | 0.00200 |
instid_9 | 48.82702 | 13.03837 | 3.74487 | 0.00019 |
instid_10 | 55.41673 | 13.20897 | 4.19539 | 0.00003 |
Focus on the coefficient on the hours variable. Remember those three interpretations I gave of the inclusion of the female variable above? Those same three interpretations apply here.
So essentially, you can view a fixed effects analysis as one that controls for systematic differences across clusters, just like controlling for a categorical variable. There are some conventions in how we treat fixed effects that are worth noting; they apply to categorical variables as well, but we just don’t talk about them as much.
As a presentation note, people often do not report the actual fixed effects estimates in their results. They simply note that fixed effects were included, because the individual coefficients are not typically interpreted.
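One practical note before moving on: you don’t have to build the dummies by hand in R. Wrapping the cluster variable in factor() inside the formula should reproduce the estimates above (a minimal sketch):
# Let lm() create the institution dummies automatically
model <- lm(weeksal ~ hours + female + factor(instid), data=d)
summary(model)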
Fixed effects are a particularly powerful tool when analyzing panel data, where you have the same individuals or observations across time. For this example, I will use data from Griffiths, Lim, and Hill (2008). These data include an lwage variable, which is presumably the log of wages; an hours variable, which is hours worked per week; an id variable, which is a unique ID for the individual faculty member; and a year variable, which includes the year of the data. There are five years of data spanning 1982 to 1988, and there are 716 unique people.
Let’s say we wanted to again look at wage and hours worked. (We took female out of the equation because these data don’t include it.) The model looks like this.
$$\log(wage_{iy}) = \beta_0 + \beta_1(hours_{iy}) + \varepsilon_{iy}$$
Here, the i refers to the person, and the y refers to the year. No fixed effects yet - let’s graph it.
# Create scatter plot
ggplot(d, aes(x=hours, y=lwage)) +
geom_point() +
xlab("Hours worked per week") +
ylab("Log of weekly salary")
If you were just to run the above model, what you would be getting is the relationship between hours and wage, as shown below.
# Run model
model <- lm(lwage ~ hours, data=d)
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.78731 | 0.03780 | 47.28495 | 0.00000 |
hours | 0.00344 | 0.00097 | 3.53902 | 0.00041 |
The problem with this model is that there might be systematic differences across people, such that people who work longer hours also have other characteristics that contribute to a higher wage. Perhaps it is not the hours people work that matter for wages, but rather the kind of work they do, which leads them both to work longer hours and to earn higher wages. What you would ideally want is some model that estimates how an individual’s changes in hours are associated with changes in wage. In a similar vein, you might also be concerned that as people work longer, their hours might go up, but so might their wage, simply because of yearly raises. As such, you’d ideally want to account for the year.
Enter: fixed effects!
Consider the following fixed effects model.
$$\log(wage_{iy}) = \beta_0 + \beta_1(hours_{iy}) + \delta_i + \eta_y + \varepsilon_{iy}$$
Here, we have included an “individual fixed effect,” δi, and a year fixed effect, ηy. Because fixed effects admit the “within-group” interpretation discussed above, including δi means you can interpret β1 as the association between within-individual changes in hours and changes in wage.
Importantly, we have also included ηy, which is a set of dummies for each year. To be fair, this is not the most important part of the model, but this takes care of the yearly changes in wages and hours. Why didn’t we include years as a continuous variable? Well, including it as a fixed effect allows us to account for non-linearities. Let’s say there was a sudden jump in higher education funding in one year. You’d expect wages to go up a lot that year.
Let’s run it! Wait, but how? Remember, we have two options: including all of the dummies, or de-meaning the data. How exactly you run this model depends on the particular software used. Below, it is run in R using the plm package. Most packages, including this one, perform the analysis via de-meaning.
# Load the package
library(plm)
Loading required package: Formula
# Set the data up as a panel
d_panel <- pdata.frame(d, index=c("id", "year"))
# Run the model
model <- plm(lwage ~ hours, data=d_panel, model="within")
summary(model)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
hours | -0.00438 | 0.00079 | -5.53462 | 0 |
Because of the use of the fixed effect, we can now interpret the coefficient on hours as the expected change in log wages for a one-unit change in hours within an individual over time.
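One caveat about the code above: plm’s model="within" defaults to individual (“one-way”) fixed effects, so the call absorbs the individual effect δi but not the year effect ηy from our equation. To include both, you can use the effect argument (a sketch; equivalently, you could add factor(year) to the formula):
# Two-way fixed effects: de-mean by both individual and year
model_twoway <- plm(lwage ~ hours, data=d_panel,
                    model="within", effect="twoways")
summary(model_twoway)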
What is nice about this is that you essentially capture any and all systematic differences across people that are time-invariant, meaning that they do not change over time and thus show no variation within individuals in our data. Say, for the sake of argument, that none of our faculty changed genders within this time frame. Then we do not have to worry about the confounding role of gender in wages, because it is taken care of by the individual fixed effect. In fact, we do not have to worry about any time-invariant characteristics of individuals. (You should still, in your models, control for any time-variant characteristics of individuals that would predict both weekly hours and wages.)
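For example, if the data had a time-varying covariate, say a hypothetical tenure variable measuring years in the current job (it is not in these data), you would simply add it to the formula:
# Hypothetical time-varying control; tenure is not actually in these data
model <- plm(lwage ~ hours + tenure, data=d_panel, model="within")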
Why is this good? For many people, this is a big, huge step towards being able to claim causality. While there are still issues in terms of being able to claim the directionality of the relationship, and possible time-variant confounders remain, the elimination of time-invariant characteristics is often a really big deal for us as social scientists. So, be careful in your use of fixed effects for causality, but also recognize the great benefits these models offer. Finally, most importantly, remember to call your mother because she would like to hear from you.