Advanced Panel Data Methods
Lab Overview
- Pooled OLS
- First-Differences
- Fixed-Effects
- Serial Correlation
The Effect of Enterprise Zones
Papke (1994) studied the effect of the Indiana enterprise zone (EZ) program on unemployment claims. She analyzed 22 cities in Indiana over the period from 1980 to 1988.
Six enterprise zones were designated in 1984, and four more were assigned in 1985. Twelve of the cities in the sample did not receive an enterprise zone over this period; they served as the control group.
EZ incentives consist of tax instruments, such as property tax abatements, income tax deductions and credits for employment creation, capital investment, and income creation in the EZs.
At the time, Indiana had twenty-five municipal EZs and three EZs located on closed military bases.
Goal: Our goal is to assess the impact of the enterprise zone program on unemployment using three tools: pooled OLS, first differences and the fixed effects estimator.
Our dataset contains the following variables:
- city: city identifier
- year: year
- uclms: number of unemployment claims filed
- luclms: log of the number of unemployment claims filed
- guclms: growth rate of unemployment claims filed
- ez: indicator for enterprise zone
- cez: change in the variable ez
- year dummies (d81 to d88)
- city dummies
1. Getting Started
Download Lab 8’s materials from Moodle:
- Save the provided data set in your data folder in the BRM-Labs project folder.
- Save the provided R script in your code folder in the BRM-Labs project folder.
Open the provided Lab 8 R script.
Set up your R environment.
# Clean work environment
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
# Load packages
library(tidyverse)
library(stargazer)
library(ggthemes)
library(GGally)
library(skimr)
library(corrr)
library(knitr)
library(infer)
library(janitor)
library(plm)
library(lmtest)
- Load the data.
# Load data, convert to tibble format, discard original
load("data/ezunem.RData")
tb.ez <- as_tibble(data)
rm(data)
2. Data Exploration
2.1 Summary Statistics
Produce a summary statistics table:
# Summary statistics
stargazer(data.frame(tb.ez %>% dplyr::select(uclms, luclms, ez)),
median = TRUE,
type = "text",
header = FALSE, font.size = "small",
title = "Summary Statistics")
Summary Statistics
===========================================================
Statistic N Mean St. Dev. Min Median Max
-----------------------------------------------------------
uclms 198 95,383.390 89,173.630 12,360 69,170.5 667,208
luclms 198 11.191 0.714 9.422 11.144 13.411
ez 198 0.232 0.423 0 0 1
-----------------------------------------------------------
Check whether the panel is balanced, that is: does each unit of analysis (city) have data for all time periods (9 years of data)?
# Check for panel balance
tb.ez %>%
dplyr::select(year,city) %>%
table()
city
year 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
1980 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1981 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1982 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1983 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1984 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1985 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1986 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1987 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1988 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Alternatively
is.pbalanced(tb.ez, index = c("city", "year"))
[1] TRUE
We can confirm that this is a balanced panel.
Check the timing of treatment:
# When did treatment start?
tb.ez %>%
dplyr::select(ez, year) %>%
table()
year
ez 1980 1981 1982 1983 1984 1985 1986 1987 1988
0 22 22 22 22 16 12 12 12 12
1 0 0 0 0 6 10 10 10 10
Six cities received an EZ in 1984 and four more in 1985.
2.2 Visualization
Distribution of unemployment claims:
# Plot the distribution of unemployment claims
ggplot(tb.ez, aes(x = uclms)) +
geom_histogram() +
theme_bw()
# Plot the distribution of log(unemployment claims)
ggplot(tb.ez, aes(x = luclms)) +
geom_histogram() +
theme_bw()
The distribution of the number of unemployment claims is highly positively skewed. Since the variable takes only positive values, we can take its log (luclms). The log of unemployment claims has a more symmetric, bell-shaped distribution. Oftentimes, this type of transformation improves the fit of the model and mitigates heteroskedasticity.
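A quick numerical check of the skewness, as a sketch: if the mean sits well above the median, the distribution is right-skewed; after the log transformation the two should be much closer.
# Compare mean and median in levels and in logs
tb.ez %>%
  summarise(mean_uclms = mean(uclms), median_uclms = median(uclms),
            mean_luclms = mean(luclms), median_luclms = median(luclms))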
Unemployment time trend:
# Plot unemployment over time
ggplot(tb.ez, aes(x = year, y = luclms)) +
geom_point() +
stat_smooth() +
theme_bw() +
labs(x = 'year', y = 'log(Unemployment Claims)')
In the plot above, we can see a general downward trend in the number of unemployment claims.
Unemployment by year for eventually treated or untreated cities:
# Create treatment indicator
tb.ez <- tb.ez %>%
group_by(city) %>%
mutate(treated = max(ez))
# Plot unemployment by year for EZ/non-EZ cities
ggplot(tb.ez, aes(x = factor(year)
, y = luclms
, fill = factor(treated, levels = c(0,1), labels = c("No EZ", "EZ")))) +
geom_boxplot() + theme_bw() +
theme(legend.position='bottom') +
scale_fill_tableau() +
labs(x = 'year'
, y = 'log(Unemployment Claims)'
, fill = '')
The plot above shows that the cities that received an enterprise zone had, on average, higher levels of unemployment. In the last two years of the period analyzed, we can see that the median number of unemployment claims was lower in cities with an enterprise zone than in cities without one.
Unemployment by year for each city:
# Plot unemployment by year for each city
ggplot(tb.ez,aes(x = factor(year), y = luclms, col = factor(city))) +
geom_point() + geom_line(aes(group = factor(city))) +
theme_bw() +
theme(legend.position = "bottom") +
labs(x = 'year'
, y = 'log(Unemployment Claims)'
, col = 'city') +
guides(color = guide_legend(nrow = 2, byrow = TRUE))
Generally, unemployment claims were falling statewide over this period. This should be reflected in the different year intercepts. In order to control for the time trend, you need to include a dummy variable for each time period, leaving out the first time period. The coefficient of each time dummy will tell you how the outcome at that time period was different from the omitted time period (usually the first).
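As a sketch of what the paragraph above describes, factor(year) can build the time dummies for us: R creates one dummy per year and drops the first year (1980) as the baseline, so each coefficient measures how that year differs from 1980.
# Year dummies via factor(year); coefficients are relative to the omitted year (1980)
summary(lm(luclms ~ factor(year), data = tb.ez))$coefficients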
3. Analysis
Before starting your regression analysis, it is important to carefully consider the structure of your data and the timing of the treatment. Here are some key points to ponder before diving into your analysis:
Treatment Timing: Did the enterprise zone treatment start at the same time for all cities? In other words, is there a clear-cut before-and-after period for the enterprise zone program’s implementation in the data? If not, how does the variation in the timing of treatment affect our analysis?
OLS Suitability: Should we use Ordinary Least Squares (OLS) to estimate the effect of enterprise zones on unemployment? Consider the assumptions of OLS and whether they are violated in this context.
First Differences Approach: Why might we choose to use the First Differences approach in this analysis? How does this method work, and what are its advantages for addressing our research question? How does it help us address the challenge of time-constant unobserved heterogeneity?
Fixed Effects Approach: Why might we choose to use the Fixed Effects approach in this analysis? How does this method work, and what are its advantages for addressing our research question? How does it help us address the challenge of time-constant unobserved heterogeneity?
Model Formulation: What variables should be included in our model? How should we control for time trend effects?
To estimate panel data models in R, we use a new command, plm (panel linear model), implemented in the package of the same name. The syntax is similar to that of lm, but it includes an option model to choose which kind of model is to be estimated (e.g., pooling for pooled OLS, fd for first differences, and within for fixed effects). The option index identifies which variables define the panel structure (the unit-of-observation variable first, then the time variable). plm requires the data to be sorted in ascending order of the variables in index.
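The call pattern is the same for all three estimators; only the model argument changes. A minimal sketch with our data (the full models are estimated below):
# Generic plm call: unit identifier first, then time, in index
plm(luclms ~ ez, data = tb.ez,
    index = c('city', 'year'),  # panel structure
    model = 'pooling')          # or 'fd' (first differences), 'within' (fixed effects)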
A simple model to study the impact of enterprise zones is: \[log(uclms_{it}) = \theta_1 + \theta_2 d_{81} + ... + \theta_9 d_{88} + \beta_1 ez_{it} + a_i + u_{it}\]
where \(uclms_{it}\) is the number of unemployment claims filed during year \(t\) in city \(i\), \(\theta_t\) denotes a different intercept for each time period, and \(ez_{it}\) is an indicator of whether city \(i\) was an enterprise zone at time \(t\). We are interested in \(\beta_1\).
What could the \(a_i\) be?
How could \(ez\) and \(a_i\) be related?
The unobserved effect \(a_i\) represents fixed factors that affect the economic climate in city \(i\). Because enterprise zone designation was not determined randomly (enterprise zones are usually economically depressed areas), it is likely that \(ez_{it}\) and \(a_i\) are positively correlated (high \(a_i\) means higher unemployment claims, which lead to a higher chance of being given an EZ). Thus, we are in a situation where we should use first-differences or fixed-effects estimation to eliminate the \(a_i\).
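One informal way to see this correlation in the data, as a sketch: compare average log unemployment claims in the pre-program years (1980-1983) between cities that eventually received an EZ and cities that never did. If EZ designation targeted depressed areas, the eventually treated cities should show higher average claims.
# Pre-program (1980-1983) average log claims by eventual treatment status
tb.ez %>%
  filter(year <= 1983) %>%
  group_by(treated) %>%
  summarise(mean_luclms = mean(luclms))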
3.1 Pooled OLS
Note that if we use pooled OLS, our estimate of the impact of EZ will be biased as we do not remove the unobserved time-constant factors that are related both to EZ and unemployment claims.
# Pooled OLS
out.pl <- plm(luclms ~ ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88, data = tb.ez
, index = c('city','year')
, model = 'pooling')
stargazer(out.pl, type = "text"
, omit.stat = c("ser", "f")
, no.space = TRUE, header = FALSE
, title = "Pooled OLS results")
Pooled OLS results
========================================
Dependent variable:
---------------------------
luclms
----------------------------------------
ez -0.039
(0.115)
d81 -0.322*
(0.177)
d82 0.135
(0.177)
d83 -0.219
(0.177)
d84 -0.597***
(0.180)
d85 -0.622***
(0.185)
d86 -0.651***
(0.185)
d87 -0.919***
(0.185)
d88 -1.257***
(0.185)
Constant 11.694***
(0.125)
----------------------------------------
Observations 198
R2 0.354
Adjusted R2 0.323
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
3.2 First-Differences
What are the key assumptions of the first-differences approach?
The change in the idiosyncratic error (\(\Delta u_{it} = u_{it} - u_{i,t-1}\)) must be uncorrelated with the changes in the independent variables (\(\Delta x_{it} = x_{it} - x_{i,t-1}\)). (You should think about this problem and whether or not this assumption is reasonable.)
Also, there must be variation in the changes. Example: suppose you want to assess the impact of education on wages but you cannot observe ability. You assume ability is time-constant, so you use panel data and first differences to remove the effect of ability. You then have to think about when to collect the data. If you collect the data from one year to the next on a sample of working people, education is unlikely to change for anyone, so you will not be able to estimate its effect using first differences. But if you collect data ten years apart, the education level of some people in your sample will have changed.
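In our application, a quick sketch of this check is to look at the year-on-year changes in ez within each city; if ez never changed, first differencing would leave nothing to identify \(\beta_1\).
# Year-on-year change in ez within each city (NA for each city's first year)
tb.ez %>%
  group_by(city) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(change_ez = ez - lag(ez)) %>%
  ungroup() %>%
  count(change_ez)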
To ensure R drops the constant, you need to include a 0 (meaning "no intercept") in your model formula.
# First differences (without intercept)
out.fd1 <- plm( luclms ~ 0 + ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88
, data = tb.ez
, index = c('city','year')
, model = 'fd')
stargazer( out.fd1, type = "text"
, omit.stat = c("ser", "f")
, no.space = TRUE, header = FALSE
, title = "First-Differences (without intercept) results")
First-Differences (without intercept) results
========================================
Dependent variable:
---------------------------
luclms
----------------------------------------
ez -0.182**
(0.078)
d81 -0.322***
(0.046)
d82 0.135**
(0.065)
d83 -0.219***
(0.080)
d84 -0.558***
(0.095)
d85 -0.557***
(0.109)
d86 -0.586***
(0.118)
d87 -0.854***
(0.127)
d88 -1.192***
(0.135)
----------------------------------------
Observations 176
R2 0.623
Adjusted R2 0.605
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
Note that if you keep the intercept in the model and include a dummy for each time period except the first (1980), the model will not run: after first differencing, the intercept is perfectly collinear with the full set of year dummies. This is not an issue if we use factor(year) in our model instead of manually adding each time dummy; in that case, R automatically drops either one additional year dummy or the constant (depending on your version of R).
The estimate of \(\beta_1\) is \(-0.182\). The interpretation of the coefficients in a first-differences regression is exactly the same as that of regular OLS coefficients. If \(\Delta x_{it} = x_{it} - x_{i,t-1}\) increases by one unit, that is the same as a one-unit increase in \(x_{it}\). Therefore, we still think of each coefficient as the amount by which the dependent variable (\(y\)) changes when the independent variable (\(x\)) increases by one unit.
So, what is the effect of treatment?
# Store first differences coefficients
coeffs <- coefficients(out.fd1)
# Estimate treatment effect
(exp(coeffs[1])-1)*100
ez
-16.62966
The presence of an EZ causes a 16.6% (\((\exp(-0.181878)-1)\times 100\%\)) fall in unemployment claims. This is an economically large and statistically significant effect.
3.3 Fixed effects
When we used the first-differences method, the unobserved effect \(a_i\) dropped out because it is constant over time. An alternative way to remove it is the fixed effects (within) estimator. The unobserved heterogeneity \(a_i\) is tied to the city, not the period: if we consider all the cities in 1983, they all have different \(a_i\); on the other hand, if we consider city A in all years, its \(a_i\) is always the same. In practice, this amounts to including a dummy variable for each cross-sectional unit (the within transformation that plm applies for us is numerically equivalent to doing so).
Strict Exogeneity Assumption: For the fixed effects estimator to be unbiased, there is a strict exogeneity assumption on the explanatory variables. This means that the idiosyncratic error term (\(u_{it}\)) should be uncorrelated with each explanatory variable across all time periods.
# Fixed effects
out.fe <-plm( luclms ~ 0 + ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88
, data = tb.ez
, index = c('city','year')
, model = 'within')
stargazer( out.fe, type = "text"
, omit.stat = c("ser", "f")
, no.space = TRUE, header = FALSE
, title = "Fixed effects results")
Fixed effects results
========================================
Dependent variable:
---------------------------
luclms
----------------------------------------
ez -0.104*
(0.055)
d81 -0.322***
(0.060)
d82 0.135**
(0.060)
d83 -0.219***
(0.060)
d84 -0.579***
(0.062)
d85 -0.592***
(0.065)
d86 -0.621***
(0.065)
d87 -0.889***
(0.065)
d88 -1.228***
(0.065)
----------------------------------------
Observations 198
R2 0.842
Adjusted R2 0.813
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
As with first differences, the interpretation of the fixed-effects coefficients is the same as with regular OLS. Note that, in this case, R automatically drops the constant of the model.
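As a sketch, the same transformation used for the first-differences estimate translates the fixed-effects coefficient into a percentage effect; with an estimate of about \(-0.104\), this implies roughly a 10% fall in unemployment claims.
# Implied percentage effect of EZ from the fixed effects estimate
(exp(coefficients(out.fe)["ez"]) - 1) * 100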
We can achieve the same result by using “Pooled OLS” and manually adding an indicator for each city:
# Pooled ols with city fixed effects
out.pool <- plm(luclms ~ ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88 + factor(city)
, data = tb.ez
, index = c('city','year')
, model = 'pooling')
stargazer(out.pool, type = "text"
, omit.stat = c("ser", "f")
, omit = c('.*city.*')
, omit.labels=c('City Dummies')
, no.space = TRUE
, header = FALSE
, title = "Pooled OLS (with city dummies) results")
Pooled OLS (with city dummies) results
========================================
Dependent variable:
---------------------------
luclms
----------------------------------------
ez -0.104*
(0.055)
d81 -0.322***
(0.060)
d82 0.135**
(0.060)
d83 -0.219***
(0.060)
d84 -0.579***
(0.062)
d85 -0.592***
(0.065)
d86 -0.621***
(0.065)
d87 -0.889***
(0.065)
d88 -1.228***
(0.065)
Constant 11.676***
(0.080)
----------------------------------------
City Dummies Yes
----------------------------------------
Observations 198
R2 0.933
Adjusted R2 0.921
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
3.4 Summary
In short, we have three different methods that yield the following results:
# Output pooling, fd, and fe results
stargazer(out.pool, out.pl, out.fd1, out.fe
, type = "text"
, column.labels = c('Pool with dummies', 'Pool no city dummies', 'FD', 'FE')
, omit.stat = c("ser", "f")
# hide all variables started by "d" or with the word "city"
, omit = c('d.*','.*city.*')
, omit.labels=c('Time Dummies', 'City Dummies')
, model.names = FALSE
, no.space = TRUE
, header = FALSE
, title = "Summary results")
Summary results
====================================================================
Dependent variable:
-------------------------------------------------------
luclms
Pool with dummies Pool no city dummies FD FE
(1) (2) (3) (4)
--------------------------------------------------------------------
ez -0.104* -0.039 -0.182** -0.104*
(0.055) (0.115) (0.078) (0.055)
Constant 11.676*** 11.694***
(0.080) (0.125)
--------------------------------------------------------------------
Time Dummies Yes Yes Yes Yes
City Dummies Yes No No No
--------------------------------------------------------------------
Observations 198 198 176 198
R2 0.933 0.354 0.623 0.842
Adjusted R2 0.921 0.323 0.605 0.813
====================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
First, note that the first-differences and fixed-effects estimates are not identical. The two estimators are exactly the same when there are only two time periods. With more time periods, the fixed-effects estimator is more efficient if the classical assumptions hold; however, if there is severe serial correlation in the idiosyncratic errors, first differencing may be better.
The fixed-effects estimator gives the same results as pooled OLS with the full set of city dummies (remember that we added them to correct for the bias), so either approach can be used.
In fact, we could have run all of these regressions using lm instead of plm, if we took the time to create the first-differenced, or the demeaned, variables ourselves. plm simply saves us the trouble of doing so.
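As a sketch of the demeaning route: subtract each city's mean from luclms and ez and run lm on the transformed variables. The ez coefficient should match the within estimate above (about \(-0.104\)), although the lm standard errors are not adjusted for the absorbed city means.
# Demean luclms and ez within each city, then run OLS with the time dummies
tb.demeaned <- tb.ez %>%
  group_by(city) %>%
  mutate(luclms_dm = luclms - mean(luclms),
         ez_dm     = ez - mean(ez)) %>%
  ungroup()
lm(luclms_dm ~ ez_dm + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88,
   data = tb.demeaned)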
4. Serial Correlation
4.1 What is serial correlation?
What is serial correlation?
The errors \(u_{it}\) are correlated over time.
What does it mean “they are correlated over time?”
Positive Serial Correlation: This means that if the error is positive in one period, you are more likely to have a positive error in the following period.
Negative Serial Correlation: This means that if the error is negative in one period, you are more likely to have a positive error in the following period.
Note: Regression errors are not observed and therefore we estimate them using the regression residuals \(\hat{u}_{it} = y_{it} - \hat{y}_{it}\)
Why can we have serial correlation in panel data?
Because the same unit of analysis (same individual, same city, or same state/country) shows up multiple times in our data.
What are the consequences of serial correlation?
The consequences of Positive Serial Correlation are:
- Standard errors are underestimated
- t-statistics are inflated
- Type-I error increases (you incorrectly reject the null)
The consequences of Negative Serial Correlation are:
- Standard errors are overstated
- t-statistics are understated
- Type-II error increases (you incorrectly do not reject the null)
Positive serial correlation is a more serious problem than negative serial correlation.
4.2 Detecting serial correlation
We can easily test for the presence of serial correlation in our models. Let's test for serial correlation in our fixed effects model:
# Breusch–Godfrey Test of serial correlation
pbgtest(out.fe)
Breusch-Godfrey/Wooldridge test for serial correlation in panel models
data: luclms ~ 0 + ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88
chisq = 26.453, df = 9, p-value = 0.001722
alternative hypothesis: serial correlation in idiosyncratic errors
The p-value of the test is below 0.01, thus we reject the null hypothesis of no serial correlation in the fixed effects model. This means that we will need to adjust our model’s standard errors.
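Besides the formal test, a quick visual check, as a sketch: pull the residuals from the fixed effects model and see whether they are correlated with their own one-year lags within each city. A clearly positive correlation points to positive serial correlation.
# Correlate FE residuals with their within-city one-year lags
res <- residuals(out.fe)                    # pseries with (city, year) index
tb.res <- tibble(city = plm::index(res)$city,
                 year = plm::index(res)$year,
                 resid = as.numeric(res)) %>%
  arrange(city, year) %>%
  group_by(city) %>%
  mutate(resid_lag = lag(resid)) %>%
  ungroup()
cor(tb.res$resid, tb.res$resid_lag, use = "complete.obs")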
4.3 Correcting SEs for Serial Correlation
In order to compute standard errors robust to serial correlation (and
heteroskedasticity) we can use the following:
vcov = plm::vcovHC(out.fe, method = "arellano")
# Correct SEs
robust_se <- sqrt(diag(plm::vcovHC(out.fe, method = "arellano")))
# Correct SEs in stargazer
stargazer( out.fe, out.fe
, se = list(NULL, robust_se)
, column.labels = c("Non-Robust SEs", "Robust SEs")
, type = "text"
, omit.stat = c("ser","f")
, omit = c('d.*')
, omit.labels = c('Time Dummies')
, model.names = FALSE
, dep.var.labels.include = FALSE
, no.space = TRUE
, header = FALSE
, title = "Serial correlation results")
Serial correlation results
=========================================
Dependent variable:
----------------------------
Non-Robust SEs Robust SEs
(1) (2)
-----------------------------------------
ez -0.104* -0.104
(0.055) (0.069)
-----------------------------------------
Time Dummies Yes Yes
-----------------------------------------
Observations 198 198
R2 0.842 0.842
Adjusted R2 0.813 0.813
=========================================
Note: *p<0.1; **p<0.05; ***p<0.01
Note that, after the standard-error correction, the impact of the enterprise zones on unemployment claims is no longer statistically significant at conventional levels.
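If you just want the corrected coefficient table without stargazer, one option, as a sketch, is coeftest from the lmtest package (loaded above) combined with the same robust variance matrix:
# Robust (Arellano) t-tests for the fixed effects model
coeftest(out.fe, vcov = plm::vcovHC(out.fe, method = "arellano"))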