Advanced Panel Data Methods
Lab Overview
- Pooled OLS
- First-Differences
- Fixed-Effects
- Serial Correlation
The Effect of Enterprise Zones
Papke (1994) studied the effect of the Indiana enterprise zone (EZ) program on unemployment claims. She analyzed 22 cities in Indiana over the period from 1980 to 1988.
Six enterprise zones were designated in 1984, and four more were assigned in 1985. Twelve of the cities in the sample did not receive an enterprise zone over this period; they served as the control group.
EZ incentives consist of tax instruments, such as property tax abatements, income tax deductions and credits for employment creation, capital investment, and income creation in the EZs.
At the time, Indiana had twenty-five municipal EZs and three EZs located on closed military bases.
Goal: Our goal is to assess the impact of the enterprise zone program on unemployment using three tools: pooled OLS, first differences and the fixed effects estimator.
Our dataset contains the following variables:
- city: city identifier
- year: year
- uclms: number of unemployment claims filed
- luclms: log of the number of unemployment claims filed
- guclms: growth rate of unemployment claims filed
- ez: indicator for enterprise zone
- cez: change in the variable ez
- year dummies (d81 to d88)
- city dummies
1. Getting Started
Download Lab 8’s materials from Moodle:
- Save the provided data set in your data folder in the BRM-Labs project folder.
- Save the provided R script in your code folder in the BRM-Labs project folder.
Open the provided Lab 8 R script.
Set up your R environment.
# Clean work environment
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
# Load packages
library(tidyverse)
library(stargazer)
library(ggthemes)
library(GGally)
library(skimr)
library(corrr)
library(knitr)
library(infer)
library(janitor)
library(plm)
library(lmtest)
- Load the data.
# Load data, convert to tibble format, discard original
load("data/ezunem.RData")
tb.ez <- as_tibble(data)
rm(data)
2. Data Exploration
2.1 Summary Statistics
Produce a summary statistics table:
# Summary statistics
stargazer(data.frame(tb.ez %>% dplyr::select(uclms, luclms, ez)),
median = TRUE,
type = "text",
header = FALSE, font.size = "small",
title = "Summary Statistics")
Summary Statistics
===========================================================
Statistic N Mean St. Dev. Min Median Max
-----------------------------------------------------------
uclms 198 95,383.390 89,173.630 12,360 69,170.5 667,208
luclms 198 11.191 0.714 9.422 11.144 13.411
ez 198 0.232 0.423 0 0 1
-----------------------------------------------------------
Check whether the panel is balanced, that is: does each unit of analysis (city) have data for all time periods (9 years of data)?
# Check for panel balance
tb.ez %>%
dplyr::select(year,city) %>%
table()
city
year 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
1980 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1981 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1982 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1983 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1984 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1985 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1986 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1987 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1988 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Alternatively
is.pbalanced(tb.ez, index = c("city", "year"))
[1] TRUE
We can confirm that this is a balanced panel.
Check the timing of treatment:
# When did treatment start?
tb.ez %>%
dplyr::select(ez, year) %>%
table()
year
ez 1980 1981 1982 1983 1984 1985 1986 1987 1988
0 22 22 22 22 16 12 12 12 12
1 0 0 0 0 6 10 10 10 10
Six cities received an EZ in 1984 and four more in 1985.
2.2 Visualization
Distribution of unemployment claims:
# Plot the distribution of unemployment claims
ggplot(tb.ez, aes(x = uclms)) +
geom_histogram() +
theme_bw()
# Plot the distribution of log(unemployment claims)
ggplot(tb.ez, aes(x = luclms)) +
geom_histogram() +
theme_bw()
The distribution of the number of unemployment claims is highly positively skewed. Since the variable takes only positive values, we can take its log (luclms). The log of unemployment claims has a more symmetric, bell-shaped distribution. Oftentimes, this type of transformation improves the fit of the model and mitigates heteroskedasticity.
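A quick numerical check of the skewness, as a sketch: if the mean sits well above the median, the distribution is right-skewed; after the log transformation the two should be much closer.
# Compare mean and median in levels and in logs
tb.ez %>%
  summarise(mean_uclms = mean(uclms), median_uclms = median(uclms),
            mean_luclms = mean(luclms), median_luclms = median(luclms))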
Unemployment time trend:
# Plot unemployment over time
ggplot(tb.ez, aes(x = year, y = luclms)) +
geom_point() +
stat_smooth() +
theme_bw() +
labs(x = 'year', y = 'log(Unemployment Claims)')
In the plot above, we can see a general downward trend in the number of unemployment claims.
Unemployment by year for eventually treated or untreated cities:
# Create treatment indicator
tb.ez <- tb.ez %>%
group_by(city) %>%
mutate(treated = max(ez))
# Plot unemployment by year for EZ/non-EZ cities
ggplot(tb.ez, aes(x = factor(year)
, y = luclms
, fill = factor(treated, levels = c(0,1), labels = c("No EZ", "EZ")))) +
geom_boxplot() + theme_bw() +
theme(legend.position='bottom') +
scale_fill_tableau() +
labs(x = 'year'
, y = 'log(Unemployment Claims)'
, fill = '')
The plot above shows that the cities that received an enterprise zone had, on average, higher levels of unemployment. In the last two years of the period analyzed, we can see that the median number of unemployment claims was lower in cities with an enterprise zone than in cities without one.
Unemployment by year for each city:
# Plot unemployment by year for each city
ggplot(tb.ez,aes(x = factor(year), y = luclms, col = factor(city))) +
geom_point() + geom_line(aes(group = factor(city))) +
theme_bw() +
theme(legend.position = "bottom") +
labs(x = 'year'
, y = 'log(Unemployment Claims)'
, col = 'city') +
guides(color = guide_legend(nrow = 2, byrow = TRUE))
Generally, unemployment claims were falling statewide over this period. This should be reflected in the different year intercepts. In order to control for the time trend, you need to include a dummy variable for each time period, leaving out the first time period. The coefficient of each time dummy will tell you how the outcome at that time period was different from the omitted time period (usually the first).
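As a sketch of what the paragraph above describes, factor(year) can build the time dummies for us: R creates one dummy per year and drops the first year (1980) as the baseline, so each coefficient measures how that year differs from 1980.
# Year dummies via factor(year); coefficients are relative to the omitted year (1980)
summary(lm(luclms ~ factor(year), data = tb.ez))$coefficients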
3. Analysis
Before starting your regression analysis, it is important to carefully consider the structure of your data and the timing of the treatment. Here are some key points to ponder before diving into your analysis:
Treatment Timing: Did the enterprise zone treatment start at the same time for all cities? In other words, is there a clear-cut before-and-after period for the enterprise zone program’s implementation in the data? If not, how does the variation in the timing of treatment affect our analysis?
OLS Suitability: Should we use Ordinary Least Squares (OLS) to estimate the effect of enterprise zones on unemployment? Consider the assumptions of OLS and whether they are violated in this context.
First Differences Approach: Why might we choose to use the First Differences approach in this analysis? How does this method work, and what are its advantages for addressing our research question? How does it help us address the challenge of time-constant unobserved heterogeneity?
Fixed Effects Approach: Why might we choose to use the Fixed Effects approach in this analysis? How does this method work, and what are its advantages for addressing our research question? How does it help us address the challenge of time-constant unobserved heterogeneity?
Model Formulation: What variables should be included in our model? How should we control for time trend effects?
To estimate panel data models in R, we use a new command, plm (panel linear model), implemented in the package of the same name. The syntax is similar to that of lm, but it includes an option model to choose which kind of model is to be estimated (e.g., pooling for pooled OLS, fd for first differences, and within for fixed effects). The option index identifies which variables define the panel structure (the unit-of-observation variable first, then the time variable). plm requires the data to be sorted in ascending order of the variables in index.
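The call pattern is the same for all three estimators; only the model argument changes. A minimal sketch with our data (the full models are estimated below):
# Generic plm call: unit identifier first, then time, in index
plm(luclms ~ ez, data = tb.ez,
    index = c('city', 'year'),  # panel structure
    model = 'pooling')          # or 'fd' (first differences), 'within' (fixed effects)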
A simple model to study the impact of enterprise zones is: \[log(uclms_{it}) = \theta_1 + \theta_2 d_{81} + ... + \theta_9 d_{88} + \beta_1 ez_{it} + a_i + u_{it}\]
where \(uclms_{it}\) is the number of unemployment claims filed during year \(t\) in city \(i\), \(\theta_t\) denotes a different intercept for each time period, and \(ez_{it}\) is an indicator of whether city \(i\) was an enterprise zone at time \(t\). We are interested in \(\beta_1\).
What could the \(a_i\) be?
How could \(ez\) and \(a_i\) be related?
The unobserved effect \(a_i\) represents fixed factors that affect the economic climate in city \(i\). Because enterprise zone designation was not determined randomly (enterprise zones are usually economically depressed areas), it is likely that \(ez_{it}\) and \(a_i\) are positively correlated (high \(a_i\) means higher unemployment claims, which lead to a higher chance of being given an EZ). Thus, we are in a situation where we should use first-differences or fixed-effects estimation to eliminate the \(a_i\).
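One informal way to see this correlation in the data, as a sketch: compare average log unemployment claims in the pre-program years (1980-1983) between cities that eventually received an EZ and cities that never did. If EZ designation targeted depressed areas, the eventually treated cities should show higher average claims.
# Pre-program (1980-1983) average log claims by eventual treatment status
tb.ez %>%
  filter(year <= 1983) %>%
  group_by(treated) %>%
  summarise(mean_luclms = mean(luclms))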
3.1 Pooled OLS
Note that if we use pooled OLS, our estimate of the impact of EZ will be biased as we do not remove the unobserved time-constant factors that are related both to EZ and unemployment claims.
# Pooled OLS
out.pl <- plm(luclms ~ ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88, data = tb.ez
, index = c('city','year')
, model = 'pooling')
stargazer(out.pl, type = "text"
, omit.stat = c("ser", "f")
, no.space = TRUE, header = FALSE
, title = "Pooled OLS results")
Pooled OLS results
========================================
Dependent variable:
---------------------------
luclms
----------------------------------------
ez -0.039
(0.115)
d81 -0.322*
(0.177)
d82 0.135
(0.177)
d83 -0.219
(0.177)
d84 -0.597***
(0.180)
d85 -0.622***
(0.185)
d86 -0.651***
(0.185)
d87 -0.919***
(0.185)
d88 -1.257***
(0.185)
Constant 11.694***
(0.125)
----------------------------------------
Observations 198
R2 0.354
Adjusted R2 0.323
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
3.2 First-Differences
What are the key assumptions of the first-differences approach?
The change in the idiosyncratic error (\(\Delta u_{it} = u_{it} - u_{i,t-1}\)) must be uncorrelated with the changes in the independent variables (\(\Delta x_{it} = x_{it} - x_{i,t-1}\)). (You should think about this problem and whether or not this assumption is reasonable.)
Also, there must be variation in the changes. Example: suppose you want to assess the impact of education on wages but you cannot observe ability. You assume ability is time-constant, so you use panel data and first differences to remove the effect of ability. You then have to think about when to collect the data. If you collect the data from one year to the next on a sample of working people, education is unlikely to change for anyone, so you will not be able to estimate its effect using first differences. But if you collect data ten years apart, the education level of some people in your sample will have changed.
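In our application, a quick sketch of this check is to look at the year-on-year changes in ez within each city; if ez never changed, first differencing would leave nothing to identify \(\beta_1\).
# Year-on-year change in ez within each city (NA for each city's first year)
tb.ez %>%
  group_by(city) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(change_ez = ez - lag(ez)) %>%
  ungroup() %>%
  count(change_ez)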
To ensure R drops the constant, you need to include a 0 (meaning "no intercept") in your model formula.
# First differences (without intercept)
out.fd1 <- plm( luclms ~ 0 + ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88
, data = tb.ez
, index = c('city','year')
, model = 'fd')
stargazer( out.fd1, type = "text"
, omit.stat = c("ser", "f")
, no.space = TRUE, header = FALSE
, title = "First-Differences (without intercept) results")
First-Differences (without intercept) results
========================================
Dependent variable:
---------------------------
luclms
----------------------------------------
ez -0.182**
(0.078)
d81 -0.322***
(0.046)
d82 0.135**
(0.065)
d83 -0.219***
(0.080)
d84 -0.558***
(0.095)
d85 -0.557***
(0.109)
d86 -0.586***
(0.118)
d87 -0.854***
(0.127)
d88 -1.192***
(0.135)
----------------------------------------
Observations 176
R2 0.623
Adjusted R2 0.605
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
Note that if you keep the intercept in the model and include a dummy for each time period except the first (1980), the model will not run: after first differencing, the intercept is perfectly collinear with the full set of year dummies. This is not an issue if we use factor(year) in our model instead of manually adding each time dummy; in that case, R automatically drops either one additional year dummy or the constant (depending on your version of R).
The estimate of \(\beta_1\) is \(-0.182\). The interpretation of the coefficients in a first-differences regression is exactly the same as that of regular OLS coefficients. If \(\Delta x_{it} = x_{it} - x_{i,t-1}\) increases by one unit, that is the same as a one-unit increase in \(x_{it}\). Therefore, we still think of each coefficient as the amount by which the dependent variable (\(y\)) changes when the independent variable (\(x\)) increases by one unit.
So, what is the effect of treatment?
# Store first differences coefficients
coeffs <- coefficients(out.fd1)
# Estimate treatment effect
(exp(coeffs[1])-1)*100
ez
-16.62966
The presence of an EZ causes a 16.6% (\((\exp(-0.181878)-1)\times 100\%\)) fall in unemployment claims. This is an economically large and statistically significant effect.
3.3 Fixed effects
When we used the first-differences method, the unobserved effect \(a_i\) dropped out because it is constant over time. An alternative way to remove it is the fixed effects (within) estimator. The unobserved heterogeneity \(a_i\) is tied to the city, not the period: if we consider all the cities in 1983, they all have different \(a_i\); on the other hand, if we consider city A in all years, its \(a_i\) is always the same. In practice, this amounts to including a dummy variable for each cross-sectional unit (the within transformation that plm applies for us is numerically equivalent to doing so).
Strict Exogeneity Assumption: For the fixed effects estimator to be unbiased, there is a strict exogeneity assumption on the explanatory variables. This means that the idiosyncratic error term (\(u_{it}\)) should be uncorrelated with each explanatory variable across all time periods.
# Fixed effects
out.fe <-plm( luclms ~ 0 + ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88
, data = tb.ez
, index = c('city','year')
, model = 'within')
stargazer( out.fe, type = "text"
, omit.stat = c("ser", "f")
, no.space = TRUE, header = FALSE
, title = "Fixed effects results")
Fixed effects results
========================================
Dependent variable:
---------------------------
luclms
----------------------------------------
ez -0.104*
(0.055)
d81 -0.322***
(0.060)
d82 0.135**
(0.060)
d83 -0.219***
(0.060)
d84 -0.579***
(0.062)
d85 -0.592***
(0.065)
d86 -0.621***
(0.065)
d87 -0.889***
(0.065)
d88 -1.228***
(0.065)
----------------------------------------
Observations 198
R2 0.842
Adjusted R2 0.813
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
As with first differences, the interpretation of the fixed-effects coefficients is the same as with regular OLS. Note that, in this case, R automatically drops the constant of the model.
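As a sketch, the same transformation used for the first-differences estimate translates the fixed-effects coefficient into a percentage effect; with an estimate of about \(-0.104\), this implies roughly a 10% fall in unemployment claims.
# Implied percentage effect of EZ from the fixed effects estimate
(exp(coefficients(out.fe)["ez"]) - 1) * 100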
We can achieve the same result by using “Pooled OLS” and manually adding an indicator for each city:
# Pooled ols with city fixed effects
out.pool <- plm(luclms ~ ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88 + factor(city)
, data = tb.ez
, index = c('city','year')
, model = 'pooling')
stargazer(out.pool, type = "text"
, omit.stat = c("ser", "f")
, omit = c('.*city.*')
, omit.labels=c('City Dummies')
, no.space = TRUE
, header = FALSE
, title = "Pooled OLS (with city dummies) results")
Pooled OLS (with city dummies) results
========================================
Dependent variable:
---------------------------
luclms
----------------------------------------
ez -0.104*
(0.055)
d81 -0.322***
(0.060)
d82 0.135**
(0.060)
d83 -0.219***
(0.060)
d84 -0.579***
(0.062)
d85 -0.592***
(0.065)
d86 -0.621***
(0.065)
d87 -0.889***
(0.065)
d88 -1.228***
(0.065)
Constant 11.676***
(0.080)
----------------------------------------
City Dummies Yes
----------------------------------------
Observations 198
R2 0.933
Adjusted R2 0.921
========================================
Note: *p<0.1; **p<0.05; ***p<0.01
3.4 Summary
In short, we have three different methods that yield the following results:
# Output pooling, fd, and fe results
stargazer(out.pool, out.pl, out.fd1, out.fe
, type = "text"
, column.labels = c('Pool with dummies', 'Pool no city dummies', 'FD', 'FE')
, omit.stat = c("ser", "f")
# hide all variables started by "d" or with the word "city"
, omit = c('d.*','.*city.*')
, omit.labels=c('Time Dummies', 'City Dummies')
, model.names = FALSE
, no.space = TRUE
, header = FALSE
, title = "Summary results")
Summary results
====================================================================
Dependent variable:
-------------------------------------------------------
luclms
Pool with dummies Pool no city dummies FD FE
(1) (2) (3) (4)
--------------------------------------------------------------------
ez -0.104* -0.039 -0.182** -0.104*
(0.055) (0.115) (0.078) (0.055)
Constant 11.676*** 11.694***
(0.080) (0.125)
--------------------------------------------------------------------
Time Dummies Yes Yes Yes Yes
City Dummies Yes No No No
--------------------------------------------------------------------
Observations 198 198 176 198
R2 0.933 0.354 0.623 0.842
Adjusted R2 0.921 0.323 0.605 0.813
====================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
First, note that the first-differences and fixed-effects estimates are not identical. The two estimators are exactly the same when there are only two time periods. With more time periods, the fixed-effects estimator is more efficient if the classical assumptions hold; however, if there is severe serial correlation in the idiosyncratic errors, first differencing may be better.
The fixed-effects estimator gives the same results as pooled OLS with the full set of city dummies (remember that we added them to correct for the bias), so either approach can be used.
In fact, we could have run all of these regressions using lm instead of plm, if we took the time to create the first-differenced, or the demeaned, variables ourselves. plm simply saves us the trouble of doing so.
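As a sketch of the demeaning route: subtract each city's mean from luclms and ez and run lm on the transformed variables. The ez coefficient should match the within estimate above (about \(-0.104\)), although the lm standard errors are not adjusted for the absorbed city means.
# Demean luclms and ez within each city, then run OLS with the time dummies
tb.demeaned <- tb.ez %>%
  group_by(city) %>%
  mutate(luclms_dm = luclms - mean(luclms),
         ez_dm     = ez - mean(ez)) %>%
  ungroup()
lm(luclms_dm ~ ez_dm + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88,
   data = tb.demeaned)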
4. Serial Correlation
4.1 What is serial correlation?
What is serial correlation?
The errors \(u_{it}\) are correlated over time.
What does it mean “they are correlated over time?”
Positive Serial Correlation: This means that if the error is positive in one period, you are more likely to have a positive error in the following period.
Negative Serial Correlation: This means that if the error is negative in one period, you are more likely to have a positive error in the following period.
Note: Regression errors are not observed and therefore we estimate them using the regression residuals \(\hat{u}_{it} = y_{it} - \hat{y}_{it}\)
Why can we have serial correlation in panel data?
Because the same unit of analysis (same individual, same city, or same state/country) shows up multiple times in our data.
What are the consequences of serial correlation?
The consequences of Positive Serial Correlation are:
- Standard errors are underestimated
- t-statistics are inflated
- Type-I error increases (you incorrectly reject the null)
The consequences of Negative Serial Correlation are:
- Standard errors are overstated
- t-statistics are understated
- Type-II error increases (you incorrectly do not reject the null)
Positive serial correlation is a more serious problem than negative serial correlation.
4.2 Detecting serial correlation
We can easily test for the presence of serial correlation in our models. Let's test for serial correlation in our fixed effects model:
# Breusch–Godfrey Test of serial correlation
pbgtest(out.fe)
Breusch-Godfrey/Wooldridge test for serial correlation in panel models
data: luclms ~ 0 + ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88
chisq = 26.453, df = 9, p-value = 0.001722
alternative hypothesis: serial correlation in idiosyncratic errors
The p-value of the test is below 0.01, thus we reject the null hypothesis of no serial correlation in the fixed effects model. This means that we will need to adjust our model’s standard errors.
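Besides the formal test, a quick visual check, as a sketch: pull the residuals from the fixed effects model and see whether they are correlated with their own one-year lags within each city. A clearly positive correlation points to positive serial correlation.
# Correlate FE residuals with their within-city one-year lags
res <- residuals(out.fe)                    # pseries with (city, year) index
tb.res <- tibble(city = plm::index(res)$city,
                 year = plm::index(res)$year,
                 resid = as.numeric(res)) %>%
  arrange(city, year) %>%
  group_by(city) %>%
  mutate(resid_lag = lag(resid)) %>%
  ungroup()
cor(tb.res$resid, tb.res$resid_lag, use = "complete.obs")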
4.3 Correcting SEs for Serial Correlation
In order to compute standard errors robust to serial correlation (and
heteroskedasticity) we can use the following:
vcov = plm::vcovHC(out.fe, method = "arellano")
# Correct SEs
robust_se <- sqrt(diag(plm::vcovHC(out.fe, method = "arellano")))
# Correct SEs in stargazer
stargazer( out.fe, out.fe
, se = list(NULL, robust_se)
, column.labels = c("Non-Robust SEs", "Robust SEs")
, type = "text"
, omit.stat = c("ser","f")
, omit = c('d.*')
, omit.labels = c('Time Dummies')
, model.names = FALSE
, dep.var.labels.include = FALSE
, no.space = TRUE
, header = FALSE
, title = "Serial correlation results")
Serial correlation results
=========================================
Dependent variable:
----------------------------
Non-Robust SEs Robust SEs
(1) (2)
-----------------------------------------
ez -0.104* -0.104
(0.055) (0.069)
-----------------------------------------
Time Dummies Yes Yes
-----------------------------------------
Observations 198 198
R2 0.842 0.842
Adjusted R2 0.813 0.813
=========================================
Note: *p<0.1; **p<0.05; ***p<0.01
Note that, after the standard-error correction, the impact of the enterprise zones on unemployment claims is no longer statistically significant at conventional levels.
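If you just want the corrected coefficient table without stargazer, one option, as a sketch, is coeftest from the lmtest package (loaded above) combined with the same robust variance matrix:
# Robust (Arellano) t-tests for the fixed effects model
coeftest(out.fe, vcov = plm::vcovHC(out.fe, method = "arellano"))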