Advanced Panel Data Models

  1. Consider the table summarizing the results of the different models (Pooled OLS, FD, and FE) from Lab 8. Why is the number of observations different in the FD model? Carefully explain.

  2. Test for the presence of serial correlation in the FE model. Interpret your results and, if necessary, re-estimate your FE model.

  3. Create a graph plotting the values of the estimated \(\beta\) coefficients for each year-dummy and their associated 95% confidence intervals. HINT: the graph should have \(year\) on the x-axis and the \(\beta\) values on the y-axis. Use the function geom_errorbar() to create the confidence intervals around each point.


Get started by loading libraries, reading data and estimating the panel models.

# Clean work environment 
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment

# Load packages
library(tidyverse)
library(stargazer)
library(ggthemes)
library(GGally)
library(skimr)
library(corrr)
library(knitr)
library(infer)
library(janitor)
library(plm)
library(lmtest)

# Load data, convert to tibble format, discard original
load("data/ezunem.RData")
tb.ez <- as_tibble(data)
rm(data)

# Pooled OLS
out.pl <- plm(luclms ~ ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88, data = tb.ez
              , index = c('city','year')
              , model = 'pooling')

# First differences (without intercept)
out.fd1 <- plm( luclms ~ 0 + ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88
              , data  = tb.ez
              , index = c('city','year')
              , model = 'fd')

# Fixed effects
out.fe <-plm(  luclms ~ 0 + ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88
             , data  = tb.ez
             , index = c('city','year')
             , model = 'within')


1 Consider the table summarizing the results of the different models (Pooled OLS, FD, and FE) from lab 8. Why is the number of observations different in the FD model? Carefully explain.

# Output pooling, fd, and fe results
stargazer(out.pl, out.fd1, out.fe
          , type = "text"
          , column.labels = c('Pooled OLS', 'FD', 'FE')
          , omit.stat = c("ser", "f")
          # hide all variables started by "d" or with the word "city"
          , omit = c('d.*','.*city.*') 
          , omit.labels=c('Time Dummies', 'City Dummies')
          , model.names = FALSE
          , no.space = TRUE
          , header = FALSE
          , title = "Summary results")
## 
## Summary results
## ==========================================
##                   Dependent variable:     
##              -----------------------------
##                         luclms            
##              Pooled OLS     FD       FE   
##                  (1)        (2)      (3)  
## ------------------------------------------
## ez             -0.039    -0.182**  -0.104*
##                (0.115)    (0.078)  (0.055)
## Constant      11.694***                   
##                (0.125)                    
## ------------------------------------------
## Time Dummies     Yes        Yes      Yes  
## City Dummies     No         No       No   
## ------------------------------------------
## Observations     198        176      198  
## R2              0.354      0.623    0.842 
## Adjusted R2     0.323      0.605    0.813 
## ==========================================
## Note:          *p<0.1; **p<0.05; ***p<0.01


In first-differences estimation, we subtract the data from \(t-1\) to the data from \(t\). This results in the loss of one period of data (the first period) for all units. As such, as long as no missing data is present in the dataset, the number of observations in a FD regression model will be: number of units x number of time periods - number of units. In the EZ case this means: \(22 \times 9 - 22 = 176\).


2 Test for the presence of serial correlation in the FD model. Interpret your results and, if necessary, re-estimate your FD model.

# Breusch–Godfrey Test of serial correlation
pwfdtest(out.fd1, h0 = c("fd"))
## 
##  Wooldridge's first-difference test for serial correlation in panels
## 
## data:  out.fd1
## F = 7.8246, df1 = 1, df2 = 152, p-value = 0.005821
## alternative hypothesis: serial correlation in differenced errors


The p-value of the test is below 0.01 and thus we reject the null hypothesis of no serial correlation in the FD model. We must correct the standard errors of our FD model. We will use the method = "arellano" to correct for both serial correlation and heteroskedasticity.


# Adjust standard errors
cov1         <- plm::vcovHC(out.fd1, method = "arellano")
robust_se    <- sqrt(diag(cov1))

# Correct SEs in stargazer
stargazer(  out.fd1, out.fd1
          , se = list(NULL, robust_se)
          , type = "text"
          , omit.stat = c("ser","f")
          , omit = c('d.*')
          , omit.labels = c('Time Dummies')
          , model.names = FALSE
          , dep.var.labels.include = FALSE
          , no.space = TRUE
          , header = FALSE
          , title = "Serial correlation results"
          , column.labels = c("Non-Robust SE", "Robust SEs"))
## 
## Serial correlation results
## =========================================
##                  Dependent variable:     
##              ----------------------------
##               Non-Robust SE   Robust SEs 
##                    (1)           (2)     
## -----------------------------------------
## ez              -0.182**       -0.182**  
##                  (0.078)       (0.088)   
## -----------------------------------------
## Time Dummies       Yes           Yes     
## -----------------------------------------
## Observations       176           176     
## R2                0.623         0.623    
## Adjusted R2       0.605         0.605    
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01


Even after the correction, the effect of EZ is significant at the level of 95%.


3 Create a graph plotting the values of the estimated \(\beta\) coefficients for each year-dummy and their associated 95% confidence intervals.

# Create coefficients and confident intervals data set
tb.plot <- tibble(dummies = coef(out.fd1)[2:length(coef(out.fd1))],
                  x = 81:88,
                 lb=confint(out.fd1)[2:length(coef(out.fd1)),1],
                 ub=confint(out.fd1)[2:length(coef(out.fd1)),2])
tb.plot
## # A tibble: 8 × 4
##   dummies     x       lb      ub
##     <dbl> <int>    <dbl>   <dbl>
## 1  -0.322    81 -0.412   -0.231 
## 2   0.135    82  0.00782  0.263 
## 3  -0.219    83 -0.376   -0.0629
## 4  -0.558    84 -0.743   -0.373 
## 5  -0.557    85 -0.770   -0.343 
## 6  -0.586    86 -0.818   -0.354 
## 7  -0.854    87 -1.10    -0.605 
## 8  -1.19     88 -1.46    -0.928
# Create plot
ggplot(tb.plot, aes(x = x, y = dummies)) +
  geom_line() + 
  geom_hline( yintercept = 0, col = 'Red', linetype = 'dashed') +
  geom_errorbar( aes( x = x, ymin = lb, ymax = ub), width=0.1) +
  theme_bw() + labs(x = 'year', y = 'Beta')