Advanced Panel Data Models
Consider the table summarizing the results of the different models (Pooled OLS, FD, and FE) from Lab 8. Why is the number of observations different in the FD model? Carefully explain.
Test for the presence of serial correlation in the FD model. Interpret your results and, if necessary, re-estimate your FD model.
Create a graph plotting the values of the estimated \(\beta\) coefficients for each year-dummy and their associated 95% confidence intervals. HINT: the graph should have \(year\) on the x-axis and the \(\beta\) values on the y-axis. Use the function geom_errorbar() to create the confidence intervals around each point.
Get started by loading libraries, reading data and estimating the panel models.
# Clean work environment
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
# Load packages
library(tidyverse)
library(stargazer)
library(ggthemes)
library(GGally)
library(skimr)
library(corrr)
library(knitr)
library(infer)
library(janitor)
library(plm)
library(lmtest)
# Load data, convert to tibble format, discard original
load("data/ezunem.RData")
tb.ez <- as_tibble(data)
rm(data)
# Pooled OLS
out.pl <- plm(luclms ~ ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88, data = tb.ez
, index = c('city','year')
, model = 'pooling')
# First differences (without intercept)
out.fd1 <- plm( luclms ~ 0 + ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88
, data = tb.ez
, index = c('city','year')
, model = 'fd')
# Fixed effects
out.fe <- plm( luclms ~ 0 + ez + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88
, data = tb.ez
, index = c('city','year')
, model = 'within')
1 Consider the table summarizing the results of the different models (Pooled OLS, FD, and FE) from Lab 8. Why is the number of observations different in the FD model? Carefully explain.
# Output pooling, fd, and fe results
stargazer(out.pl, out.fd1, out.fe
, type = "text"
, column.labels = c('Pooled OLS', 'FD', 'FE')
, omit.stat = c("ser", "f")
# hide all variables started by "d" or with the word "city"
, omit = c('d.*','.*city.*')
, omit.labels=c('Time Dummies', 'City Dummies')
, model.names = FALSE
, no.space = TRUE
, header = FALSE
, title = "Summary results")
##
## Summary results
## ==========================================
## Dependent variable:
## -----------------------------
## luclms
## Pooled OLS FD FE
## (1) (2) (3)
## ------------------------------------------
## ez -0.039 -0.182** -0.104*
## (0.115) (0.078) (0.055)
## Constant 11.694***
## (0.125)
## ------------------------------------------
## Time Dummies Yes Yes Yes
## City Dummies No No No
## ------------------------------------------
## Observations 198 176 198
## R2 0.354 0.623 0.842
## Adjusted R2 0.323 0.605 0.813
## ==========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In first-differences estimation, we subtract each unit's data at \(t-1\) from its data at \(t\). This results in the loss of one period of data (the first period) for every unit. Hence, as long as the panel is balanced and has no missing data, the number of observations in an FD regression is \(N \times T - N\), where \(N\) is the number of units and \(T\) the number of time periods. In the EZ case this means \(22 \times 9 - 22 = 176\).
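We can verify this count directly from the data. The sketch below (assuming, as the 198 pooled observations suggest, that the tb.ez panel is balanced over its city and year index variables) first-differences luclms by hand: each city's first year becomes NA, leaving 176 usable observations.
# Sketch: first-differencing luclms by hand drops each city's first year
tb.ez %>%
  group_by(city) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(d_luclms = luclms - lag(luclms)) %>%  # NA in each city's first year
  ungroup() %>%
  summarise(n_total = n(),                     # 22 cities x 9 years = 198
            n_fd    = sum(!is.na(d_luclms)))   # 198 - 22 = 176 usable differences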
2 Test for the presence of serial correlation in the FD model. Interpret your results and, if necessary, re-estimate your FD model.
# Wooldridge's first-difference test for serial correlation
pwfdtest(out.fd1, h0 = c("fd"))
##
## Wooldridge's first-difference test for serial correlation in panels
##
## data: out.fd1
## F = 7.8246, df1 = 1, df2 = 152, p-value = 0.005821
## alternative hypothesis: serial correlation in differenced errors
The p-value of the test is below 0.01, so we reject the null hypothesis of no serial correlation in the differenced errors. We therefore need to correct the standard errors of our FD model. We will use vcovHC() with method = "arellano", which makes the standard errors robust to both serial correlation and heteroskedasticity.
# Adjust standard errors
cov1 <- plm::vcovHC(out.fd1, method = "arellano")
robust_se <- sqrt(diag(cov1))
# Correct SEs in stargazer
stargazer( out.fd1, out.fd1
, se = list(NULL, robust_se)
, type = "text"
, omit.stat = c("ser","f")
, omit = c('d.*')
, omit.labels = c('Time Dummies')
, model.names = FALSE
, dep.var.labels.include = FALSE
, no.space = TRUE
, header = FALSE
, title = "Serial correlation results"
, column.labels = c("Non-Robust SE", "Robust SEs"))
##
## Serial correlation results
## =========================================
## Dependent variable:
## ----------------------------
## Non-Robust SE Robust SEs
## (1) (2)
## -----------------------------------------
## ez -0.182** -0.182**
## (0.078) (0.088)
## -----------------------------------------
## Time Dummies Yes Yes
## -----------------------------------------
## Observations 176 176
## R2 0.623 0.623
## Adjusted R2 0.605 0.605
## =========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Even after the correction, the effect of EZ remains statistically significant at the 5% level.
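As a cross-check, the same conclusion can be read off a robust t-test. A minimal sketch using coeftest() from the lmtest package (loaded above) together with the Arellano covariance matrix cov1 computed earlier:
# Sketch: t-test of ez with serial-correlation- and heteroskedasticity-robust SEs
coeftest(out.fd1, vcov. = cov1)["ez", ]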
3 Create a graph plotting the values of the estimated \(\beta\) coefficients for each year-dummy and their associated 95% confidence intervals.
# Create data set of coefficients and their 95% confidence intervals
tb.plot <- tibble(dummies = coef(out.fd1)[2:length(coef(out.fd1))],
x = 81:88,
lb=confint(out.fd1)[2:length(coef(out.fd1)),1],
ub=confint(out.fd1)[2:length(coef(out.fd1)),2])
tb.plot
## # A tibble: 8 × 4
## dummies x lb ub
## <dbl> <int> <dbl> <dbl>
## 1 -0.322 81 -0.412 -0.231
## 2 0.135 82 0.00782 0.263
## 3 -0.219 83 -0.376 -0.0629
## 4 -0.558 84 -0.743 -0.373
## 5 -0.557 85 -0.770 -0.343
## 6 -0.586 86 -0.818 -0.354
## 7 -0.854 87 -1.10 -0.605
## 8 -1.19 88 -1.46 -0.928
# Create plot
ggplot(tb.plot, aes(x = x, y = dummies)) +
  geom_line() +
  geom_hline(yintercept = 0, col = 'red', linetype = 'dashed') +
  geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) +
  theme_bw() +
  labs(x = 'year', y = 'Beta')
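As a side note, the data for this plot can also be assembled without hard-coding coefficient positions, which is less fragile if the specification changes. A sketch, assuming the year dummies are the only regressors whose names start with "d":
# Sketch: select year-dummy coefficients by name rather than by position
ci   <- confint(out.fd1)
keep <- grepl("^d", rownames(ci))
tb.plot2 <- tibble(x       = as.integer(sub("^d", "", rownames(ci)[keep])),
                   dummies = coef(out.fd1)[keep],
                   lb      = ci[keep, 1],
                   ub      = ci[keep, 2])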