Hypothesis Testing in Panel Data

class: center, middle, inverse, title-slide

.title[
# Hypothesis Testing in Panel Data
]
.author[
### Rogers Ochenge
]
.date[
### 2023/10 (updated: 2023-11-27)
]

---

## Introduction

- The double dimensionality of panel data allows for much richer specification than simple cross-sections or time series.

- This is both a blessing and a curse, given how much more complicated the specification may become. In fact, all possible features of either cross-sections or time series can coexist with individual and time effects.

- The specification problem of panel models is typically associated with the presence or absence of individual effects, i.e., the need to account for unobserved heterogeneity.

---

## Tests on Individual and/or Time Effects

- The pooled regression model is a restricted model in which all entities are assumed to be homogeneous. Therefore, the regression coefficients in the behavioral equation are the same over time and across entities.

- The unrestricted model, on the other hand, assumes that the entities are fully heterogeneous, in the sense that the regression coefficients of the same behavioral equation differ across cross-section units or over time. The question is therefore whether the parameters of a model vary over the years or across the cross-section units.

- For the unrestricted model, the regression equation for each cross-section unit is
`$$y_{i}=X_{i}\beta_{i}+u_{i}$$`
--

- The coefficient `$\beta_{i}$` is different for every cross-section unit. Here, the following null hypothesis is to be tested `$$H_{0}:\beta_{i}=\beta$$`

---

- The alternative hypothesis is that at least two parameters are not equal. Under the `$H_{0}$`, the model becomes restricted: `$$y_{i}=X_{i}\beta +u_{i}$$`

- If this restriction holds, the data can be pooled across the cross-section units. We can test for the poolability of the data using the Chow test, as described below.

---

# Chow Poolability Test

- An extension of the Chow test for structural change
- We examine if the slopes are the same for all the cross-section units or over time
- In a fixed or random effects regression the slopes remain constant; only the intercept (in fixed effects) and variances (in random effects) vary.

## Chow poolability test: same slopes across cross section units

- Step 1: Run separate time series regressions for each cross section unit and compute the sum of squared residuals (i.e., the unrestricted regressions)

- Step 2: Run pooled OLS using the full panel and compute the sum of squared residuals

- Step 3: Use the sums of squared residuals in Steps 1 and 2 to compute the F-statistic for poolability

---

## Chow poolability test: same slopes across years

- Step 1: Run separate cross section regressions for each year and compute the sum of squared residuals (i.e. the unrestricted regressions)

- Step 2: Run pooled OLS using the full panel and compute the sum of squared residuals

- Step 3: Use the sums of squared residuals in Steps 1 and 2 to compute the F-statistic for poolability

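- F statistic `$$F=\dfrac{(SSE_{r}-SSE_{u})/q}{SSE_{u}/df_{u}}$$` where `$q$` is the number of restrictions and `$df_{u}$` is the residual degrees of freedom of the unrestricted model. A restricted model is one in which some coefficients are constrained, for example assumed to be zero (as in any model with an omitted variable) or, as in the poolability test, assumed to be equal across units or years.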
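
The three steps and the F statistic can be sketched by hand in R. This is only an illustration under stated assumptions: it uses the Grunfeld data that ships with `plm`, the formula `inv ~ capital`, and it lets both the intercept and the slope differ across firms, so the values of `q` and `df_u` follow from that choice. All object names are made up for this sketch.

```r
# Assumes the plm package is installed; it is used here only for its data.
data("Grunfeld", package = "plm")

# Step 1: unrestricted model, one time-series regression per firm
sse_by_firm <- sapply(split(Grunfeld, Grunfeld$firm), function(d) {
  sum(residuals(lm(inv ~ capital, data = d))^2)
})
sse_u <- sum(sse_by_firm)

# Step 2: restricted model, pooled OLS on the full panel
sse_r <- sum(residuals(lm(inv ~ capital, data = Grunfeld))^2)

# Step 3: F statistic for poolability
n_firms <- length(sse_by_firm)               # number of cross-section units
n_coef  <- 2                                 # intercept and slope per regression
q       <- (n_firms - 1) * n_coef            # number of restrictions
df_u    <- nrow(Grunfeld) - n_firms * n_coef # residual df, unrestricted model

f_stat  <- ((sse_r - sse_u) / q) / (sse_u / df_u)
p_value <- pf(f_stat, q, df_u, lower.tail = FALSE)
c(F = f_stat, df1 = q, df2 = df_u, p = p_value)
```

If only the slopes are to be restricted while the intercepts stay firm-specific, the restricted model in Step 2 would be the fixed effects regression instead of pooled OLS, and `q` would shrink accordingly.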

---

The plm package in R provides a function for the poolability test, which takes just three steps:

```r
# `y`, `x` and `dataset` are placeholders for your own variables and panel data

# 1. Run a normal OLS model with fixed effects (model = "within")
plm_model <- plm(y ~ x, data = dataset, model = "within")

# 2. Run a variable coefficients model with fixed effects (model = "within")
pvcm_model <- pvcm(y ~ x, data = dataset, model = "within")

# 3. Run the poolability test
pooltest(plm_model, pvcm_model)
```

---

- The null hypothesis is that the dataset is poolable (i.e. individuals have the same slope coefficients), so if `$p<0.05$` you reject the null and you need a variable coefficients model.

## Grunfeld: Grunfeld's Investment Data

A balanced panel of 10 observational units (firms) from 1935 to 1954.

A data frame containing:

- `firm`: observation (firm identifier)
- `year`: date
- `inv`: gross investment
- `value`: value of the firm
- `capital`: stock of plant and equipment
---
# Required packages

```r
library(ggplot2)
library(dplyr)
```

```
## 
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

```r
library(plm)
```

```
## 
## Attaching package: 'plm'
```

```
## The following objects are masked from 'package:dplyr':
## 
##     between, lag, lead
```

```r
#library(lfe)
library(lmtest)
```

```
## Warning: package 'lmtest' was built under R version 4.0.5
```

```
## Loading required package: zoo
```

```
## 
## Attaching package: 'zoo'
```

```
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
```

```r
library(car)
```

```
## Loading required package: carData
```

```
## 
## Attaching package: 'car'
```

```
## The following object is masked from 'package:dplyr':
## 
##     recode
```

```r
library(geepack)
library(knitr) # for `kable()`
library(broom) # for `tidy()`, used below to tabulate model results
library(AER)
```

```
## Loading required package: sandwich
```

```
## Loading required package: survival
```

```r
library(xtable)
```

---

```r
data("Grunfeld", package = "plm")
```

```r
Grunfeld %>% head()
```

```
##   firm year   inv  value capital
## 1    1 1935 317.6 3078.5     2.8
## 2    1 1936 391.8 4661.7    52.6
## 3    1 1937 410.6 5387.1   156.9
## 4    1 1938 257.7 2792.2   209.2
## 5    1 1939 330.8 4313.2   203.4
## 6    1 1940 461.2 4643.9   207.2
```

- A panel is usually recognisable by multiple entries (rows) for the same entity (firm, person, …) in the dataset.
- The multiple entries arise because the entity was observed in different time periods. To check this, I create a two-way table of the variables `firm` and `year`.
- In total there are 10 different firms, each with one observation per year from 1935 to 1954.
- Since the table has no holes (zeros), that is, no firm lacks information in any year, this panel is balanced.

---

```r
Grunfeld %>%
  select(year, firm) %>%
  table()
```

```
##       firm
## year   1 2 3 4 5 6 7 8 9 10
##   1935 1 1 1 1 1 1 1 1 1  1
##   1936 1 1 1 1 1 1 1 1 1  1
##   1937 1 1 1 1 1 1 1 1 1  1
##   1938 1 1 1 1 1 1 1 1 1  1
##   1939 1 1 1 1 1 1 1 1 1  1
##   1940 1 1 1 1 1 1 1 1 1  1
##   1941 1 1 1 1 1 1 1 1 1  1
##   1942 1 1 1 1 1 1 1 1 1  1
##   1943 1 1 1 1 1 1 1 1 1  1
##   1944 1 1 1 1 1 1 1 1 1  1
##   1945 1 1 1 1 1 1 1 1 1  1
##   1946 1 1 1 1 1 1 1 1 1  1
##   1947 1 1 1 1 1 1 1 1 1  1
##   1948 1 1 1 1 1 1 1 1 1  1
##   1949 1 1 1 1 1 1 1 1 1  1
##   1950 1 1 1 1 1 1 1 1 1  1
##   1951 1 1 1 1 1 1 1 1 1  1
##   1952 1 1 1 1 1 1 1 1 1  1
##   1953 1 1 1 1 1 1 1 1 1  1
##   1954 1 1 1 1 1 1 1 1 1  1
```

---

With package `plm` this can be examined with function `is.pbalanced()`.

```r
Grunfeld %>%
  is.pbalanced()
```

```
## [1] TRUE
```

---

# Visualization-Naive Plot

- A simple line plot of gross investment (`inv`) over time (`year`) that ignores the entity dimension does not give meaningful results: a single line is drawn through the observations of all firms.

---

```r
ggplot(data = Grunfeld, aes(x = year, y = inv)) +
  geom_line() +
  labs(x = "Year",  y = "Gross Investment") +
  theme(legend.position = "none")
```

![](Lec-3-Hypothesis-Testing_files/figure-html/unnamed-chunk-6-1.png)

---

# Separated Plot Line

Therefore, I create a line plot for each firm, distinguished by colour. The additional blue dashed line indicates the overall trend in the data across all firms; it is the fitted regression line of a linear model of `inv` on `year`. Firms 1 and 2 have relatively high gross investment compared to the other firms. On average, gross investment increases over time.
---

```
## `geom_smooth()` using formula 'y ~ x'
```

![](Lec-3-Hypothesis-Testing_files/figure-html/unnamed-chunk-7-1.png)
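
The code behind this figure is not shown in the slides. A minimal sketch that should give a comparable plot is below; the object name `p_by_firm` and the exact styling are assumptions, not the original chunk.

```r
library(ggplot2)
data("Grunfeld", package = "plm")

p_by_firm <- ggplot(Grunfeld, aes(x = year, y = inv)) +
  # one line per firm, separated by colour
  geom_line(aes(colour = factor(firm))) +
  # overall linear trend of inv on year across all firms
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              linetype = "dashed", colour = "blue") +
  labs(x = "Year", y = "Gross Investment") +
  theme(legend.position = "none")

p_by_firm
```

Mapping `colour` only inside `geom_line()` keeps the grouping by firm away from `geom_smooth()`, so the dashed line is fitted to all firms at once.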

---

# Entity Heterogeneity

- Heterogeneity across firms can be shown with a line plot. The blue line connects the mean values of inv, using all available years across firms (entities).

---

```
## Joining, by = "firm"
```

![](Lec-3-Hypothesis-Testing_files/figure-html/unnamed-chunk-8-1.png)

---

# Pooled OLS

```r
pooled_ols <- plm(inv ~ capital, data = Grunfeld, 
                      index = c("firm", "year"), 
                      effect = "individual", model = "pooling")

kable(tidy(pooled_ols), digits=3, 
      caption="POLS model")
```

Table: POLS model

|term        | estimate| std.error| statistic| p.value|
|:-----------|--------:|---------:|---------:|-------:|
|(Intercept) |   14.236|    15.639|     0.910|   0.364|
|capital     |    0.477|     0.038|    12.447|   0.000|

---

# Scatter plot
With a scatterplot it is easy to see that, although firms could be distinguished by the variable `firm`, pooled OLS treats every observation as if it came from a different entity, ignoring which observations belong to the same firm, and fits a single regression line accordingly.

---

```
## `geom_smooth()` using formula 'y ~ x'
```

![](Lec-3-Hypothesis-Testing_files/figure-html/unnamed-chunk-10-1.png)

---

# Fixed Effects Model

- The fixed effects (FE) model, also called within estimator or least squares dummy variable (LSDV) model, is commonly applied to remove omitted variable bias. 
- By estimating changes within a specific group (over time), all time-invariant differences between entities (individuals, firms, …) are controlled for. For example:
    - the unobserved ability of the management influencing the firm’s revenue,
    - or the skills influencing an employee’s wage.

---

- The assumption behind the FE model is that some entity-specific characteristic influences the independent variables and must be controlled for (the individual effect and the independent variables are correlated).
- Hence, the FE model removes characteristics that do not change over time, so that they no longer bias the estimated effects of the remaining regressors on the dependent variable.
- If unobserved characteristics do not change over time, each change in the dependent variable must be due to influences other than these fixed characteristics, which are controlled for. Under this assumption, the FE model is well suited to investigating causal relationships.
- Note that the influence of time-invariant regressors on the dependent variable cannot be examined with a FE model. FE models also work poorly when the data have little within-variance or when variables change only slowly over time.

---

- With function `lm()` a FE model can be estimated by including dummy variables for all firms. This is the so-called least squares dummy variable (LSDV) approach.
- We have shown before that the factor variable firm uniquely identifies each firm in the dataset. Similarly to the pooled OLS model, we are regressing `inv` on capital. 
- If there is a large number of individuals, the LSDV method is expensive from a computational point of view.

---

```r
fe_model_lm <- lm(inv ~ capital + factor(firm), data = Grunfeld)

kable(tidy(fe_model_lm), digits=3, caption = "LSDV Model")
```

Table: LSDV Model

|term           | estimate| std.error| statistic| p.value|
|:--------------|--------:|---------:|---------:|-------:|
|(Intercept)    |  367.613|    18.967|    19.382|   0.000|
|capital        |    0.371|     0.019|    19.143|   0.000|
|factor(firm)2  |  -66.455|    21.236|    -3.129|   0.002|
|factor(firm)3  | -413.682|    20.668|   -20.015|   0.000|
|factor(firm)4  | -326.441|    22.546|   -14.479|   0.000|
|factor(firm)5  | -486.278|    20.344|   -23.903|   0.000|
|factor(firm)6  | -350.866|    22.697|   -15.459|   0.000|
|factor(firm)7  | -436.783|    21.114|   -20.687|   0.000|
|factor(firm)8  | -356.472|    22.866|   -15.589|   0.000|
|factor(firm)9  | -436.170|    21.217|   -20.558|   0.000|
|factor(firm)10 | -366.731|    23.641|   -15.512|   0.000|

- Remember that one firm dummy variable is dropped to avoid the dummy variable trap.
---

- Next, we estimate the same model but drop the constant (intercept) by adding `-1` to the formula, so that no coefficient (level) of `firm` is excluded. Note that this does not alter the coefficient estimate of `capital`!

```r
fe_model_lm_nocons <- lm(inv ~ capital + factor(firm) -1, data = Grunfeld)
kable(tidy(fe_model_lm_nocons ), digits=3, caption = "LSDV (Noconstant) Model")
```

Table: LSDV (Noconstant) Model

|term           | estimate| std.error| statistic| p.value|
|:--------------|--------:|---------:|---------:|-------:|
|capital        |    0.371|     0.019|    19.143|   0.000|
|factor(firm)1  |  367.613|    18.967|    19.382|   0.000|
|factor(firm)2  |  301.158|    15.318|    19.660|   0.000|
|factor(firm)3  |  -46.069|    16.189|    -2.846|   0.005|
|factor(firm)4  |   41.172|    14.406|     2.858|   0.005|
|factor(firm)5  | -118.665|    17.056|    -6.957|   0.000|
|factor(firm)6  |   16.747|    14.357|     1.167|   0.245|
|factor(firm)7  |  -69.170|    15.467|    -4.472|   0.000|
|factor(firm)8  |   11.141|    14.310|     0.778|   0.437|
|factor(firm)9  |  -68.557|    15.340|    -4.469|   0.000|
|factor(firm)10 |    0.882|    14.214|     0.062|   0.951|

---

- Due to the firm dummy variables, each firm has its own intercept with the y-axis. For comparison, we plot the fitted values from the pooled OLS model (blue dashed line).
- Its slope is steeper than that of the LSDV approach, as influential observations of firm 1 bias it upward.

![](Lec-3-Hypothesis-Testing_files/figure-html/unnamed-chunk-13-1.png)

---

- The same coefficient estimates as with the LSDV approach can be computed with function `plm()`. 
- The argument `model=` is now set to `"within"`. 
- This is the within estimator with n entity-specific intercepts.
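
Why the name "within" fits can be sketched by hand: subtracting each firm's mean from `inv` and `capital` and regressing the demeaned variables on each other (without intercept) should reproduce the LSDV slope of `capital`. The object names are made up for this sketch.

```r
data("Grunfeld", package = "plm")

# Within transformation: deviations from each firm's own mean
inv_dm     <- Grunfeld$inv     - ave(Grunfeld$inv,     Grunfeld$firm)
capital_dm <- Grunfeld$capital - ave(Grunfeld$capital, Grunfeld$firm)

# OLS on the demeaned data; the intercept is dropped because all means are zero
within_by_hand <- lm(inv_dm ~ capital_dm - 1)
coef(within_by_hand)
```

Only the point estimate carries over. The standard error reported by this `lm()` call does not account for the 10 estimated firm means, so `plm()` or the LSDV regression should be used for inference.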

---

```r
fe_model_plm <- plm(inv ~ capital, data = Grunfeld, 
                    index = c("firm", "year"), 
                    effect = "individual", model = "within")

summary(fe_model_plm)
```

```
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = inv ~ capital, data = Grunfeld, effect = "individual", 
##     model = "within", index = c("firm", "year"))
## 
## Balanced Panel: n = 10, T = 20, N = 200
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -190.71466  -20.83474   -0.45862   21.38262  293.68714 
## 
## Coefficients:
##         Estimate Std. Error t-value  Pr(>|t|)    
## capital 0.370750   0.019368  19.143 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    2244400
## Residual Sum of Squares: 763680
## R-Squared:      0.65973
## Adj. R-Squared: 0.64173
## F-statistic: 366.446 on 1 and 189 DF, p-value: < 2.22e-16
```

---

- The coefficient of `capital` indicates how much `inv` changes over time, on average per firm, when `capital` increases by one unit.

- With function `fixef()` the fixed effects, i.e. the constants for each firm, can be extracted. Compare them with the coefficients of the LSDV approach (without the constant): they must be identical.

```r
fixef(fe_model_plm)
```

```
##          1          2          3          4          5          6          7 
##  367.61297  301.15762  -46.06917   41.17196 -118.66544   16.74738  -69.17024 
##          8          9         10 
##   11.14050  -68.55731    0.88169
```

---

### Poolability test (Testing for fixed effects)
- The one-way error component fixed effects model without intercept term is specified as: `$$y_{it}=\mu_{i}+x_{it}^{\prime}\beta+\epsilon_{it}$$`
- To test the validity of the fixed effects, we can test the joint significance of the dummies by performing an F test:
`$$H_{0}:\mu_{1}=\mu_{2}=\dots=\mu_{N}$$` `$$H_{1}:\mu_{i}\neq \mu_{j}\ \text{for some } i\neq j$$`
- The null hypothesis supports the pooled regression with panel data, where all entities share one intercept (`$N-1$` restrictions). The alternative favors the fixed effects model.
- The F-test compares the fixed effects model with the pooled regression model by calculating how the goodness of fit changes because of the restriction imposed in `$H_{0}$`.

---

- With function `pFtest()` one can test for fixed effects, with the null hypothesis that pooled OLS is adequate (no individual effects).
- Alternatively, this test can be carried out by jointly assessing the significance of the dummy variables in the LSDV approach. The results are identical.

```r
# Within estimator vs. Pooled OLS
pFtest(fe_model_plm, pooled_ols)
```

```
## 
## 	F test for individual effects
## 
## data:  inv ~ capital
## F = 123.39, df1 = 9, df2 = 189, p-value < 2.2e-16
## alternative hypothesis: significant effects
```
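
The LSDV alternative mentioned above can be sketched with base R: `anova()` on two nested `lm()` fits tests the firm dummies jointly. The object names are assumptions made for this sketch.

```r
data("Grunfeld", package = "plm")

# Restricted model: pooled OLS with one common intercept
lm_pooled <- lm(inv ~ capital, data = Grunfeld)

# Unrestricted model: LSDV with a dummy for each firm
lm_lsdv <- lm(inv ~ capital + factor(firm), data = Grunfeld)

# Joint F test of the 9 firm dummies
f_dummies <- anova(lm_pooled, lm_lsdv)
f_dummies
```

The second row of the table holds the F statistic and its degrees of freedom, which can be compared with the `pFtest()` output above.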

---

## First-difference Estimator

- There is another way of estimating a FE model: specifying `model = "fd"` in function `plm()`, which uses first differences.

---

```r
fe_model_fd<- plm(inv ~ capital -1, data = Grunfeld,
                  index = c("firm", "year"), 
                  effect = "individual", model = "fd")

summary(fe_model_fd)
```

```
## Oneway (individual) effect First-Difference Model
## 
## Call:
## plm(formula = inv ~ capital - 1, data = Grunfeld, effect = "individual", 
##     model = "fd", index = c("firm", "year"))
## 
## Balanced Panel: n = 10, T = 20, N = 200
## Observations used in estimation: 190
## 
## Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -240.4   -11.7     0.1     3.5    12.6   333.2 
## 
## Coefficients:
##         Estimate Std. Error t-value Pr(>|t|)    
## capital 0.230780   0.059639  3.8696  0.00015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    584410
## Residual Sum of Squares: 561210
## R-Squared:      0.04476
## Adj. R-Squared: 0.04476
## F-statistic: 14.9739 on 1 and 189 DF, p-value: 0.00014998
```

---

- The coefficient of capital is now different compared to the LSDV approach and within-groups estimator. This is because the coefficients and standard errors of the first-differenced model are only identical to the previously obtained results when there are two time periods.
- For longer time series, both the coefficients and the standard errors will be different.

## Two Time Periods

- Let’s verify this claim by dropping all years except 1935 and 1936 from the Grunfeld dataset and estimating the model again (also for the within model).

---

```r
# Within estimation (two periods)
fe_model_plm_check <- plm(inv ~ capital, 
                          data = Grunfeld, 
                          subset = year %in% c(1935, 1936), 
                          index = c("firm", "year"), 
                          effect = "individual", model = "within")

lmtest::coeftest(fe_model_plm_check)
```

```
## 
## t test of coefficients:
## 
##         Estimate Std. Error t value Pr(>|t|)
## capital  0.91353    0.85333  1.0705   0.3122
```

---

```r
# FD estimation (two periods)
fe_model_fd_check<- plm(inv ~ capital -1,
                        data = Grunfeld, 
                        subset = year %in% c(1935, 1936), 
                        index = c("firm", "year"), 
                        effect = "individual", model = "fd")

lmtest::coeftest(fe_model_fd_check)
```

```
## 
## t test of coefficients:
## 
##         Estimate Std. Error t value Pr(>|t|)
## capital  0.91353    0.85333  1.0705   0.3122
```

---

## Random Effects Model

- The RE model (also called partial pooling model) assumes, in contrast to the FE model, that any variation between entities is random and not correlated with the regressors used in the estimation model.
- If there are reasons to believe that differences between entities influence the dependent variable, a RE model should be preferred.
- This also means that time-invariant variables (like a person’s gender) can be taken into account as regressors. The entity’s error term (unobserved heterogeneity) is hence assumed not to be correlated with the regressors.
- To break down the difference between FE and RE:
    - the FE model assumes that an individual (entity) specific effect is correlated with the independent variables,
    - while the RE model assumes that an individual (entity) specific effect is not correlated with the independent variables.
- With function `plm()` the RE model can be estimated. The argument `model=` is set to value `"random"`.

---

```r
re_model_plm <- plm(inv ~ capital, data = Grunfeld, 
                    index = c("firm", "year"), 
                    effect = "individual", model = "random")

summary(re_model_plm)
```

```
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = inv ~ capital, data = Grunfeld, effect = "individual", 
##     model = "random", index = c("firm", "year"))
## 
## Balanced Panel: n = 10, T = 20, N = 200
## 
## Effects:
##                    var  std.dev share
## idiosyncratic  4040.63    63.57 0.135
## individual    25949.52   161.09 0.865
## theta: 0.9121
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -164.0821  -22.2955   -3.7463   16.9121  319.9564 
## 
## Coefficients:
##              Estimate Std. Error z-value Pr(>|z|)    
## (Intercept) 43.246697  51.411319  0.8412   0.4002    
## capital      0.372120   0.019316 19.2652   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    2299300
## Residual Sum of Squares: 799910
## R-Squared:      0.65211
## Adj. R-Squared: 0.65036
## Chisq: 371.149 on 1 DF, p-value: < 2.22e-16
```

---

The coefficients in the RE model include both the within-entity and between-entity effects. With data on multiple entities and time periods, the coefficient of `capital` represents the average effect on `inv` when `capital` changes by one unit across years and between firms.
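
How strongly the RE estimator leans towards the within estimator is summarised by the `theta` printed in the summary above. For a balanced panel the textbook formula is `$\theta = 1-\sqrt{\sigma_{\epsilon}^{2}/(\sigma_{\epsilon}^{2}+T\sigma_{\mu}^{2})}$`; this formula is an assumption brought in for illustration and is not stated in the slides. The sketch plugs in the variance components reported in the output.

```r
# Variance components copied from the summary output above
sigma2_idios <- 4040.63    # idiosyncratic variance
sigma2_indiv <- 25949.52   # individual (firm) variance
n_years      <- 20         # T in the balanced Grunfeld panel

# Share of each firm mean that the RE transformation subtracts
theta <- 1 - sqrt(sigma2_idios / (sigma2_idios + n_years * sigma2_indiv))
round(theta, 4)
```

A `theta` near 1 means the RE estimates are close to the within (FE) estimates, while a `theta` of 0 would reduce the RE model to pooled OLS. This is consistent with the RE and FE slopes of `capital` being nearly equal here.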

---

## Including the Time Dimension

- The pooled OLS model may be enhanced with the time dimension by including appropriate dummy variables.
- In the `Grunfeld` dataset the factor variable `year` contains information for the time dimension.
- Remember that one level of a factor variable will be held out when estimating the model. By controlling for `year`, the coefficient of `capital` changes compared to the initial pooled OLS model.

---

```r
pooled_ols_time_plm <- plm(inv ~ capital + factor(year), data = Grunfeld, 
  index = c("firm", "year"), effect = "individual",
        model = "pooling")

kable(tidy(pooled_ols_time_plm), digits=3, caption = "POLS with Time Effects")
```

Table: POLS with Time Effects

|term             | estimate| std.error| statistic| p.value|
|:----------------|--------:|---------:|---------:|-------:|
|(Intercept)      |   39.207|    52.867|     0.742|   0.459|
|capital          |    0.538|     0.046|    11.590|   0.000|
|factor(year)1936 |   22.461|    74.656|     0.301|   0.764|
|factor(year)1937 |   27.899|    74.677|     0.374|   0.709|
|factor(year)1938 |  -36.689|    74.739|    -0.491|   0.624|
|factor(year)1939 |  -42.401|    74.779|    -0.567|   0.571|
|factor(year)1940 |  -11.429|    74.788|    -0.153|   0.879|
|factor(year)1941 |    5.330|    74.843|     0.071|   0.943|
|factor(year)1942 |  -26.252|    74.942|    -0.350|   0.727|
|factor(year)1943 |  -36.399|    74.984|    -0.485|   0.628|
|factor(year)1944 |  -32.389|    74.977|    -0.432|   0.666|
|factor(year)1945 |  -33.057|    75.009|    -0.441|   0.660|
|factor(year)1946 |   -3.631|    75.077|    -0.048|   0.961|
|factor(year)1947 |  -57.808|    75.520|    -0.765|   0.445|
|factor(year)1948 |  -73.111|    75.832|    -0.964|   0.336|
|factor(year)1949 | -106.844|    76.137|    -1.403|   0.162|
|factor(year)1950 | -105.875|    76.326|    -1.387|   0.167|
|factor(year)1951 |  -69.251|    76.547|    -0.905|   0.367|
|factor(year)1952 |  -76.610|    77.200|    -0.992|   0.322|
|factor(year)1953 |  -67.677|    78.217|    -0.865|   0.388|
|factor(year)1954 | -112.634|    79.408|    -1.418|   0.158|

---

- For the LSDV approach the time dimension is added in the same fashion as with pooled OLS.

```r
lsdv_time_lm <- lm(inv ~ capital + factor(firm) + factor(year), 
                 data = Grunfeld)

kable(tidy(lsdv_time_lm), digits=3)
```

|term             | estimate| std.error| statistic| p.value|
|:----------------|--------:|---------:|---------:|-------:|
|(Intercept)      |  354.917|    26.085|    13.606|   0.000|
|capital          |    0.414|     0.026|    15.929|   0.000|
|factor(firm)2    |  -51.233|    21.579|    -2.374|   0.019|
|factor(firm)3    | -402.993|    20.564|   -19.597|   0.000|
|factor(firm)4    | -303.744|    23.851|   -12.735|   0.000|
|factor(firm)5    | -479.318|    19.973|   -23.998|   0.000|
|factor(firm)6    | -327.439|    24.106|   -13.583|   0.000|
|factor(firm)7    | -422.426|    21.362|   -19.774|   0.000|
|factor(firm)8    | -332.243|    24.394|   -13.620|   0.000|
|factor(firm)9    | -421.079|    21.546|   -19.543|   0.000|
|factor(firm)10   | -339.071|    25.688|   -13.200|   0.000|
|factor(year)1936 |   23.940|    27.617|     0.867|   0.387|
|factor(year)1937 |   32.948|    27.635|     1.192|   0.235|
|factor(year)1938 |  -27.093|    27.688|    -0.979|   0.329|
|factor(year)1939 |  -30.798|    27.721|    -1.111|   0.268|
|factor(year)1940 |    0.583|    27.729|     0.021|   0.983|
|factor(year)1941 |   19.584|    27.775|     0.705|   0.482|
|factor(year)1942 |   -8.639|    27.859|    -0.310|   0.757|
|factor(year)1943 |  -17.568|    27.893|    -0.630|   0.530|
|factor(year)1944 |  -13.759|    27.887|    -0.493|   0.622|
|factor(year)1945 |  -13.525|    27.914|    -0.485|   0.629|
|factor(year)1946 |   17.699|    27.972|     0.633|   0.528|
|factor(year)1947 |  -27.241|    28.343|    -0.961|   0.338|
|factor(year)1948 |  -37.430|    28.602|    -1.309|   0.192|
|factor(year)1949 |  -66.762|    28.854|    -2.314|   0.022|
|factor(year)1950 |  -63.286|    29.011|    -2.181|   0.031|
|factor(year)1951 |  -23.910|    29.192|    -0.819|   0.414|
|factor(year)1952 |  -23.914|    29.725|    -0.805|   0.422|
|factor(year)1953 |   -5.127|    30.546|    -0.168|   0.867|
|factor(year)1954 |  -40.105|    31.492|    -1.273|   0.205|

---

```r
fe_time_plm <- plm(inv ~ capital + factor(year), data = Grunfeld, 
  index = c("firm", "year"), effect = "individual",
        model = "within")

kable(tidy(fe_time_plm ), digits=3)
```

|term             | estimate| std.error| statistic| p.value|
|:----------------|--------:|---------:|---------:|-------:|
|capital          |    0.414|     0.026|    15.929|   0.000|
|factor(year)1936 |   23.940|    27.617|     0.867|   0.387|
|factor(year)1937 |   32.948|    27.635|     1.192|   0.235|
|factor(year)1938 |  -27.093|    27.688|    -0.979|   0.329|
|factor(year)1939 |  -30.798|    27.721|    -1.111|   0.268|
|factor(year)1940 |    0.583|    27.729|     0.021|   0.983|
|factor(year)1941 |   19.584|    27.775|     0.705|   0.482|
|factor(year)1942 |   -8.639|    27.859|    -0.310|   0.757|
|factor(year)1943 |  -17.568|    27.893|    -0.630|   0.530|
|factor(year)1944 |  -13.759|    27.887|    -0.493|   0.622|
|factor(year)1945 |  -13.525|    27.914|    -0.485|   0.629|
|factor(year)1946 |   17.699|    27.972|     0.633|   0.528|
|factor(year)1947 |  -27.241|    28.343|    -0.961|   0.338|
|factor(year)1948 |  -37.430|    28.602|    -1.309|   0.192|
|factor(year)1949 |  -66.762|    28.854|    -2.314|   0.022|
|factor(year)1950 |  -63.286|    29.011|    -2.181|   0.031|
|factor(year)1951 |  -23.910|    29.192|    -0.819|   0.414|
|factor(year)1952 |  -23.914|    29.725|    -0.805|   0.422|
|factor(year)1953 |   -5.127|    30.546|    -0.168|   0.867|
|factor(year)1954 |  -40.105|    31.492|    -1.273|   0.205|
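
Instead of adding `factor(year)` by hand, `plm()` also accepts `effect = "twoways"`. A minimal sketch, assuming the balanced Grunfeld panel, in which both routes are expected to give the same slope for `capital`; the object names are made up for this sketch.

```r
library(plm)
data("Grunfeld", package = "plm")

# Two-way within estimator: firm and year effects are both swept out
fe_twoways <- plm(inv ~ capital, data = Grunfeld,
                  index = c("firm", "year"),
                  effect = "twoways", model = "within")

# Firm fixed effects plus explicit year dummies, as in the slides
fe_dummies <- plm(inv ~ capital + factor(year), data = Grunfeld,
                  index = c("firm", "year"),
                  effect = "individual", model = "within")

# The two slope estimates for capital, side by side
c(twoways = unname(coef(fe_twoways)["capital"]),
  dummies = unname(coef(fe_dummies)["capital"]))
```

The two-way call keeps the coefficient table short because the year effects are not reported as separate coefficients.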

---

# Testing for Time FE

- There is also a possibility to test whether time fixed effects are needed. 
- The null hypothesis is that the coefficients are together zero for all years and hence no time fixed effects need to be taken into account. 
- For the within model this can be tested with function `pFtest()` which we have already used for testing for the presence of individual fixed effects  
- The model with time FE is compared to the one without. 
- You may also use the LSDV model and test the joint hypothesis that all coefficients of variable year are together zero. The results are identical.

---

```r
# Within model
pFtest(fe_time_plm, fe_model_plm) 
```

```
## 
## 	F test for individual effects
## 
## data:  inv ~ capital + factor(year)
## F = 1.594, df1 = 19, df2 = 170, p-value = 0.06242
## alternative hypothesis: significant effects
```

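- At the 5% level the null hypothesis cannot be rejected (p-value `$>0.05$`), so there is no strong evidence that time fixed effects are needed; the p-value of 0.062 is significant only at the 10% level.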
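
The LSDV version of this test can be sketched by comparing two nested `lm()` fits, one with and one without the year dummies. The object names are assumptions made for this sketch.

```r
data("Grunfeld", package = "plm")

# LSDV with firm dummies only (restricted: no time effects)
lsdv_firm <- lm(inv ~ capital + factor(firm), data = Grunfeld)

# LSDV with firm and year dummies (unrestricted)
lsdv_firm_year <- lm(inv ~ capital + factor(firm) + factor(year),
                     data = Grunfeld)

# Joint F test of the 19 year dummies
f_years <- anova(lsdv_firm, lsdv_firm_year)
f_years
```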

---

# Testing for Random Effects

- The random effects model assumes that the unobserved entity-specific heterogeneity is random and incorporates its effect into the model by exploiting its distribution. Therefore, the random effect is measured by the variance of the individual effects `$\mu_{i}$` or the time effects `$\lambda_{t}$`.

- Consider the following model `$$y_{it}=x_{it}^{\prime}\beta+u_{it}$$`
where `$u_{it}=\mu_{i}+\epsilon_{it}$`

- In a random effects model, `$\mu_{i}$` is assumed to be random and `$u_{it}$` is a composite error.
- The test for random effect involves the following null and alternative:
`$$H_{0}:\sigma_{\mu}^{2}=0$$` `$$H_{1}:\sigma_{\mu}^{2}\neq 0$$` Here, `$\sigma_{\mu}^{2}$` is the variance of the distribution of unobserved random effect.
- To test this hypothesis, we can use the Lagrange multiplier (LM) test developed by Breusch and Pagan (1980).

---

- The Breusch-Pagan Lagrange multiplier (LM) Test helps to decide between a random effects model and a simple OLS regression. 
- This test is implemented in function `plmtest()` with the null hypothesis that the variance across entities is zero. In this setting this means that there are no significant differences across firms (no panel effect).

```r
plmtest(pooled_ols, effect = "individual", type = c("bp"))
```

```
## 
## 	Lagrange Multiplier Test - (Breusch-Pagan)
## 
## data:  inv ~ capital
## chisq = 1285.1, df = 1, p-value < 2.2e-16
## alternative hypothesis: significant effects
```

- The test shows that there are significant differences across firms. Running a pooled OLS regression is thus not appropriate and the RE model is the better choice.
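
The statistic can also be computed by hand from the pooled OLS residuals. The sketch below uses the standard Breusch-Pagan formula for a balanced one-way panel, `$LM=\frac{NT}{2(T-1)}\left[\frac{\sum_{i}(\sum_{t}e_{it})^{2}}{\sum_{i}\sum_{t}e_{it}^{2}}-1\right]^{2}$`; the formula is brought in as an assumption (it is not written out in the slides) and the object names are made up.

```r
data("Grunfeld", package = "plm")

# Pooled OLS residuals
e <- residuals(lm(inv ~ capital, data = Grunfeld))

n_firms <- length(unique(Grunfeld$firm))   # N
n_years <- length(unique(Grunfeld$year))   # T

# Sum the residuals within each firm, then compare with the total
firm_sums <- tapply(e, Grunfeld$firm, sum)
a_term    <- sum(firm_sums^2) / sum(e^2) - 1

lm_stat <- n_firms * n_years / (2 * (n_years - 1)) * a_term^2
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(chisq = lm_stat, p = p_value)
```

If residuals of the same firm tend to share a sign, the firm sums are large relative to the total sum of squares, which is what drives the statistic up.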

---

## Fixed or Random Effect: Hausman Test

- To decide whether the fixed effects or the random effects model fits a panel better, we use the Hausman (1978) specification test.
- The null hypothesis of this test is that the individual effects are uncorrelated with every regressor in the model.
- In other words, under the null hypothesis of the Hausman (1978) test the preferred model is random effects; the alternative favors fixed effects.
- The Hausman specification test thus examines whether the composite errors `$u_{it}$` are correlated with the regressors.

`$$H_{0}: cov(u_{it},x_{it})=0$$`
`$$H_{1}: cov(u_{it},x_{it})\neq 0$$`

---

- Whenever there is a clear reason to think that individual characteristics of each entity or group are related to the regressors, use fixed effects.
- An example is macroeconomic data collected for many countries over time. There may be good reason to believe that a country's economic performance is affected by its own internal characteristics: type of government, political environment, cultural characteristics, type of public policies, etc.
- Random effects are used whenever there is reason to believe that individual characteristics are unrelated to (uncorrelated with) the regressors.

- The test is implemented in function `phtest()`.

```r
phtest(fe_model_plm, re_model_plm)
```

```
## 
## 	Hausman Test
## 
## data:  inv ~ capital
## chisq = 0.93423, df = 1, p-value = 0.3338
## alternative hypothesis: one model is inconsistent
```

- The null hypothesis cannot be rejected here (p-value `$>0.05$`), so the test gives no evidence against the RE model, which is therefore preferred.
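
With a single regressor the Hausman statistic reduces to a scalar: the squared difference of the FE and RE slopes divided by the difference of their estimated variances. A minimal sketch under that assumption (the object names are made up for illustration):

```r
library(plm)
data("Grunfeld", package = "plm")

fe <- plm(inv ~ capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")
re <- plm(inv ~ capital, data = Grunfeld,
          index = c("firm", "year"), model = "random")

# Difference of the slope estimates and of their estimated variances
b_diff <- coef(fe)["capital"] - coef(re)["capital"]
v_diff <- vcov(fe)["capital", "capital"] - vcov(re)["capital", "capital"]

# Scalar Hausman statistic, chi-squared with 1 degree of freedom under H0
h_stat  <- unname(b_diff^2 / v_diff)
p_value <- pchisq(h_stat, df = 1, lower.tail = FALSE)
c(chisq = h_stat, p = p_value)
```

The statistic is small here because the two slopes are almost identical, even though the variance difference in the denominator is also tiny.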

---

# Testing for Heteroskedasticity

- Testing for the presence of heteroskedasticity is also possible in panel settings. The null hypothesis of the Breusch-Pagan test is homoskedasticity. The test is implemented in function `bptest()` in package `lmtest`.

```r
lmtest::bptest(inv ~ capital + factor(firm), 
               studentize = FALSE, data = Grunfeld)
```

```
## 
## 	Breusch-Pagan test
## 
## data:  inv ~ capital + factor(firm)
## BP = 386.81, df = 10, p-value < 2.2e-16
```

- There is strong evidence for the presence of heteroskedasticity. Hence, the use of robust standard errors is advised.

---

# Testing for Serial Correlation

- Since the Grunfeld dataset is “20 years long,” a test for serial correlation of the residuals should be performed. Serial correlation typically leads to underestimated standard errors (too small) and an overstated `$R^{2}$` (too large).
- The Breusch-Godfrey/Wooldridge test for serial correlation in panel models is implemented in function `pbgtest()` with the null hypothesis that there is no serial correlation.

```r
pbgtest(fe_model_plm)
```

```
## 
## 	Breusch-Godfrey/Wooldridge test for serial correlation in panel models
## 
## data:  inv ~ capital
## chisq = 73.785, df = 20, p-value = 4.338e-08
## alternative hypothesis: serial correlation in idiosyncratic errors
```

- There is strong evidence that the residuals are serially correlated.
---

- If the error terms of different observations from the same entity are correlated, the standard errors have to be adjusted. In the Grunfeld dataset each firm is observed at 20 different time points, so the observations for the same firm are not independent.
- This leads to the previously described problem of serial correlation of the residuals. To address it, clustered standard errors should be used. Both conventional OLS and heteroskedasticity-robust standard errors are unreliable here because they assume that the error terms `$u_{i,t}$` are not serially correlated.
- Hence they typically underestimate the true sampling uncertainty (correlated errors carry less independent information than the same number of independent ones). Clustered standard errors estimate the variance of the coefficients when the data are independent across entities but correlated within an entity.

- This issue is not restricted to panel data and can also occur in cross-sectional studies, if the data contains clusters of observations and their error terms are correlated within but not between the clusters (regions, schools, branches, …).

---

- In order to correct the standard errors function `vcovHC()` is used which originates from package `sandwich` but is also available in package `plm`. 
- The argument `cluster =` is set to "group" and the argument `type=` controls the estimation type of the standard errors. 
- I am using "sss" which includes the small sample correction method as applied by Stata (so it is easy to check the results with another software).

```r
# OLS standard error
coeftest(pooled_ols)[2,c(1,2)]
```

```
##   Estimate Std. Error 
##  0.4772241  0.0383394
```

```r
# Cluster robust standard error
coeftest(pooled_ols,
         vcov = vcovHC(pooled_ols,
                       type = "sss", 
                       cluster = "group"))[2,c(1,2)]  
```

```
##   Estimate Std. Error 
##  0.4772241  0.1330129
```

---

```r
# FE standard error
coeftest(fe_model_plm)[,c(1,2)]
```

```
##   Estimate Std. Error 
## 0.37074963 0.01936761
```

```r
# Cluster robust standard error
coeftest(fe_model_plm, 
         vcov = vcovHC(fe_model_plm,
                       type = "sss",
                       cluster = "group"))[,c(1,2)]
```

```
##   Estimate Std. Error 
## 0.37074963 0.06494565
```

---

```r
# RE standard error
coeftest(re_model_plm)[2,c(1,2)]
```

```
##   Estimate Std. Error 
## 0.37212019 0.01931563
```

```r
# Cluster robust standard error
coeftest(re_model_plm, 
         vcov = vcovHC(re_model_plm,
                       type = "sss",
                       cluster = "group"))[2,c(1,2)]
```

```
##   Estimate Std. Error 
## 0.37212019 0.06580301
```