For this assignment I wanted to examine a very simple panel dataset that would let me flesh out the theory of panel effects models: fixed, random, and time effects. I take a deep dive into the model specification and the modeling decisions.
The causal question I am considering is whether, within the ten firms in the dataset, investment and capital cause an increase in firm value, and if so, in what way. While economic theory does predict such returns, an economist would want to eliminate unobserved heterogeneity and arrive at consistent, unbiased coefficients that explain how investment and capital can increase a firm’s value.
rm(list = ls())
library(foreign)
library(car)
library(gplots)
library(plm)
data("Grunfeld", package = "plm")
Panel <- pdata.frame(Grunfeld, c("firm","year")) # set panel structure
summary(Panel)
      firm         year          inv              value            capital
 1      :20   1935   : 10   Min.   :   0.93   Min.   :  58.12   Min.   :   0.80
 2      :20   1936   : 10   1st Qu.:  33.56   1st Qu.: 199.97   1st Qu.:  79.17
 3      :20   1937   : 10   Median :  57.48   Median : 517.95   Median : 205.60
 4      :20   1938   : 10   Mean   : 145.96   Mean   :1081.68   Mean   : 276.02
 5      :20   1939   : 10   3rd Qu.: 138.04   3rd Qu.:1679.85   3rd Qu.: 358.10
 6      :20   1940   : 10   Max.   :1486.70   Max.   :6241.70   Max.   :2226.30
 (Other):80   (Other):140
# the same data set of 10 firms over 20 years as in last class' activity
formula <-value~inv +capital
# plot of all pairwise correlations / relationships:
plot(Panel)
In this work I will show how the theory comes into play in empirical work. The theory suggests that fixed effects models will perform better than pooled OLS when firms differ in unobserved ways. The pooled model ignores firm-level and year-level variation: it is agnostic to the panel structure and treats every data point equally. If the unobserved heterogeneity is correlated with the regressors, pooled OLS is not only inefficient but also inconsistent, and it can even get the sign of a coefficient wrong, i.e. predict the wrong trend. The Fixed Effects (FE) and Random Effects (RE) models do take unobserved heterogeneity into account and are therefore more reliable. Let’s observe the heterogeneity:
# error bars show 95% confidence interval
plotmeans(value ~ firm, main="Heterogeneity across firms", data=Panel)
plotmeans(value ~ year, main="Heterogeneity across years", data=Panel)
ols<-lm(value~inv +capital, data = Panel)
summary(ols)
Call:
lm(formula = value ~ inv + capital, data = Panel)
Residuals:
Min 1Q Median 3Q Max
-2010.54 -339.50 -184.08 76.66 2707.84
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 410.8156 64.1419 6.405 1.08e-09 ***
inv 5.7598 0.2909 19.803 < 2e-16 ***
capital -0.6153 0.2095 -2.937 0.00371 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 666.5 on 197 degrees of freedom
Multiple R-squared: 0.7455, Adjusted R-squared: 0.7429
F-statistic: 288.5 on 2 and 197 DF, p-value: < 2.2e-16
While the coefficients are statistically significant, they might be biased as mentioned above.
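To make the point about a wrong sign concrete, here is a minimal sketch on made-up numbers (the three firms, `x`, `y`, and the baselines in `alpha` are hypothetical, not the Grunfeld data). Within every firm `y` rises one-for-one with `x`, but firms with larger `x` have lower baselines, so the pooled slope comes out negative while the regression with firm dummies recovers the within-firm slope.

```r
# Hypothetical toy panel: 3 firms, 3 periods each (NOT the Grunfeld data).
toy <- data.frame(
  firm = factor(rep(c("A", "B", "C"), each = 3)),
  x    = c(1, 2, 3,  6, 7, 8,  11, 12, 13)
)

# Firm-specific baselines fall as x rises, so they are correlated with x.
alpha <- c(A = 20, B = 5, C = -10)
toy$y <- unname(alpha[as.character(toy$firm)]) + toy$x  # true within-firm slope = 1

# Pooled OLS ignores the firm structure; the dummy regression controls for it.
pooled_slope <- coef(lm(y ~ x, data = toy))["x"]
within_slope <- coef(lm(y ~ x + firm, data = toy))["x"]

print(c(pooled = unname(pooled_slope), within = unname(within_slope)))
```

The toy data are deliberately extreme (the dummy regression fits them exactly); the point is only that an omitted firm effect correlated with a regressor can flip the pooled sign.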
The reason that we are using these models is that we assume some unobserved heterogeneity that is not part of the idiosyncratic error term and is correlated with one or more of the explanatory variables. First differences, FE, or RE can correct for it. For example, in FE we can remove the time-constant firm effect by subtracting each firm's time-averaged values from the original equation, and so achieve an estimate that is not confounded by the unobserved heterogeneity. When this is done, two assumptions are usually required for the estimator to be consistent: strict exogeneity of the regressors conditional on the firm effect, and enough variation over time in each regressor.
We should remember that this model removes everything that is time-constant, which is a caveat: the effect of any time-invariant regressor cannot be estimated.
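Written out for this dataset, the within transformation can be sketched as follows. The symbols \(\alpha_i\) (firm effect), \(\varepsilon_{it}\) (idiosyncratic error), and the bars denoting firm-specific time averages are notation introduced here for illustration.

\[
\begin{aligned}
\text{value}_{it} &= \beta_1\,\text{inv}_{it} + \beta_2\,\text{capital}_{it} + \alpha_i + \varepsilon_{it},\\
\overline{\text{value}}_{i} &= \beta_1\,\overline{\text{inv}}_{i} + \beta_2\,\overline{\text{capital}}_{i} + \alpha_i + \bar{\varepsilon}_{i},\\
\text{value}_{it} - \overline{\text{value}}_{i} &= \beta_1\,(\text{inv}_{it} - \overline{\text{inv}}_{i}) + \beta_2\,(\text{capital}_{it} - \overline{\text{capital}}_{i}) + (\varepsilon_{it} - \bar{\varepsilon}_{i}).
\end{aligned}
\]

Because \(\alpha_i\) is constant over time, it equals its own time average and drops out in the last line. The same happens to any time-constant regressor.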
With the RE model, we assume that the unobserved heterogeneity is uncorrelated with the explanatory variables, i.e. that we have controlled for all the factors that are relevant for the model. In this case pooled OLS is consistent, as is FE, but FE would be heavy guns used when they are not necessary. What pooled OLS does not handle is the serial correlation that the firm effect induces in the errors. To solve this problem we use FGLS, the Feasible Generalized Least Squares estimator, which is the RE estimator. The RE model can be seen as a generalization that sits between pooled OLS (\(\lambda = 0\)) and FE (\(\lambda = 1\)), with \(0 < \lambda < 1\).
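A sketch of how \(\lambda\) enters, under the standard one-way error-components setup for a balanced panel with \(T\) periods (the symbols \(\sigma_\alpha^2\), \(\sigma_\varepsilon^2\), and \(v_{it} = \alpha_i + \varepsilon_{it}\) are notation assumed here): RE estimation amounts to OLS on quasi-demeaned data,

\[
\text{value}_{it} - \lambda\,\overline{\text{value}}_{i}
= \beta_0\,(1-\lambda)
+ \beta_1\,(\text{inv}_{it} - \lambda\,\overline{\text{inv}}_{i})
+ \beta_2\,(\text{capital}_{it} - \lambda\,\overline{\text{capital}}_{i})
+ (v_{it} - \lambda\,\bar{v}_{i}),
\qquad
\lambda = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\,\sigma_\alpha^2}} .
\]

Setting \(\lambda = 0\) leaves the data untouched (pooled OLS), while \(\lambda \to 1\), which happens as \(T\,\sigma_\alpha^2\) grows relative to \(\sigma_\varepsilon^2\), approaches full demeaning (the within estimator). The `plm` summary output reports this weight under the name `theta`.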
We can check the two models, keeping in mind that if the "true" model is RE, the FE estimates are still consistent but inefficient, which means that they approach the true population values more slowly. If we go the other way around and use the RE framework while the true model is FE, we get inconsistent coefficient estimates. This fact is crucial to take into consideration when specifying the model. Remember that random effects are estimated with partial pooling, while fixed effects are not. Another distinction worth mentioning is that the random effects model reports an overall intercept while the within model does not, because the within transformation sweeps out every time-constant term, including the intercept.
fe <- plm(formula, data =Panel, model = 'within')
summary(fe)
Oneway (individual) effect Within Model
Call:
plm(formula = formula, data = Panel, model = "within")
Balanced Panel: n = 10, T = 20, N = 200
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-808.00 -88.60 -7.23 76.20 1370.00
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
inv 2.85617 0.30751 9.2879 < 2.2e-16 ***
capital -0.50787 0.14037 -3.6182 0.0003812 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 23078000
Residual Sum of Squares: 13577000
R-Squared: 0.41169
Adj. R-Squared: 0.37727
F-statistic: 65.7798 on 2 and 188 DF, p-value: < 2.22e-16
pFtest(value~inv +capital, data = Panel,
effect = "individual", model = "within")
F test for individual effects
data: value ~ inv + capital
F = 113.76, df1 = 9, df2 = 188, p-value < 2.2e-16
alternative hypothesis: significant effects
Since the p-value is so small, we reject the null hypothesis of no firm-specific effects: the FE model is preferred to the pooled model.
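The same F statistic can be obtained from two nested `lm` fits, which makes explicit what is being tested: that the nine firm dummies are jointly zero. This is a minimal sketch that assumes the `plm` package is installed (it is used only for the `Grunfeld` data); the object names `pooled_lm`, `lsdv_lm`, and `f_table` are illustrative.

```r
# Load the Grunfeld data shipped with plm.
data("Grunfeld", package = "plm")

# Restricted model: one common intercept (pooled OLS).
pooled_lm <- lm(value ~ inv + capital, data = Grunfeld)

# Unrestricted model: one intercept per firm (least-squares dummy variables).
lsdv_lm <- lm(value ~ inv + capital + factor(firm), data = Grunfeld)

# F test of the restriction that all firm dummies are zero.
f_table <- anova(pooled_lm, lsdv_lm)
print(f_table)
```

The F value and degrees of freedom in the second row should agree with the `pFtest` output above (9 and 188 degrees of freedom).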
firm_year_fe<-lm(value~inv +capital +factor(firm)+ factor(year)-1, data = Panel)
summary(firm_year_fe)
Call:
lm(formula = value ~ inv + capital + factor(firm) + factor(year) -
1, data = Panel)
Residuals:
Min 1Q Median 3Q Max
-760.84 -105.29 4.94 129.14 1042.21
Coefficients:
Estimate Std. Error t value Pr(>|t|)
inv 2.5694 0.3002 8.560 6.65e-15 ***
capital -0.5885 0.1605 -3.666 0.000329 ***
factor(firm)1 2841.3257 147.5440 19.257 < 2e-16 ***
factor(firm)2 778.8056 129.7848 6.001 1.16e-08 ***
factor(firm)3 1602.1249 95.0532 16.855 < 2e-16 ***
factor(firm)4 231.4143 93.7647 2.468 0.014581 *
factor(firm)5 47.2631 103.1136 0.458 0.647283
factor(firm)6 27.0015 93.0555 0.290 0.772045
factor(firm)7 -99.0248 94.7843 -1.045 0.297636
factor(firm)8 299.2420 93.1847 3.211 0.001582 **
factor(firm)9 89.4670 94.5261 0.946 0.345255
factor(firm)10 -245.3673 94.5322 -2.596 0.010274 *
factor(year)1936 304.9451 108.3165 2.815 0.005452 **
factor(year)1937 540.9025 108.6010 4.981 1.55e-06 ***
factor(year)1938 172.6836 108.6597 1.589 0.113881
factor(year)1939 407.6552 108.8791 3.744 0.000248 ***
factor(year)1940 379.2306 108.5158 3.495 0.000606 ***
factor(year)1941 275.7897 108.8553 2.534 0.012201 *
factor(year)1942 128.3897 109.0554 1.177 0.240736
factor(year)1943 260.9976 109.2877 2.388 0.018035 *
factor(year)1944 284.5931 109.2154 2.606 0.009984 **
factor(year)1945 392.8830 109.3179 3.594 0.000427 ***
factor(year)1946 369.6591 109.5954 3.373 0.000922 ***
factor(year)1947 173.2198 111.2191 1.557 0.121231
factor(year)1948 149.5766 112.4949 1.330 0.185432
factor(year)1949 228.7342 114.6859 1.994 0.047712 *
factor(year)1950 269.7332 115.1108 2.343 0.020281 *
factor(year)1951 389.0969 114.4665 3.399 0.000843 ***
factor(year)1952 407.3494 116.5492 3.495 0.000605 ***
factor(year)1953 544.8882 119.5505 4.558 9.88e-06 ***
factor(year)1954 556.8600 123.8315 4.497 1.28e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 241.7 on 169 degrees of freedom
Multiple R-squared: 0.9829, Adjusted R-squared: 0.9798
F-statistic: 313.7 on 31 and 169 DF, p-value: < 2.2e-16
pFtest(value~inv +capital, data = Panel,
effect = "twoways", model = "within")
F test for twoways effects
data: value ~ inv + capital
F = 47.486, df1 = 28, df2 = 169, p-value < 2.2e-16
alternative hypothesis: significant effects
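As a cross-check, `plm` can estimate the two-way within model directly through `effect = "twoways"`, without printing thirty dummy coefficients. The slope estimates should coincide with those of the dummy-variable regression. A minimal sketch (the object names `fe_twoways`, `lsdv_twoways`, and `slopes` are illustrative):

```r
library(plm)
data("Grunfeld", package = "plm")

# Two-way within estimator: firm and year effects are swept out.
fe_twoways <- plm(value ~ inv + capital, data = Grunfeld,
                  index = c("firm", "year"),
                  model = "within", effect = "twoways")

# Equivalent dummy-variable (LSDV) regression.
lsdv_twoways <- lm(value ~ inv + capital + factor(firm) + factor(year),
                   data = Grunfeld)

# Put the slope coefficients of the two fits side by side.
slopes <- rbind(plm  = coef(fe_twoways),
                lsdv = coef(lsdv_twoways)[c("inv", "capital")])
print(slopes)
```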
# plotting:
yhat <- firm_year_fe$fitted.values
scatterplot(yhat~Panel$capital|Panel$firm, boxplots=FALSE, main="regression per firm:" ,xlab="Capital", ylab="yhat",smooth=FALSE)
We see the huge difference between the pooled OLS regression, which would have taken all the data as is without differentiating between firms, and this regression, which is color-coded by firm.
re <- plm(formula, data =Panel, model = 'random')
summary(re)
Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)
Call:
plm(formula = formula, data = Panel, model = "random")
Balanced Panel: n = 10, T = 20, N = 200
Effects:
var std.dev share
idiosyncratic 72217.6 268.7 0.195
individual 298685.7 546.5 0.805
theta: 0.8907
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-614.0 -121.0 -59.6 80.6 1610.0
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 786.90480 182.17147 4.3196 2.477e-05 ***
inv 3.11343 0.30761 10.1212 < 2.2e-16 ***
capital -0.57842 0.14247 -4.0599 7.079e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 26909000
Residual Sum of Squares: 15280000
R-Squared: 0.43217
Adj. R-Squared: 0.42641
F-statistic: 74.9683 on 2 and 197 DF, p-value: < 2.22e-16
We see the tiny p-value of the F-test, indicating that the coefficients are jointly different from 0.
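The `theta` reported in the output can be reproduced by hand from the two variance components, which shows how close the RE estimator sits to the within estimator here. This sketch assumes the usual balanced-panel formula, one minus the square root of the idiosyncratic variance over the idiosyncratic variance plus T times the individual variance; the variable names are illustrative and the inputs are copied from the summary above.

```r
# Variance components copied from the summary(re) output above.
sigma2_idios <- 72217.6    # idiosyncratic variance
sigma2_indiv <- 298685.7   # individual (firm) variance
n_periods    <- 20         # T: years per firm

# Quasi-demeaning weight for a balanced one-way RE model (assumed formula).
theta <- 1 - sqrt(sigma2_idios / (sigma2_idios + n_periods * sigma2_indiv))
print(round(theta, 4))
```

The result should come out at roughly the 0.8907 shown in the output. A weight this close to 1 means the RE transformation removes most of each firm's mean, which is why the RE slopes land near the FE ones.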
phtest(fe, re)
Hausman Test
data: formula
chisq = 2366.7, df = 2, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent
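For reference, the statistic of a Hausman test has the following textbook form (a sketch in notation introduced here: \(\hat\beta_{FE}\) and \(\hat\beta_{RE}\) are the slope estimates of the two models and \(\widehat{\mathrm{Var}}\) their estimated covariance matrices):

\[
H = (\hat\beta_{FE} - \hat\beta_{RE})^{\top}
\left[\widehat{\mathrm{Var}}(\hat\beta_{FE}) - \widehat{\mathrm{Var}}(\hat\beta_{RE})\right]^{-1}
(\hat\beta_{FE} - \hat\beta_{RE})
\;\sim\; \chi^2_{K} \quad \text{under } H_0 ,
\]

with \(K = 2\) slopes here, matching `df = 2` in the output. Under the null hypothesis both estimators are consistent and RE is efficient, so a large \(H\) signals that the two sets of estimates differ systematically.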
Considering the Hausman test, the F-tests and the data, we reject the null hypothesis at the traditional .05 significance level. That is, the fixed effects model, which in practice sets \(\lambda\) to 1, is preferred, and in this case we need to rely on the FE model. The economic interpretation is that each firm has to be considered within itself when estimating returns on capital and investment. In other words, this domain works case by case and there is no magic bullet. We do learn that these predictors have explanatory power, i.e. we can use them to generate useful predictions; however, we will have trouble when we need to predict a specific value per company, because of the variation in returns.
Clearly, this explanatory analysis is not perfect. Internally, there could be hidden variables that would have been relevant to include and that would improve the predictive power of the analysis or refine the causal relationship that we derive. Nevertheless, I think that it is a useful analysis that points at some idiosyncrasy of this domain, which is what makes it so interesting. Externally, I believe that this study can serve as a useful point of reference for similar works: the general lessons regarding the framework and the applicability of the models are generalizable. Different domains have different behaviours and different heterogeneity, so this will have to be taken into consideration.