In this discussion, I will be using the ChickWeight
pre-installed dataset. This is a panel dataset which tracks the weights
of chicken over a period of 21 days, providing additional information on
their diet and other variables.
data <- ChickWeight
head(ChickWeight,n=20)
## Grouped Data: weight ~ Time | Chick
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
## 7 106 12 1 1
## 8 125 14 1 1
## 9 149 16 1 1
## 10 171 18 1 1
## 11 199 20 1 1
## 12 205 21 1 1
## 13 40 0 2 1
## 14 49 2 2 1
## 15 58 4 2 1
## 16 72 6 2 1
## 17 84 8 2 1
## 18 103 10 2 1
## 19 122 12 2 1
## 20 138 14 2 1
summary(ChickWeight)
## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
freq <- table(ChickWeight$Chick)
print(freq)
##
## 18 16 15 13 9 20 10 8 17 19 4 6 11 3 1 12 2 5 14 7 24 30 22 23 27 28
## 2 7 8 12 12 12 12 11 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
## 26 25 29 21 33 37 36 31 39 38 32 40 34 35 44 45 43 41 47 49 46 50 42 48
## 12 12 12 12 12 12 12 12 12 12 12 12 12 12 10 12 12 12 12 12 12 12 12 12
According to some basic EDA that can be seen above, we
confirm that the dataset is indeed a panel, however it is unbalanced. We
can say that the dataset is unbalanced, since when we look at the unique
observations table, Chicks 18,16, and 15 were all observed less than 12
times. This means that they have missing data for some time periods,
making the panel unbalanced.
Chick18 <- data[data$Chick==18,]
print(Chick18)
## Grouped Data: weight ~ Time | Chick
## weight Time Chick Diet
## 195 39 0 18 1
## 196 35 2 18 1
Chick16 <- data[data$Chick==16,]
print(Chick16)
## Grouped Data: weight ~ Time | Chick
## weight Time Chick Diet
## 176 41 0 16 1
## 177 45 2 16 1
## 178 49 4 16 1
## 179 51 6 16 1
## 180 57 8 16 1
## 181 51 10 16 1
## 182 54 12 16 1
Chick15 <- data[data$Chick==15,]
print(Chick15)
## Grouped Data: weight ~ Time | Chick
## weight Time Chick Diet
## 168 41 0 15 1
## 169 49 2 15 1
## 170 56 4 15 1
## 171 64 6 15 1
## 172 68 8 15 1
## 173 68 10 15 1
## 174 67 12 15 1
## 175 68 14 15 1
An interesting trend emerges, where three of the Chicks
who were on Diet 1 dropped out before the end of their observation. It
is likely that they died, and therefore were not observed for the
remaining periods. The death hypothesis is supported by the fact that
the weights of the Chickens either decreased or stagnated during their
last life periods, indicating illness or some other issues.
Dealing with these missing observations could be complicated. If we
drop them, and the reason why the Chicken are dying is because of their
diet, it could overestimate the beta coefficient on Diet 1, since I am
getting rid of the most negative observations. On the other hand, with
so many of the other chicken on Diet 1 surviving, it is unlikely that
the diet is the cause for their illness, and so I will drop all of the
incomplete observations, balancing the dataset.
data2 <- data %>% mutate(Diet1 = ifelse(Diet == "1", 1, 0),
Diet2 = ifelse(Diet == "2", 1, 0),
Diet3 = ifelse(Diet == "3", 1, 0),
Diet4 = ifelse(Diet == "4", 1, 0))
ChickWBinary <- data %>% mutate(Diet1 = ifelse(Diet == "1", 1, 0),
Diet2 = ifelse(Diet == "2", 1, 0),
Diet3 = ifelse(Diet == "3", 1, 0),
Diet4 = ifelse(Diet == "4", 1, 0))
ChickWBinary <- ChickWBinary[ChickWBinary$Chick!=18,]
ChickWBinary <- ChickWBinary[ChickWBinary$Chick!=16,]
ChickWBinary <- ChickWBinary[ChickWBinary$Chick!=15,]
head(data)
## Grouped Data: weight ~ Time | Chick
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
OLS_lm <- lm(weight~Diet1 + Diet2 + Diet3+Time,data=ChickWBinary)
OLS_lm2 <- lm(weight~Diet1 + Diet2 + Diet3+Time,data=data2)
fixed_effects_plm <- plm(weight ~ Diet1 + Diet2 + Diet3 + Time,data=ChickWBinary,index="Chick",model="within")
rand_effects_plm <- plm(weight ~ Diet1 + Diet2 + Diet3 + Time,data=ChickWBinary,index="Chick",model="random")
stargazer(OLS_lm,OLS_lm2,fixed_effects_plm,rand_effects_plm,type="text")
##
## =============================================================================================================
## Dependent variable:
## -----------------------------------------------------------------------------------------
## weight
## OLS panel
## linear
## (1) (2) (3) (4)
## -------------------------------------------------------------------------------------------------------------
## Diet1 -29.463*** -30.233*** -29.233***
## (4.178) (4.107) (9.953)
##
## Diet2 -14.077*** -14.067*** -13.816
## (4.679) (4.667) (11.166)
##
## Diet3 6.256 6.266 6.518
## (4.679) (4.667) (11.166)
##
## Time 8.811*** 8.750*** 8.794*** 8.796***
## (0.226) (0.222) (0.176) (0.176)
##
## Constant 40.512*** 41.158*** 40.405***
## (4.113) (4.083) (8.122)
##
## -------------------------------------------------------------------------------------------------------------
## Observations 561 578 561 561
## R2 0.744 0.745 0.829 0.819
## Adjusted R2 0.742 0.744 0.814 0.818
## Residual Std. Error 36.088 (df = 556) 35.993 (df = 573)
## F Statistic 404.252*** (df = 4; 556) 419.177*** (df = 4; 573) 2,494.471*** (df = 1; 513) 2,513.981***
## =============================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As we can see, The OLS produces five coefficients, while
the within fixed effects estimator produces only one. This is due to the
fact that within estimators get rid of time invariant variables in the
demeaning process, and so since all chickens consume the same diet their
entire lifespan in this dataset, the diet variables are reduced to zero.
In the OLS regression, we see that, on average, chicks on
diet 1 are 29.463 grams lighter, compared to chicken on diet 4 (the base
group), holding all else constant. Furthermore, chicks on diet 2 are, on
average 14.077 grams ligher, relative to the chicken on diet 4, holding
all else constant. Both of these results are statistically significant,
and since we can assume that diets were assigned randomly, we can be
fairly confident about the size and the direction of the coefficients.
To make sure our dropping of the incomplete observations did
not drastically skew the results, I run a second OLS regression (2),
which uses data including the incomplete observations. As we can see,
the results are practically identical, and so dropping the incomplete
observations did not cause any serious issues.
The Fixed
Effects regression shows only a single coefficient, and that is for the
time variable. However, the results are very similar to the OLS results,
which indicates that the Time variable is almost completely uncorrelated
with time invariant variables not included in the model. This is
perhapse an indication that FE might not be the best model to use.
Although the R squared is about eight percentage points higher in the FE
model, it comes at the cost of not being able to estimate all of the
Diet variables and lacking an intercept coefficient. Regardless of how
we calculate the FE model, the issue would remain the same, due to the
fact that when using the within estimator, in the demeaning process time
invariant variables dissapear. Perhapse if we included individual level
binary variables, this issue might be alleviated a little bit, but that
runs into issues of its own.
To alleviate these issues, I
also included model 4, which is a random effects model. RE, unlike FE,
does not fully demean the variables in the dataset, only quasi-demeaning
them, which allows it to also estimate slope coefficients for time
invariant variables. The condition that needs to be met with RE is that
now instead of the RHS variables only having to be uncorrelated with
time variant variables in the error term, they now have to be
uncorrelated with both time variant and time invariant variables in the
error term. This combination is called the composite error, and is
denoted \(v_{it}\), where \(v_{it}\) = \(\alpha_{i}\) + \(u_{it}\), where \(\alpha_{i}\) represents the time invariant
individual characteristics and \(u_{it}\) represents the traditional error
term. Luckily, thanks to the fact that the diets were selcted randomly,
the RE model meets the necessary conditions and would be a decent
replacement for the FE model, if we wanted to include the time invariant
variables in the model.