Dataset Overview


      In this discussion, I will be using the ChickWeight pre-installed dataset. This is a panel dataset which tracks the weights of chicken over a period of 21 days, providing additional information on their diet and other variables.

data <- ChickWeight
head(ChickWeight,n=20)
## Grouped Data: weight ~ Time | Chick
##    weight Time Chick Diet
## 1      42    0     1    1
## 2      51    2     1    1
## 3      59    4     1    1
## 4      64    6     1    1
## 5      76    8     1    1
## 6      93   10     1    1
## 7     106   12     1    1
## 8     125   14     1    1
## 9     149   16     1    1
## 10    171   18     1    1
## 11    199   20     1    1
## 12    205   21     1    1
## 13     40    0     2    1
## 14     49    2     2    1
## 15     58    4     2    1
## 16     72    6     2    1
## 17     84    8     2    1
## 18    103   10     2    1
## 19    122   12     2    1
## 20    138   14     2    1
summary(ChickWeight)
##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506
freq <- table(ChickWeight$Chick)
print(freq)
## 
## 18 16 15 13  9 20 10  8 17 19  4  6 11  3  1 12  2  5 14  7 24 30 22 23 27 28 
##  2  7  8 12 12 12 12 11 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 
## 26 25 29 21 33 37 36 31 39 38 32 40 34 35 44 45 43 41 47 49 46 50 42 48 
## 12 12 12 12 12 12 12 12 12 12 12 12 12 12 10 12 12 12 12 12 12 12 12 12


      According to some basic EDA that can be seen above, we confirm that the dataset is indeed a panel, however it is unbalanced. We can say that the dataset is unbalanced, since when we look at the unique observations table, Chicks 18,16, and 15 were all observed less than 12 times. This means that they have missing data for some time periods, making the panel unbalanced.

Chick18 <- data[data$Chick==18,]
print(Chick18)
## Grouped Data: weight ~ Time | Chick
##     weight Time Chick Diet
## 195     39    0    18    1
## 196     35    2    18    1
Chick16 <- data[data$Chick==16,]
print(Chick16)
## Grouped Data: weight ~ Time | Chick
##     weight Time Chick Diet
## 176     41    0    16    1
## 177     45    2    16    1
## 178     49    4    16    1
## 179     51    6    16    1
## 180     57    8    16    1
## 181     51   10    16    1
## 182     54   12    16    1
Chick15 <- data[data$Chick==15,]
print(Chick15)
## Grouped Data: weight ~ Time | Chick
##     weight Time Chick Diet
## 168     41    0    15    1
## 169     49    2    15    1
## 170     56    4    15    1
## 171     64    6    15    1
## 172     68    8    15    1
## 173     68   10    15    1
## 174     67   12    15    1
## 175     68   14    15    1


      An interesting trend emerges, where three of the Chicks who were on Diet 1 dropped out before the end of their observation. It is likely that they died, and therefore were not observed for the remaining periods. The death hypothesis is supported by the fact that the weights of the Chickens either decreased or stagnated during their last life periods, indicating illness or some other issues.
     Dealing with these missing observations could be complicated. If we drop them, and the reason why the Chicken are dying is because of their diet, it could overestimate the beta coefficient on Diet 1, since I am getting rid of the most negative observations. On the other hand, with so many of the other chicken on Diet 1 surviving, it is unlikely that the diet is the cause for their illness, and so I will drop all of the incomplete observations, balancing the dataset.

Regressions

data2 <- data %>% mutate(Diet1 = ifelse(Diet == "1", 1, 0),
                                       Diet2 = ifelse(Diet == "2", 1, 0),
                                       Diet3 = ifelse(Diet == "3", 1, 0),
                                       Diet4 = ifelse(Diet == "4", 1, 0))

ChickWBinary <- data %>% mutate(Diet1 = ifelse(Diet == "1", 1, 0),
                                       Diet2 = ifelse(Diet == "2", 1, 0),
                                       Diet3 = ifelse(Diet == "3", 1, 0),
                                       Diet4 = ifelse(Diet == "4", 1, 0))

ChickWBinary <- ChickWBinary[ChickWBinary$Chick!=18,]
ChickWBinary <- ChickWBinary[ChickWBinary$Chick!=16,]
ChickWBinary <- ChickWBinary[ChickWBinary$Chick!=15,]


head(data)
## Grouped Data: weight ~ Time | Chick
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
OLS_lm <- lm(weight~Diet1 + Diet2 + Diet3+Time,data=ChickWBinary)

OLS_lm2 <- lm(weight~Diet1 + Diet2 + Diet3+Time,data=data2)


fixed_effects_plm <- plm(weight ~ Diet1 + Diet2 + Diet3 + Time,data=ChickWBinary,index="Chick",model="within")

rand_effects_plm <- plm(weight ~ Diet1 + Diet2 + Diet3 + Time,data=ChickWBinary,index="Chick",model="random")

stargazer(OLS_lm,OLS_lm2,fixed_effects_plm,rand_effects_plm,type="text")
## 
## =============================================================================================================
##                                                        Dependent variable:                                   
##                     -----------------------------------------------------------------------------------------
##                                                              weight                                          
##                                            OLS                                         panel                 
##                                                                                       linear                 
##                               (1)                      (2)                       (3)                 (4)     
## -------------------------------------------------------------------------------------------------------------
## Diet1                      -29.463***               -30.233***                                    -29.233*** 
##                             (4.178)                  (4.107)                                       (9.953)   
##                                                                                                              
## Diet2                      -14.077***               -14.067***                                     -13.816   
##                             (4.679)                  (4.667)                                       (11.166)  
##                                                                                                              
## Diet3                        6.256                    6.266                                         6.518    
##                             (4.679)                  (4.667)                                       (11.166)  
##                                                                                                              
## Time                        8.811***                 8.750***                  8.794***            8.796***  
##                             (0.226)                  (0.222)                   (0.176)             (0.176)   
##                                                                                                              
## Constant                   40.512***                41.158***                                     40.405***  
##                             (4.113)                  (4.083)                                       (8.122)   
##                                                                                                              
## -------------------------------------------------------------------------------------------------------------
## Observations                  561                      578                       561                 561     
## R2                           0.744                    0.745                     0.829               0.819    
## Adjusted R2                  0.742                    0.744                     0.814               0.818    
## Residual Std. Error    36.088 (df = 556)        35.993 (df = 573)                                            
## F Statistic         404.252*** (df = 4; 556) 419.177*** (df = 4; 573) 2,494.471*** (df = 1; 513) 2,513.981***
## =============================================================================================================
## Note:                                                                             *p<0.1; **p<0.05; ***p<0.01


      As we can see, The OLS produces five coefficients, while the within fixed effects estimator produces only one. This is due to the fact that within estimators get rid of time invariant variables in the demeaning process, and so since all chickens consume the same diet their entire lifespan in this dataset, the diet variables are reduced to zero.
      In the OLS regression, we see that, on average, chicks on diet 1 are 29.463 grams lighter, compared to chicken on diet 4 (the base group), holding all else constant. Furthermore, chicks on diet 2 are, on average 14.077 grams ligher, relative to the chicken on diet 4, holding all else constant. Both of these results are statistically significant, and since we can assume that diets were assigned randomly, we can be fairly confident about the size and the direction of the coefficients.
      To make sure our dropping of the incomplete observations did not drastically skew the results, I run a second OLS regression (2), which uses data including the incomplete observations. As we can see, the results are practically identical, and so dropping the incomplete observations did not cause any serious issues.
      The Fixed Effects regression shows only a single coefficient, and that is for the time variable. However, the results are very similar to the OLS results, which indicates that the Time variable is almost completely uncorrelated with time invariant variables not included in the model. This is perhapse an indication that FE might not be the best model to use. Although the R squared is about eight percentage points higher in the FE model, it comes at the cost of not being able to estimate all of the Diet variables and lacking an intercept coefficient. Regardless of how we calculate the FE model, the issue would remain the same, due to the fact that when using the within estimator, in the demeaning process time invariant variables dissapear. Perhapse if we included individual level binary variables, this issue might be alleviated a little bit, but that runs into issues of its own.
      To alleviate these issues, I also included model 4, which is a random effects model. RE, unlike FE, does not fully demean the variables in the dataset, only quasi-demeaning them, which allows it to also estimate slope coefficients for time invariant variables. The condition that needs to be met with RE is that now instead of the RHS variables only having to be uncorrelated with time variant variables in the error term, they now have to be uncorrelated with both time variant and time invariant variables in the error term. This combination is called the composite error, and is denoted \(v_{it}\), where \(v_{it}\) = \(\alpha_{i}\) + \(u_{it}\), where \(\alpha_{i}\) represents the time invariant individual characteristics and \(u_{it}\) represents the traditional error term. Luckily, thanks to the fact that the diets were selcted randomly, the RE model meets the necessary conditions and would be a decent replacement for the FE model, if we wanted to include the time invariant variables in the model.