Discussion Post #2

Part I.
Dataset 1 ~ ChickWeight

Summary Statistics and the Dataset
     The first dataset I chose is ChickWeight. It tracks fifty chicks over 21 days, during which each chick is weighed up to twelve times.
     The dataset has four variables: weight, Time, Chick, and Diet. weight is a quantitative variable giving the body weight of each bird, in grams, at a given measurement. Time is the number of days since the chick’s birth. Chick is a unique identifier that attaches a single number to each chick. Finally, Diet is a categorical variable with levels one through four that identifies which diet the chick is on.
     Since there are multiple chicks, each observed over several periods, we are working with a panel dataset.

head(ChickWeight,n=20)

##    weight Time Chick Diet
## 1      42    0     1    1
## 2      51    2     1    1
## 3      59    4     1    1
## 4      64    6     1    1
## 5      76    8     1    1
## 6      93   10     1    1
## 7     106   12     1    1
## 8     125   14     1    1
## 9     149   16     1    1
## 10    171   18     1    1
## 11    199   20     1    1
## 12    205   21     1    1
## 13     40    0     2    1
## 14     49    2     2    1
## 15     58    4     2    1
## 16     72    6     2    1
## 17     84    8     2    1
## 18    103   10     2    1
## 19    122   12     2    1
## 20    138   14     2    1

summary(ChickWeight)

##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

As we can see from the snapshot of the dataset and the summary statistics, there are 578 observations in this dataset. If each of the fifty chicks had been weighed twelve times, there would be 600 observations, so 22 are missing. This discrepancy indicates that this is an unbalanced panel dataset. The reason for the missing observations is unclear, though it could be caused by some chicks dying before the end of their observation period or simply getting lost.
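The imbalance can also be checked directly rather than inferred from the totals. The following is a minimal sketch using only base R; the object names (obs_per_chick, panel_lengths, incomplete, missing_total) are illustrative and not part of the original analysis.

```r
# Count how many times each chick was weighed
# (ChickWeight ships with base R's datasets package)
obs_per_chick <- table(ChickWeight$Chick)

# Distribution of panel lengths: how many chicks have 12 weigh-ins, 11, and so on
panel_lengths <- table(obs_per_chick)
print(panel_lengths)

# Chicks with fewer than twelve weigh-ins
incomplete <- obs_per_chick[obs_per_chick < 12]
print(incomplete)

# Shortfall relative to a balanced panel of twelve weigh-ins per chick
missing_total <- nlevels(ChickWeight$Chick) * 12 - nrow(ChickWeight)
print(missing_total)
```

Any chick listed in incomplete dropped out before day 21, which is what makes the panel unbalanced.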

Graphs
Since there are fifty different chicks, a graph that draws each of them as a separate line is hard to read. To alleviate this issue, I also plotted the relationship between time and weight with the chicks grouped by their diets. As we can see, chicks on diets three and four show the strongest growth.

library(dplyr)
library(ggplot2)
ggplot(data=ChickWeight)+geom_line(mapping=aes(x=Time, y=weight,color=Chick))+labs(y="The Weight of a Chicken",x="Time Since Birth of a Chicken", title="The Weight of Each Chicken Over Time")

ggplot(data=ChickWeight)+geom_smooth(mapping=aes(x=Time, y=weight,color=Diet))+labs(y="The Weight of a Chicken",x="Time Since Birth of a Chicken", title="The Impact of Different Diets on the Weight of a Chicken")
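As a rough numerical companion to the smoothed curves, the average weight on the last measurement day can be compared across diets. This is only a sketch: it uses base R, the names final_day and mean_by_diet are illustrative, and it looks only at chicks that were still being weighed on day 21.

```r
# Keep only the final weigh-in (day 21); as.data.frame drops the groupedData class
final_day <- subset(as.data.frame(ChickWeight), Time == 21)

# Average final weight (grams) for each diet
mean_by_diet <- tapply(final_day$weight, final_day$Diet, mean)
print(round(mean_by_diet, 1))
```

Because chicks that dropped out early are excluded, these averages describe survivors only and should be read alongside the plot, not in place of it.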

Regressions
To see the whole picture, I ran two regressions to estimate the impact that the different diets have on the growth of the chicks. To do that, I created four binary variables, each indicating whether a given chick was fed that diet. I then ran a pooled OLS model (right) and a Random Effects model (left). I chose Random Effects over Fixed Effects because the diet variables are time-invariant, so a Fixed Effects model would not identify their coefficients. The Random Effects model is also unlikely to face endogeneity issues: the diets were most likely assigned randomly, so the diet variables are unlikely to be correlated with the composite error term, and the zero conditional mean (ZCM) assumption that RE models require is unlikely to be violated.
In the OLS model, Time, Diet1, and Diet2 are the only statistically significant regressors. Diet1 and Diet2 both have negative coefficients, indicating that chicks on these diets tend to be lighter than chicks on Diet4, which is the base group. The Diet3 coefficient is not statistically significant, so we can’t say that Diet3 differs from Diet4; however, we can see that birds consuming Diet1 and Diet2 are at a disadvantage relative to Diet4. The Random Effects model gives very similar point estimates, except that the Diet2 coefficient becomes statistically insignificant. This change is unsurprising. The RE model allows for individual characteristics of the chicks, such as genetic conditions or a general propensity for illness, that affect weight and growth, so repeated observations of the same chick are no longer treated as independent. The Diet2 estimate barely moves (-13.80 versus -14.07), but its confidence interval widens, ergo the drop in significance.

library(dplyr)
library(plm)
library(sjPlot)
library(sjmisc)
library(sjlabelled)

ChickWBinary <- ChickWeight %>% mutate(Diet1 = ifelse(Diet == "1", 1, 0),
                                       Diet2 = ifelse(Diet == "2", 1, 0),
                                       Diet3 = ifelse(Diet == "3", 1, 0),
                                       Diet4 = ifelse(Diet == "4", 1, 0))
OLS_lm <- lm(weight~Diet1 + Diet2 + Diet3+Time,data=ChickWBinary)

rand_effects_plm <- plm(weight ~ Diet1 + Diet2 + Diet3 + Time,data=ChickWBinary,index="Chick",model="random")

tab_model(rand_effects_plm, OLS_lm)

	weight (Random Effects)			weight (OLS)
Predictors	Estimates	CI	p	Estimates	CI	p
(Intercept)	41.26	25.53 – 56.99	<0.001	41.16	33.14 – 49.18	<0.001
Diet1	-30.01	-48.81 – -11.21	0.002	-30.23	-38.30 – -22.17	<0.001
Diet2	-13.80	-35.41 – 7.81	0.210	-14.07	-23.23 – -4.90	0.003
Diet3	6.53	-15.08 – 28.14	0.553	6.27	-2.90 – 15.43	0.180
Time	8.72	8.37 – 9.06	<0.001	8.75	8.31 – 9.19	<0.001
Observations	578			578
R² / R² adjusted	0.813 / 0.811			0.745 / 0.744
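The hand-built dummies above are one way to set Diet4 as the base group. An alternative sketch, assuming the same pooled OLS specification, is to keep Diet as a factor and change its reference level; the names chick_df and ols_factor are illustrative.

```r
# Plain data frame copy so the groupedData class does not interfere
chick_df <- as.data.frame(ChickWeight)

# Make Diet 4 the reference level, mirroring the omitted Diet4 dummy
chick_df$Diet <- relevel(chick_df$Diet, ref = "4")

# R expands the factor into Diet1, Diet2, Diet3 contrasts automatically
ols_factor <- lm(weight ~ Diet + Time, data = chick_df)
print(round(coef(ols_factor), 2))
```

The coefficients should match the OLS column of the table, since the factor contrasts encode the same comparison against Diet4.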

Dataset 2 ~ trees

Summary Statistics and the Dataset
The second dataset I chose, trees, records the girth, height, and volume of 31 felled black cherry trees. All three variables are quantitative, and their ranges can be seen in the summary statistics. Since each tree is observed only once, this is a good example of a cross-sectional dataset.

head(trees,n=20)

##    Girth Height Volume
## 1    8.3     70   10.3
## 2    8.6     65   10.3
## 3    8.8     63   10.2
## 4   10.5     72   16.4
## 5   10.7     81   18.8
## 6   10.8     83   19.7
## 7   11.0     66   15.6
## 8   11.0     75   18.2
## 9   11.1     80   22.6
## 10  11.2     75   19.9
## 11  11.3     79   24.2
## 12  11.4     76   21.0
## 13  11.4     76   21.4
## 14  11.7     69   21.3
## 15  12.0     75   19.1
## 16  12.9     74   22.2
## 17  12.9     85   33.8
## 18  13.3     86   27.4
## 19  13.7     71   25.7
## 20  13.8     64   24.9

summary(trees)

##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00

Graphs
The most natural way to graph cross-sectional data is with scatter plots. Underneath, you will see two charts, one showing the relationship between the girth of a tree and its height and the other showing the relationship between the volume of a tree and its height. The results are pretty unsurprising. A tree’s height is positively correlated with both its girth and its volume.

library(ggplot2)
ggplot(data=trees,aes(y=Height, x=Girth)) + geom_point() + geom_smooth(method=lm, se=FALSE)+labs(y="Height of a Tree",x="Girth of a Tree", title="The Relationship Between Tree Height and Girth")

ggplot(data=trees,aes(y=Height, x=Volume))+ geom_point() + geom_smooth(method=lm,se=FALSE)+labs(y="Height of a Tree",x="Volume of a Tree", title="The Relationship Between Tree Height and Volume")
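The positive relationships in the two scatter plots can be summarized with Pearson correlation coefficients. A minimal sketch in base R follows; the variable names cor_height_girth and cor_height_volume are illustrative.

```r
# Pearson correlations behind the two scatter plots
# (trees ships with base R's datasets package)
cor_height_girth  <- cor(trees$Height, trees$Girth)
cor_height_volume <- cor(trees$Height, trees$Volume)

print(c(girth = cor_height_girth, volume = cor_height_volume))
```

Both values should be positive, in line with the upward-sloping fitted lines.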

Part II.

      In statistics, variance measures how far a variable’s values fall, on average, from its mean, in squared units. The higher the variable’s variance, the more “spread out” the distribution of said variable will be around the mean. Covariance indicates the direction of the linear relationship between two variables; a positive covariance indicates a positive relationship, while a negative covariance indicates a negative relationship.
      The basic question that a beta coefficient tries to answer is: what is the expected change in y given a one-unit increase in x? Since Var(x) gives us the spread of x and Cov(x,y) gives us the direction of the relationship between x and y, dividing Cov(x,y) by Var(x) yields a coefficient expressed in units of y per unit of x. So, we are normalizing the co-movement of x and y by the spread of x.
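      One way to see why this ratio works, sketched for the one-regressor case with sample moments: the OLS slope is a ratio of sums of deviations from the means, and scaling the numerator and denominator by the same factor turns them into the sample covariance and variance. The hat notation and the symbols for the sample means are introduced here for illustration.

```latex
\begin{aligned}
\hat{\beta}_1
  &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
          {\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
  &= \frac{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
          {\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
   = \frac{\widehat{\mathrm{Cov}}(x,y)}{\widehat{\mathrm{Var}}(x)}
\end{aligned}
```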
      To demonstrate that Cov(x,y)/Var(x) is identical to the slope coefficient of an OLS model, I will return to the regressions I ran with my first dataset. To make things easier, I will only use one dependent and one independent variable.

OLS_lmSimpl <- lm(weight~Time,data=ChickWeight)
tab_model(OLS_lmSimpl)

	weight
Predictors	Estimates	CI	p
(Intercept)	27.47	21.50 – 33.43	<0.001
Time	8.80	8.33 – 9.27	<0.001
Observations	578
R² / R² adjusted	0.701 / 0.700

According to the results of this regression, each additional day since birth is associated with an average weight gain of about 8.8 grams. Now, let’s compare that result to a manually calculated coefficient.

Beta1 <- cov(ChickWeight$Time,ChickWeight$weight)/var(ChickWeight$Time)
print(Beta1)

## [1] 8.803039

As we can see, the two results agree (the 8.80 in the regression table is 8.803039 after rounding), confirming that dividing Cov(x,y) by Var(x) gives us the OLS slope estimate.
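The comparison can also be made programmatically instead of by eye. The sketch below uses a hypothetical helper, slope_from_cov, which is not part of the original post, and repeats the check on the trees dataset as a second illustration.

```r
# Hypothetical helper: simple-regression slope computed as Cov(x, y) / Var(x)
slope_from_cov <- function(x, y) cov(x, y) / var(x)

# ChickWeight: weight regressed on Time
beta_manual <- slope_from_cov(ChickWeight$Time, ChickWeight$weight)
beta_lm <- unname(coef(lm(weight ~ Time, data = ChickWeight))["Time"])

# all.equal compares within a small numerical tolerance rather than exactly
print(all.equal(beta_manual, beta_lm))

# Same check on the second dataset: Height regressed on Girth
trees_manual <- slope_from_cov(trees$Girth, trees$Height)
trees_lm <- unname(coef(lm(Height ~ Girth, data = trees))["Girth"])
print(all.equal(trees_manual, trees_lm))
```

Note that the identity holds for a regression with a single regressor and an intercept; with several regressors the slope on x is no longer simply Cov(x,y)/Var(x).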

Samuel C. Singer

2023-09-08