Summary Statistics and the Dataset
The first dataset I chose is the ChickWeight dataset. It tracks fifty chicks
over 21 days, during which each chick is scheduled to be weighed twelve
times.
This dataset has four variables: weight, Time, Chick, and Diet. weight is
a quantitative variable giving the weight of each bird, in grams, at a
given measurement. Time is the number of days since the chick’s birth.
Chick is an identifier variable that attaches a single unique number to
each chick. Finally, Diet is a categorical variable with four levels (one
to four) identifying which diet the chick is on.
Since there are multiple chickens, each observed for numerous
periods, we can confidently say that we are working with a panel
dataset.
head(ChickWeight,n=20)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
## 7 106 12 1 1
## 8 125 14 1 1
## 9 149 16 1 1
## 10 171 18 1 1
## 11 199 20 1 1
## 12 205 21 1 1
## 13 40 0 2 1
## 14 49 2 2 1
## 15 58 4 2 1
## 16 72 6 2 1
## 17 84 8 2 1
## 18 103 10 2 1
## 19 122 12 2 1
## 20 138 14 2 1
summary(ChickWeight)
## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
As we can see from the snapshot of the dataset and the summary statistics, there are 578 observations in this dataset. Since each of the fifty chicks should be observed twelve times, we would expect 50 × 12 = 600 observations, so 22 are missing. This discrepancy means that the panel is unbalanced. The reason for the missing observations is unclear; possible explanations are that some chicks died before the end of the observation period or that some measurements were simply lost.
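One way to check the balance of the panel directly is to count the weighings recorded for each chick. The sketch below uses only base R and the built-in ChickWeight data; the object names (obs_per_chick, expected_obs, missing_obs) are illustrative and not part of the original analysis.

```r
# Count how many times each chick was weighed
obs_per_chick <- table(ChickWeight$Chick)

# A balanced panel would have 12 weighings for every chick
expected_obs <- length(obs_per_chick) * 12
missing_obs <- expected_obs - nrow(ChickWeight)
print(missing_obs)

# Chicks with fewer than 12 weighings are the source of the imbalance
print(obs_per_chick[obs_per_chick < 12])
```

Listing the chicks with short histories shows whether the gaps are concentrated in a few birds, which would be consistent with chicks leaving the study early.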
Graphs
Since there are fifty different chicks, a graph that draws each of them
as a separate line (the first plot) is quite hard to read. To alleviate
this issue, the second plot shows the smoothed relationship between time
and weight with the chicks grouped by their diets. In that plot, chicks
on diets three and four appear to grow the fastest.
library(dplyr)
library(ggplot2)
# One line per chick: an overview of the individual growth paths
ggplot(data=ChickWeight)+geom_line(mapping=aes(x=Time, y=weight, color=Chick))+
  labs(y="The Weight of a Chicken", x="Time Since Birth of a Chicken", title="The Weight of Each Chicken Over Time")
# One smoothed curve per diet
ggplot(data=ChickWeight)+geom_smooth(mapping=aes(x=Time, y=weight, color=Diet))+
  labs(y="The Weight of a Chicken", x="Time Since Birth of a Chicken", title="The Impact of Different Diets on the Weight of a Chicken")
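The visual comparison of diets can be backed up with a small table of group means. This is a minimal sketch in base R, assuming that the average weight on the last measurement day (day 21) is a reasonable summary of growth; mean_by_diet and final_day are illustrative names.

```r
# Average weight for each diet on each measurement day
mean_by_diet <- aggregate(weight ~ Diet + Time,
                          data = as.data.frame(ChickWeight),
                          FUN = mean)

# Keep the last measurement day and sort from heaviest to lightest
final_day <- mean_by_diet[mean_by_diet$Time == 21, ]
print(final_day[order(-final_day$weight), ])
```

Because some chicks drop out before day 21, these averages are computed over the surviving chicks only.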
Regressions
To see the whole picture, I ran two regressions to estimate the impact
that different diets have on the growth of chickens. To do that, I
created four binary variables, each indicating whether a given chicken
was fed the corresponding diet. After completing that, I ran a pooled OLS
model (right) and a Random Effects model (left). I chose a Random Effects
model because a Fixed Effects model cannot estimate coefficients on the
diet variables: they are time-invariant, so the within transformation
removes them. The Random Effects model is unlikely to face endogeneity
issues here, since the diets were most likely assigned randomly. If so,
the diet variables are uncorrelated with the composite error term, and
neither the zero conditional mean (ZCM) assumption nor the requirements
of RE models are violated.
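The claim about time-invariant regressors can be illustrated without fitting a Fixed Effects model. The sketch below applies the within transformation by hand (subtracting each chick's own mean) using base R's ave(); the data frame cw and the *_within columns are illustrative names, not objects from the original analysis.

```r
cw <- as.data.frame(ChickWeight)
cw$Diet1 <- as.numeric(cw$Diet == "1")

# Within (fixed effects) transformation: subtract each chick's own mean
cw$Diet1_within <- cw$Diet1 - ave(cw$Diet1, cw$Chick)
cw$Time_within <- cw$Time - ave(cw$Time, cw$Chick)

# Diet never changes within a chick, so the demeaned dummy is zero everywhere
print(range(cw$Diet1_within))

# Time does vary within a chick, so it survives the transformation
print(range(cw$Time_within))
```

A regressor that is identically zero after demeaning carries no information, which is why its coefficient cannot be estimated under Fixed Effects.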
In the OLS model, we can see that Time, Diet1, and Diet2 are the only
statistically significant regressors. Diet1 and Diet2 both have negative
coefficients, indicating that chicks on these diets tend to be lighter
than the chicks on Diet4, which is the base group. The Diet3 coefficient
is not statistically significant, so we cannot conclude that chicks on
Diet3 differ in weight from those on Diet4; we can only say that birds
consuming Diet1 and Diet2 are disadvantaged relative to the base group.
The Random Effects model gives very similar point estimates, with the
exception that the Diet2 coefficient becomes statistically insignificant
(p = 0.210). The estimate itself barely moves (-13.80 versus -14.07);
what changes is the confidence interval, which becomes much wider. This
is unsurprising. The RE model allows for chick-specific effects, such as
genetic conditions or a general propensity for illness, so repeated
weighings of the same chick are no longer treated as independent
observations. Because diet varies only between chicks, its effect is
estimated less precisely, hence the drop in significance of the Diet2
variable.
library(dplyr)
library(plm)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
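# Create one 0/1 indicator per diet; Diet4 is left out of the models as the base group
ChickWBinary <- ChickWeight %>% mutate(Diet1 = ifelse(Diet == "1", 1, 0),
                                       Diet2 = ifelse(Diet == "2", 1, 0),
                                       Diet3 = ifelse(Diet == "3", 1, 0),
                                       Diet4 = ifelse(Diet == "4", 1, 0))
# Pooled OLS
OLS_lm <- lm(weight ~ Diet1 + Diet2 + Diet3 + Time, data=ChickWBinary)
# Random Effects model with Chick as the individual (panel) index
rand_effects_plm <- plm(weight ~ Diet1 + Diet2 + Diet3 + Time, data=ChickWBinary, index="Chick", model="random")
# Side-by-side table: Random Effects first (left), OLS second (right)
tab_model(rand_effects_plm, OLS_lm)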
| Predictors | Estimates (RE) | CI (RE) | p (RE) | Estimates (OLS) | CI (OLS) | p (OLS) |
|---|---|---|---|---|---|---|
| (Intercept) | 41.26 | 25.53 – 56.99 | <0.001 | 41.16 | 33.14 – 49.18 | <0.001 |
| Diet1 | -30.01 | -48.81 – -11.21 | 0.002 | -30.23 | -38.30 – -22.17 | <0.001 |
| Diet2 | -13.80 | -35.41 – 7.81 | 0.210 | -14.07 | -23.23 – -4.90 | 0.003 |
| Diet3 | 6.53 | -15.08 – 28.14 | 0.553 | 6.27 | -2.90 – 15.43 | 0.180 |
| Time | 8.72 | 8.37 – 9.06 | <0.001 | 8.75 | 8.31 – 9.19 | <0.001 |
| Observations | 578 | | | 578 | | |
| R2 / R2 adjusted | 0.813 / 0.811 | | | 0.745 / 0.744 | | |
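The hand-built dummies with Diet4 omitted should be equivalent to letting R expand a factor whose reference level is diet 4. The sketch below checks this for the OLS model only; cw, DietRef4, fit_dummies, and fit_factor are illustrative names, and the factor-based specification is an alternative I am assuming here, not the one used in the original analysis.

```r
cw <- as.data.frame(ChickWeight)

# Hand-built dummies, with diet 4 left out as the base group
cw$Diet1 <- as.numeric(cw$Diet == "1")
cw$Diet2 <- as.numeric(cw$Diet == "2")
cw$Diet3 <- as.numeric(cw$Diet == "3")
fit_dummies <- lm(weight ~ Diet1 + Diet2 + Diet3 + Time, data = cw)

# Same model using a factor whose first (reference) level is diet 4
cw$DietRef4 <- factor(as.character(cw$Diet), levels = c("4", "1", "2", "3"))
fit_factor <- lm(weight ~ DietRef4 + Time, data = cw)

# The two sets of coefficients should agree up to numerical tolerance
print(all.equal(unname(coef(fit_dummies)), unname(coef(fit_factor))))
```

Using a factor avoids creating the dummy columns by hand and makes the choice of base group explicit in one place.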
Summary Statistics and the Dataset
The second dataset I chose is trees, a dataset recording the girth,
height, and volume of 31 felled black cherry trees. All three variables
are quantitative, and their ranges can be seen in the summary statistics.
Since each tree is measured only once, this is a good example of a
cross-sectional dataset.
head(trees,n=20)
## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
## 7 11.0 66 15.6
## 8 11.0 75 18.2
## 9 11.1 80 22.6
## 10 11.2 75 19.9
## 11 11.3 79 24.2
## 12 11.4 76 21.0
## 13 11.4 76 21.4
## 14 11.7 69 21.3
## 15 12.0 75 19.1
## 16 12.9 74 22.2
## 17 12.9 85 33.8
## 18 13.3 86 27.4
## 19 13.7 71 25.7
## 20 13.8 64 24.9
summary(trees)
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
Graphs
The most natural way to graph cross-sectional data is a scatter plot.
Below are two charts: one showing the relationship between the girth of
a tree and its height, and the other showing the relationship between
the volume of a tree and its height. The results are unsurprising: a
tree’s height is positively correlated with both its girth and its
volume.
library(ggplot2)
ggplot(data=trees,aes(y=Height, x=Girth)) + geom_point() + geom_smooth(method=lm, se=FALSE)+labs(y="Height of a Tree",x="Girth of a Tree", title="The Relationship Between Tree Height and Girth")
ggplot(data=trees,aes(y=Height, x=Volume))+ geom_point() + geom_smooth(method=lm,se=FALSE)+labs(y="Height of a Tree",x="Volume of a Tree", title="The Relationship Between Tree Height and Volume")
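The positive relationships seen in the scatter plots can be quantified with correlation coefficients. A minimal sketch using base R's cor() on the built-in trees data; the variable names are illustrative.

```r
# Pearson correlations between height and the two other measurements
cor_height_girth <- cor(trees$Height, trees$Girth)
cor_height_volume <- cor(trees$Height, trees$Volume)

print(c(height_girth = cor_height_girth,
        height_volume = cor_height_volume))
```

Values closer to 1 indicate a tighter positive linear relationship, so comparing the two numbers shows which scatter plot clusters more closely around its fitted line.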
In statistics, variance measures how far the values of a variable tend
to lie from their mean: it is the average squared deviation from the
mean. The higher a variable’s variance, the more “spread out” its
distribution will be around the mean. Covariance indicates the direction
of the linear relationship between two variables; a positive covariance
indicates a positive relationship, while a negative covariance indicates
a negative relationship.
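These definitions can be checked against R's built-in functions. The sketch below computes the sample variance and covariance by hand, using the n - 1 denominator that var() and cov() use, on two columns of the trees data; x, y, var_manual, and cov_manual are illustrative names.

```r
x <- trees$Girth
y <- trees$Volume
n <- length(x)

# Sample variance: average squared deviation from the mean (n - 1 denominator)
var_manual <- sum((x - mean(x))^2) / (n - 1)

# Sample covariance: average product of paired deviations (n - 1 denominator)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

print(all.equal(var_manual, var(x)))
print(all.equal(cov_manual, cov(x, y)))
```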
The basic question that a beta coefficient tries to answer is: what is
the expected change in y given a one-unit increase in x? Cov(x,y) gives
us the direction and strength of the co-movement between x and y, but it
is measured in units of x times units of y. Var(x) gives us the spread
of x, measured in squared units of x. By dividing Cov(x,y) by Var(x), we
get a coefficient in units of y per unit of x, which is easy to
interpret. So, we are scaling the co-movement of x and y by the spread
of x.
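The intuition above can be made precise with the standard least-squares derivation for a simple regression. The notation here (hats for estimates, bars for sample means, n observations indexed by i) is my own and does not appear in the original text.

```latex
\begin{align*}
  &\min_{\hat\beta_0,\,\hat\beta_1} \sum_{i=1}^{n} \left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2 \\
  &\text{First-order conditions:}\quad
    \sum_{i=1}^{n} \left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = 0,
    \qquad
    \sum_{i=1}^{n} x_i \left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = 0 \\
  &\text{From the first condition:}\quad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \\
  &\text{Substituting into the second:}\quad
    \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
    = \hat\beta_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
  &\hat\beta_1
    = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
    = \frac{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
    = \frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(x)}
\end{align*}
```

The factor 1/(n-1) appears in both the numerator and the denominator and cancels, so the ratio of the sample covariance to the sample variance is exactly the OLS slope.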
To check that Cov(x,y)/Var(x) is identical to the slope coefficient of
an OLS model, I will reach back to the first dataset. To keep things
simple, I will use only one dependent variable (weight) and one
independent variable (Time).
OLS_lmSimpl <- lm(weight~Time,data=ChickWeight)
tab_model(OLS_lmSimpl)
| weight | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 27.47 | 21.50 – 33.43 | <0.001 |
| Time | 8.80 | 8.33 – 9.27 | <0.001 |
| Observations | 578 | ||
| R2 / R2 adjusted | 0.701 / 0.700 | ||
According to the results of this regression, each additional day since
birth is associated with an average weight gain of about 8.8 grams. Now,
let’s compare that result to a manually calculated coefficient.
Beta1 <- cov(ChickWeight$Time,ChickWeight$weight)/var(ChickWeight$Time)
print(Beta1)
## [1] 8.803039
As we can see, the two results match: 8.803039 rounds to the 8.80
reported in the regression table. This confirms numerically that
dividing Cov(x,y) by Var(x) gives the OLS slope estimate in a simple
regression.
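The same check can be repeated on the second dataset to show that the result is not specific to ChickWeight. This is a sketch under my own choice of variables (Volume regressed on Girth); fit_trees, slope_manual, and intercept_manual are illustrative names. It also recovers the intercept from the sample means.

```r
# Slope and intercept from lm()
fit_trees <- lm(Volume ~ Girth, data = trees)
slope_lm <- unname(coef(fit_trees)["Girth"])
intercept_lm <- unname(coef(fit_trees)["(Intercept)"])

# Slope as Cov(x, y) / Var(x); intercept as mean(y) - slope * mean(x)
slope_manual <- cov(trees$Girth, trees$Volume) / var(trees$Girth)
intercept_manual <- mean(trees$Volume) - slope_manual * mean(trees$Girth)

print(all.equal(slope_lm, slope_manual))
print(all.equal(intercept_lm, intercept_manual))
```

Note that this shortcut applies to a regression with a single explanatory variable; with several regressors, each slope also depends on the covariances among the regressors.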