Here is a spot of analysis, as an example of how to scientifically assess whether one way of doing things (say, a certain diet) is better than others.
We're looking here at data about some chickens. Each chicken was fed one of 4 different diets for a period of 3 weeks. Their weights were recorded on day 0, then every second day up to day 20, and finally on day 21. Let us glance at a summary.
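## Load the packages used throughout (ChickWeight itself ships with R)
library(dplyr)
library(ggplot2)
## Tabulate chickens by diet
ChickWeight %>% group_by(Diet) %>% summarize(number_of_chicken = n_distinct(Chick))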
## # A tibble: 4 x 2
## Diet number_of_chicken
## <fct> <int>
## 1 1 20
## 2 2 10
## 3 3 10
## 4 4 10
## Tabulate chickens by time for which they were tracked
ChickWeight %>% group_by(Time) %>% summarize(number_of_chicken = n_distinct(Chick))
## # A tibble: 12 x 2
## Time number_of_chicken
## <dbl> <int>
## 1 0 50
## 2 2 50
## 3 4 49
## 4 6 49
## 5 8 49
## 6 10 49
## 7 12 49
## 8 14 48
## 9 16 47
## 10 18 47
## 11 20 46
## 12 21 45
You can see from the above that 50 different chickens were tracked over a period of 3 weeks. Each was fed one of four different diets: 20 of them were fed diet 1, and 10 each were fed one of the 3 other diets.
However, not all chickens were tracked until the end of the 21 days. Of the 50 chickens, only 45 are left by day 21. In the absence of other information, the dark side of me is going to assume that 5 of them died. In my analysis, I stick to the 45 for whom I have data for the whole 21 days.
## Convert the factor variable Chick to an integer variable
ChickWeight$intchick <- as.integer(as.character(ChickWeight$Chick))
## Identify survivors: chickens that still have a measurement on day 21
survivors <- ChickWeight %>% filter(Time == 21) %>% distinct(intchick)
## Keep only the rows that belong to survivors
ChickWeight2 <- filter(ChickWeight, intchick %in% survivors$intchick)
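Now that we have a dataset of just the 45 chickens we are interested in, let us look at some plots.
Let's look at a summary of the change in weight for each chicken over the 21 days, grouped by the diet it followed. The data frame Chickweight_delta used here holds one row per chicken, with its diet in Diet.x and its change in weight in DeltaWeight.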
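The post does not show how Chickweight_delta was built. Below is a minimal sketch of one way it could be done, assuming the change in weight is the day-21 weight minus the day-0 weight and that the `.x` suffix in `Diet.x` comes from joining the two days on `Chick`. The names `cw`, `start_weights` and `end_weights` are introduced here for illustration only.

```r
library(dplyr)

# ChickWeight ships with R's datasets package; work on a plain data frame.
cw <- as.data.frame(ChickWeight)

# One row per chicken at the start and at the end of the study.
start_weights <- filter(cw, Time == 0)
end_weights   <- filter(cw, Time == 21)

# Joining on Chick alone leaves weight, Time and Diet in both tables,
# so dplyr appends .x (day 0) and .y (day 21) to those column names.
# An inner join keeps only chickens that have a day-21 measurement.
Chickweight_delta <- inner_join(start_weights, end_weights, by = "Chick") %>%
  mutate(DeltaWeight = weight.y - weight.x) %>%   # assumed definition of the change
  select(Chick, Diet.x, DeltaWeight)

head(Chickweight_delta)
```

Because the join is an inner join, chickens without a day-21 record drop out on their own, so this sketch does not depend on the survivor filter above.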
## Point = mean change per diet; line = range from the smallest to the largest change
ggplot(Chickweight_delta) +
  stat_summary(mapping = aes(x = Diet.x, y = DeltaWeight), fun.min = min, fun.max = max, fun = mean) +
  labs(title = 'Change in weight over 21 days is lowest for Diet 1 and highest for Diet 3', x = 'Diet', y = 'Change in weight (g)')
## Statistical Analysis
Visually, it may look like we've found a winner in diet 3. But how can we reach that conclusion, with a stated level of certainty, in a mathematically rigorous manner? An ANOVA, or analysis of variance, is often used in such situations. Simply put, an ANOVA tells us whether the variation in weight change between the diet groups is large compared with the variation among chickens on the same diet. After all, chickens all grow at different rates; are we sure that the differences we see are because of the difference in diet? Fitting a linear model with diet as the only predictor gives the same test: the F-statistic at the bottom of its summary is the one-way ANOVA F-test.
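To make the between-group versus within-group comparison concrete, here is a rough sketch that computes the F-statistic by hand. It assumes the change in weight is the day-21 weight minus the day-0 weight; `cw`, `delta` and the other variable names are illustrative and not from the original analysis.

```r
# Rebuild a per-chicken table with base R; merge() adds the .x/.y suffixes.
cw <- as.data.frame(ChickWeight)
delta <- merge(cw[cw$Time == 0, ], cw[cw$Time == 21, ], by = "Chick")
delta$DeltaWeight <- delta$weight.y - delta$weight.x  # assumed definition

group_means <- tapply(delta$DeltaWeight, delta$Diet.x, mean)
group_sizes <- tapply(delta$DeltaWeight, delta$Diet.x, length)
grand_mean  <- mean(delta$DeltaWeight)

# Between-group variation: how far each diet's mean sits from the grand mean.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group variation: how far each chicken sits from its own diet's mean.
own_mean  <- group_means[as.character(delta$Diet.x)]
ss_within <- sum((delta$DeltaWeight - own_mean)^2)

k <- length(group_means)  # number of diets
n <- nrow(delta)          # number of chickens

# F is the ratio of the two mean squares; large values favour a diet effect.
f_stat  <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_value <- pf(f_stat, k - 1, n - k, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

The same ratio is what `anova()` or `summary()` reports for a model of the form `DeltaWeight ~ Diet.x`.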
summary(lm( DeltaWeight ~ Diet.x, Chickweight_delta))
##
## Call:
## lm(formula = DeltaWeight ~ Diet.x, data = Chickweight_delta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142.000 -42.000 -0.667 38.813 127.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136.19 16.05 8.483 1.45e-10 ***
## Diet.x2 37.81 25.89 1.461 0.151738
## Diet.x3 93.31 25.89 3.604 0.000839 ***
## Diet.x4 61.48 26.76 2.298 0.026762 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64.22 on 41 degrees of freedom
## Multiple R-squared: 0.2558, Adjusted R-squared: 0.2014
## F-statistic: 4.698 on 3 and 41 DF, p-value: 0.006551
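In the coefficient table above, the intercept is the mean change for diet 1, and each Diet.x row is the difference between that diet's mean and diet 1's mean. The table therefore compares every diet with diet 1 only. The overall F-test (p-value 0.006551) says the diets are not all alike, but it does not say which pairs differ. One common follow-up, not part of the original analysis, is Tukey's honest significant difference test. A sketch, again assuming DeltaWeight is the day-21 weight minus the day-0 weight, with `delta`, `fit` and `pairwise` as illustrative names:

```r
# Rebuild a per-chicken table with base R; merge() adds the .x/.y suffixes.
cw <- as.data.frame(ChickWeight)
delta <- merge(cw[cw$Time == 0, ], cw[cw$Time == 21, ], by = "Chick")
delta$DeltaWeight <- delta$weight.y - delta$weight.x  # assumed definition

# aov() fits the same one-way model as lm(), but prints a classic ANOVA table.
fit <- aov(DeltaWeight ~ Diet.x, data = delta)
summary(fit)

# All six pairwise diet comparisons, with p-values adjusted for multiplicity.
pairwise <- TukeyHSD(fit, "Diet.x", conf.level = 0.95)
print(pairwise)
```

Each row of the Tukey table is one pair of diets, with the estimated difference, a confidence interval and an adjusted p-value. A pair whose interval excludes zero differs at the chosen confidence level.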