Load required packages.
library(dplyr)
library(ggplot2)
library(tidyverse)
Summary and str to view the data; each treatment group has 10 data points, with 3 different treatment groups (factors). Weight is numerical and it is a continuous variable. Thus, based on variable type, a possible method to analyse this data set is 1-way ANOVA.
{data(PlantGrowth)}
str(PlantGrowth)
## 'data.frame': 30 obs. of 2 variables:
## $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
## $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(PlantGrowth)
## weight group
## Min. :3.590 ctrl:10
## 1st Qu.:4.550 trt1:10
## Median :5.155 trt2:10
## Mean :5.073
## 3rd Qu.:5.530
## Max. :6.310
We should test for linear model assumptions.
Plot a qq plot for each treatment group to check normality of data.
PlantGrowth %>%
ggplot(aes(sample = weight, color = group)) + # 'sample' for stat_qq
stat_qq() + # Q-Q plot points
stat_qq_line() + # Q-Q line
facet_grid(rows = vars(group)) + # separate rows by group
labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_minimal()
Data looks fairly normal as they follow the straight normal distribution line, although all three groups seem to have a longer right tail. To confirm normality, use Shapiro test for the three treatment groups.
# Shapiro test for all 3 groups
shapiro.test(PlantGrowth$weight[PlantGrowth$group == "ctrl"])
##
## Shapiro-Wilk normality test
##
## data: PlantGrowth$weight[PlantGrowth$group == "ctrl"]
## W = 0.95668, p-value = 0.7475
shapiro.test(PlantGrowth$weight[PlantGrowth$group == "trt1"])
##
## Shapiro-Wilk normality test
##
## data: PlantGrowth$weight[PlantGrowth$group == "trt1"]
## W = 0.93041, p-value = 0.4519
shapiro.test(PlantGrowth$weight[PlantGrowth$group == "trt2"])
##
## Shapiro-Wilk normality test
##
## data: PlantGrowth$weight[PlantGrowth$group == "trt2"]
## W = 0.94101, p-value = 0.5643
Since the p-values for all 3 factors are greater than 0.05 significance level, we do not reject the null hypothesis that the data is normal. The data is not statistically significant from the normal distribution.
We can look at the spread (variance) of data with a boxplot.
PlantGrowth %>% ggplot() +
geom_boxplot(aes(x = group, y = weight, colour = group),
size = 1, outlier.color = "red") +
labs(x = "Treatment Group", y = "Weight (g)",
title = "Boxplot of Plants by Treatment Group") +
scale_colour_manual(values = c("darkorange", "purple", "cyan4")) +
theme_minimal() +
theme(legend.position = "none")
The spread of data between treatment groups seem to be quite similar, although there are small differences (only some slight heterogeneity of variance). Since we don’t have information about the study design, we can also assume independence of data points. Hence, ANOVA will be used as it is robust enough to handle slight violations of the homoscedasticity assumption.
H0: There is no statistically significant difference between the weight of plants in each of the treatment groups. H1: There is a statistically significant difference between the weight of plants in at least one of the treatment groups.
PlantGrowth$group <- relevel(PlantGrowth$group, ref = "trt1")
model <- lm(weight~group, data=PlantGrowth)
anova(model)
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 3.7663 1.8832 4.8461 0.01591 *
## Residuals 27 10.4921 0.3886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F ratio is 4.85 and the P-value is 0.016. As the P-value is lower than significance level (alpha) of 0.05, we reject H0. There is a statistically significant difference between the weight of plants in at least one of the treatment groups. The F ratio is calculated using the Mean square between groups/Mean square within groups. Hence as the F ratio is 4.85, it means that the mean square between groups is larger than the mean square within groups, and that there may be statistically significant differences between the weight of plants in different treatment groups.
The mean, sum of squared deviations (ssd), standard error (se), lower and upper confidence intervals are calculated per treatment type and placed into a dataframe in order to be retrieved later on.
# Using dplyr functions
plantgrowth_summary <-
PlantGrowth %>%
group_by(group) %>%
summarise(count = n(),
mean = mean(weight, na.rm = TRUE),
ssd = sd(weight, na.rm = TRUE)) %>%
mutate(se = ssd / sqrt(count),
lower_ci = mean - qt(1 - (0.05 / 2), count - 1) * se,
upper_ci = mean + qt(1 - (0.05 / 2), count - 1) * se)
# Nice table output
knitr::kable(plantgrowth_summary)
| group | count | mean | ssd | se | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| trt1 | 10 | 4.661 | 0.7936757 | 0.2509823 | 4.093239 | 5.228761 |
| ctrl | 10 | 5.032 | 0.5830914 | 0.1843897 | 4.614882 | 5.449118 |
| trt2 | 10 | 5.526 | 0.4425733 | 0.1399540 | 5.209402 | 5.842598 |
The summary dataframe created above is now used to find out which treatment is different compared to the others.
plantgrowth_summary %>% ggplot() +
geom_point(aes(x = group, y = mean, colour = group), size = 3) +
geom_errorbar(aes(x = group, ymin = lower_ci, ymax = upper_ci, colour = group),
width = 0.2, linewidth = 1) +
labs(x = "Treatment Group", y = "Weight (g)",
title = "Mean Weight of Plants by Treatment Group") +
scale_colour_manual(values = c("darkorange", "purple", "cyan4")) +
theme_minimal() +
theme(legend.position = "none")
From the graph above, the confidence intervals of treatment 1 and control overlap quite a bit, hence there is no statistically significant difference between them. The confidence intervals of control and treatment 2 also overlaps to an extent, and there may possibly be no statistically significant difference between them. The confidence intervals of treatment 1 and treatment 2 largely do not overlap, hence there is a statistically significant difference between the plant weights of treatment 1 and treatment 2.