We’ll use the PlantGrowth data set that comes with R. It provides the weight of plants produced under two distinct treatment conditions and a control condition.
data <- PlantGrowth
We utilize the function sample n() [in the dplyr package] to get a sense of how the data looks.
set.seed(123)
dplyr::sample_n(data, 10)
weight group
1 5.87 trt1
2 4.32 trt1
3 3.59 trt1
4 5.18 ctrl
5 5.14 ctrl
6 4.89 trt1
7 5.12 trt2
8 4.81 trt1
9 4.50 ctrl
10 4.69 trt1
The column “group” is known as a factor in R, while the different categories (“ctr”, “trt1”, “trt2”) are known as factor levels. The levels are listed in alphabetical order.
levels(data$group)
[1] "ctrl" "trt1" "trt2"
If the levels are not in the correct order automatically, reorder them as follows:
data$group <- ordered(data$group,
levels = c("ctrl", "trt1", "trt2"))
The dplyr package can be used to compute summary statistics (mean and sd) by groups.
Calculate summary statistics by groups – count, mean, and standard deviation:
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
group_by(data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)
# A tibble: 3 × 4
group count mean sd
<ord> <int> <dbl> <dbl>
1 ctrl 10 5.03 0.583
2 trt1 10 4.66 0.794
3 trt2 10 5.53 0.443
Read R base graphs to learn how to utilize them. For an easy ggplot2-based data visualization, we’ll use the ggpubr R tool.
Visualize your data with ggpubr:
library("ggpubr")
Warning: package 'ggpubr' was built under R version 4.2.1
Loading required package: ggplot2
Warning: package 'ggplot2' was built under R version 4.2.1
ggboxplot(data, x = "group", y = "weight",
color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
order = c("ctrl", "trt1", "trt2"),
ylab = "Weight", xlab = "Treatment")
Add error bars: mean_se
library("ggpubr")
ggline(data, x = "group", y = "weight",
add = c("mean_se", "jitter"),
order = c("ctrl", "trt1", "trt2"),
ylab = "Weight", xlab = "Treatment")
boxplot(weight ~ group, data = data,
xlab = "Treatment", ylab = "Weight",
frame = FALSE, col = c("#00AFBB", "#E7B800", "#FC4E07"))
library("gplots")
Warning: package 'gplots' was built under R version 4.2.1
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
plotmeans(weight ~ group, data = data, frame = FALSE,
xlab = "Treatment", ylab = "Weight",
main="Mean Plot with 95% CI")
Warning in plot.xy(xy.coords(x, y), type = type, ...): "frame" is not a
graphical parameter
Warning in axis(1, at = 1:length(means), labels = legends, ...): "frame" is not
a graphical parameter
Warning in plot.xy(xy.coords(x, y), type = type, ...): "frame" is not a
graphical parameter
We want to see if the average weights of the plants in the three experimental circumstances vary significantly.
This question can be answered using the R function aov(). Summary of the function The analysis of the variance model is summarised using aov().
Perform an analysis of variance.
res.aov <- aov(weight ~ group, data = data)
Summary of the analysis
summary(res.aov)
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.766 1.8832 4.846 0.0159 *
Residuals 27 10.492 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The columns F value and Pr(>F) in the output corresponding to the p-value of the test.
The results of one-way ANOVA testing should be interpreted.
We can conclude that there are significant differences between the groups highlighted with “*” in the model summary because the p-value is less than the significance level of 0.05.
We can compute Tukey HSD (Tukey Honest Significant Differences, R function: TukeyHSD()) for doing numerous pairwise comparisons between the means of groups because the ANOVA test is significant.
The fitted ANOVA is passed to TukeyHD() as an input.
TukeyHSD(res.aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = data)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
lwr, upr: the lower and upper-end points of the confidence interval at 95 percent (default) p adj: p-value after multiple comparisons adjustment.
Only the difference between trt2 and trt1 is significant, as shown by the output, with an adjusted p-value of 0.012.
Using the multcomp package, perform several comparisons.
The function glht() [in the multcomp package] can be used to do multiple comparison processes for an ANOVA. General linear hypothesis tests are abbreviated as glht. The following is a simplified format:
library(multcomp)
Warning: package 'multcomp' was built under R version 4.2.1
Loading required package: mvtnorm
Loading required package: survival
Loading required package: TH.data
Warning: package 'TH.data' was built under R version 4.2.1
Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
Attaching package: 'TH.data'
The following object is masked from 'package:MASS':
geyser
glht(model, lincft)
model: a model that has been fitted, such as an object returned by aov ().
lincft() specifies the linear hypotheses that will be tested. Objects provided from the function mcp are used to specify multiple comparisons in ANOVA models ().
For a one-way ANOVA, use glht() to make numerous pairwise comparisons:
library(multcomp)
summary(glht(res.aov, linfct = mcp(group = "Tukey")))
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: aov(formula = weight ~ group, data = data)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.3908
trt2 - ctrl == 0 0.4940 0.2788 1.772 0.1979
trt2 - trt1 == 0 0.8650 0.2788 3.103 0.0121 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
Pairwise comparisons across group levels with various testing corrections can also be calculated using the pairewise.t.test() function.
pairwise.t.test(data$weight, data$group,
p.adjust.method = "BH")
Pairwise comparisons using t tests with pooled SD
data: data$weight and data$group
ctrl trt1
trt1 0.194 -
trt2 0.132 0.013
P value adjustment method: BH
The output is a table of pairwise comparison p-values. The Benjamini-Hochberg method was used to alter the p-values in this case.
The residuals versus fits plot can be used to assess variance homogeneity.
There are no clear correlations between residuals and fitted values (the mean of each group) in the plot below, which is good. As a result, we can assume that the variances are homogeneous.
plot(res.aov, 1)
ANOVA (one-way) Outliers are discovered in R Points 17, 15, 4, which can
have a significant impact on normality and homogeneity of variance.
To meet the test assumptions, it can be beneficial to remove outliers.
The homogeneity of variances can also be checked using Bartlett’s or Levene’s tests.
Levene’s test is recommended since it is less sensitive to deviations from normal distribution. The leveneTest() method [from the car package] will be used:
library(car)
Warning: package 'car' was built under R version 4.2.1
Loading required package: carData
Warning: package 'carData' was built under R version 4.2.1
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
leveneTest(weight ~ group, data = data)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 1.1192 0.3412
27
The p-value is not less than the significance level of 0.05, as seen in the output above. This indicates that there is no indication that the variance across groups is statistically significant.
As a result, we can infer that the variations in the different treatment groups are homogeneous.
ANOVA test with no equal variance assumption
oneway.test(weight ~ group, data = data)
One-way analysis of means (not assuming equal variances)
data: weight and group
F = 5.181, num df = 2.000, denom df = 17.128, p-value = 0.01739
Pairwise t-tests with no assumption of equal variances
pairwise.t.test(data$weight, data$group,
p.adjust.method = "BH", pool.sd = FALSE)
Pairwise comparisons using t tests with non-pooled SD
data: data$weight and data$group
ctrl trt1
trt1 0.250 -
trt2 0.072 0.028
P value adjustment method: BH
The residuals’ normal probability plot is used to verify that the residuals are normally distributed. It should be about in a straight line.
plot(res.aov, 2)
ANOVA (one-way) R-based test
We can infer normality because all of the points lie roughly along this reference line.
aov_residuals <- residuals(object = res.aov )
Run Shapiro-Wilk test
Shapiro-Wilk normality test
shapiro.test(x = aov_residuals )
Shapiro-Wilk normality test
data: aov_residuals
W = 0.96607, p-value = 0.4379
The Shapiro-Wilk test on the ANOVA residuals (W = 0.96, p = 0.43), which finds no evidence of normality violation, supports the previous conclusion.
ANOVA test with a non-parametric alternative
The Kruskal-Wallis rank-sum test is a non-parametric alternative to one-way ANOVA that can be employed when the ANOVA assumptions are not met.
kruskal.test(weight ~ group, data = data)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842