A one-way ANOVA is a statistical method to compare means across three or more independent groups to determine if at least one group mean significantly differs from the others. This analysis is crucial when dealing with categorical independent variables and a continuous dependent variable.
For our ANOVA example, we’ll employ ‘proficiency level’ as the categorical predictor variable and ‘writing scores’ as the response variable. To simplify our analysis, we’ll focus on three out of the five proficiency levels present in our dataset.
# Load the libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr) # redundant here: dplyr is already attached as part of the tidyverse
# Load the dataset
dta_rw <- read.csv("readwrite2.csv")
It is essential to ensure that the ‘proficiency level’ variable is treated as categorical within R.
# Check the data type for the predictor
class(dta_rw$level)
## [1] "integer"
Uh oh, it's not a factor. Let's fix that by converting it to one.
dta_rw$level <- factor(dta_rw$level)
class(dta_rw$level)
## [1] "factor"
# Filter the data we need
dta_rw <- dta_rw %>%
  filter(level %in% c(1, 2, 3)) # keep only proficiency levels 1-3
# Verify the data is what we want.
# Display the unique levels of the 'level' variable from the dataset 'dta_rw'
unique(dta_rw$level)
## [1] 3 2 1
## Levels: 1 2 3 4 5
The unique() call on the level column of our dataset dta_rw returns the values present as "1", "2", and "3". The output also lists all five defined levels (1 to 5): filtering rows does not drop unused factor levels, so levels 4 and 5 still appear in the factor's definition even though no rows carry them. This confirms that our predictor variable level is indeed a categorical factor, and that only the first three levels will enter our one-way ANOVA. In R, factors are used to represent categorical variables and are essential for correctly specifying a model in ANOVA.
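If you would rather the factor carry only the three retained levels, base R's droplevels function removes the unused ones; a small optional step (it does not change any of the results below):
# Optional: drop the now-unused levels 4 and 5 from the factor definition
dta_rw$level <- droplevels(dta_rw$level)
levels(dta_rw$level) # now just "1" "2" "3"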
Let's now compute the descriptives for the writing scores within each proficiency level!
dta_rw %>%
  group_by(level) %>%
  summarize(count = n(),
            mean = mean(writing),
            sd = sd(writing))
## # A tibble: 3 × 4
## level count mean sd
## <fct> <int> <dbl> <dbl>
## 1 1 226 10.3 2.13
## 2 2 342 13.3 2.19
## 3 3 524 16.5 2.28
Numbers are great, but graphs may be more informative and clear. To this end, we'll create an informative graph using the ggplot2 package. First, though, we will generate a new variable with more descriptive labels, enhancing the interpretability of our graph. Second, we will make sure the factor levels are logically ordered, with 'Novice' preceding 'Intermediate Low', which in turn comes before 'Intermediate High'.
dta_rw <- dta_rw %>%
  mutate(level_rename = factor(case_when(level == 1 ~ "Novice",
                                         level == 2 ~ "Intermediate Low",
                                         level == 3 ~ "Intermediate High"),
                               levels = c("Novice", "Intermediate Low", "Intermediate High")))
library(ggplot2) # already attached with the tidyverse; shown here for clarity
ggplot(data = dta_rw, aes(x = level_rename, y = writing)) +
  geom_boxplot() + # add the boxplot
  geom_point(position = position_jitter(width = .1), alpha = .08, size = 2) + # add individual data points
  stat_summary(fun = mean, geom = "point", color = "darkred") + # add the mean as a point
  stat_summary(fun = mean, geom = "line", aes(group = 1), color = "darkred") + # connect the group means (group = 1 treats the means as one series)
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.3, color = "darkred") + # bootstrapped 95% CI error bars (needs the Hmisc package installed)
  labs(x = "Proficiency level",
       y = "Writing scores") # rename x- and y-axis
Please ensure that the descriptive statistics presented align with the graphical representations and the results from the forthcoming one-way ANOVA. Consistency across these elements is key for accurate interpretation and discussion of our findings.
The basic ANOVA function is aov. It's actually built on top of lm. That is, it uses lm to calculate everything, but converts the output to standard ANOVA output.
output1 <- aov(writing ~ level, data = dta_rw)
summary(output1)
## Df Sum Sq Mean Sq F value Pr(>F)
## level 2 6333 3166 641.3 <2e-16 ***
## Residuals 1089 5376 5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
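To see the lm connection concretely, here is a minimal sketch (fit_lm is just an illustrative name): fitting the same model with lm and passing it to anova should reproduce the table above.
# Same model via lm(); anova() on the lm fit gives the same ANOVA table
fit_lm <- lm(writing ~ level, data = dta_rw)
anova(fit_lm)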
Let's go through the logic of null hypothesis testing for the omnibus test for proficiency level. The null hypothesis is that all three population means of writing scores are equal:
\(H_0: \mu_1 = \mu_2 = \mu_3\)
The alternative is that at least one population mean differs from the others. Assuming the null hypothesis is true (along with the usual ANOVA assumptions of independence, normality, and homogeneity of variance), the sampling distribution of the F-ratio is an F distribution with \(k - 1 = 2\) and \(N - k = 1089\) degrees of freedom - the Df values in the table above.
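As a sketch of that logic, we can recompute the p-value by hand from the mean squares in the table above, using base R's pf:
# F-ratio = MS_between / MS_within, referred to an F(2, 1089) distribution
f_ratio <- 3166.318 / 4.937
pf(f_ratio, df1 = 2, df2 = 1089, lower.tail = FALSE) # effectively zero, matching Pr(>F) < 2e-16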
The best ANOVA effect size option is the anova_stats function in the sjstats package.
library(sjstats)
anova_stats(output1)
## etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f
## ------------------------------------------------------------------------
## 0.541 | 0.541 | 0.540 | 0.540 | 0.540 | 1.085
## | | | | |
##
## etasq | term | sumsq | df | meansq | statistic | p.value | power
## ----------------------------------------------------------------------------
## 0.541 | level | 6332.636 | 2 | 3166.318 | 641.339 | < .001 | 1
## | Residuals | 5376.443 | 1089 | 4.937 | | |
The output can be a little ugly because of line breaks. We can use the kable function from knitr and the kable_styling function from kableExtra to fix that.
knitr::kable(anova_stats(output1)) %>%
kableExtra::kable_styling()
| | etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f | term | sumsq | df | meansq | statistic | p.value | power |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| level | 0.541 | 0.541 | 0.54 | 0.54 | 0.54 | 1.085 | level | 6332.636 | 2 | 3166.318 | 641.339 | 0 | 1 |
| 1 | NA | NA | NA | NA | NA | NA | Residuals | 5376.443 | 1089 | 4.937 | NA | NA | NA |
The most important effect sizes are \(\eta^2\), \(\omega^2\), and \(\epsilon^2\). You can get these as specific output from anova_stats.
anova_stats(output1)$etasq[1]
## [1] 0.541
anova_stats(output1)$omegasq[1]
## [1] 0.54
anova_stats(output1)$epsilonsq[1]
## [1] 0.54
Here, \(\eta^2\) is 0.541. Partial \(\eta^2\) is also 0.541 because there is only one predictor, so there is no partialling out any other variables.
\(\omega^2\) is 0.54.
\(\epsilon^2\) is 0.54.
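These estimates are easy to verify by hand. For example, \(\eta^2\) is just the ratio of the effect sum of squares to the total sum of squares, using the values from the ANOVA table:
# Eta-squared by hand: SS_effect / SS_total
ss_level <- 6332.636
ss_resid <- 5376.443
ss_level / (ss_level + ss_resid) # ~0.541, matching anova_stats()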
According to the generally accepted guidelines for interpretation (roughly, \(\eta^2\) around .01 is small, .06 is medium, and .14 is large), all of the effect size estimates indicate that the effect is very large.
There are several tests of homogeneity of variance. The most common are the Bartlett, Levene, and Fligner-Killeen tests. Levene is the most widely used, but Bartlett tends to perform best. All have homogeneity of variance as the null hypothesis. Levene's test is easy to get from the car package; the Bartlett and Fligner-Killeen tests are built into base R's stats package.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
bartlett.test(writing ~ level, data = dta_rw)
##
## Bartlett test of homogeneity of variances
##
## data: writing by level
## Bartlett's K-squared = 1.6871, df = 2, p-value = 0.4302
leveneTest(writing ~ level, data = dta_rw)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.3894 0.6776
## 1089
fligner.test(writing ~ level, data = dta_rw)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: writing by level
## Fligner-Killeen:med chi-squared = 0.90732, df = 2, p-value = 0.6353
The p-values are >.05 for all three. There is no evidence for heterogeneity of variance.
Post-hoc tests can be done with the pairwise.t.test function along with a choice for the p.adjust.method. LSD (Least Significant Difference) uses no p-value adjustment. It is considered overly liberal.
pairwise.t.test(dta_rw$writing,
                dta_rw$level,
                p.adjust.method = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dta_rw$writing and dta_rw$level
##
## 1 2
## 2 <2e-16 -
## 3 <2e-16 <2e-16
##
## P value adjustment method: none
The output is kind of ugly but simple and succinct. Each element in the matrix output is the p-value for a pairwise comparison. For example, the p-value for comparing level 1 and level 2 is <2e-16. The post-hoc test with LSD says these three levels are different on the (population) mean of writing scores. But, LSD makes no adjustment for multiple comparisons.
Now let’s do a very conservative Bonferroni post-hoc test.
pairwise.t.test(dta_rw$writing,
                dta_rw$level,
                p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dta_rw$writing and dta_rw$level
##
## 1 2
## 2 <2e-16 -
## 3 <2e-16 <2e-16
##
## P value adjustment method: bonferroni
Now, all pairwise comparisons are still significant, which makes sense because the p-values we observed earlier are very small. But note that Bonferroni is very conservative - it makes a large correction to ensure there is no chance the FWER (familywise error rate) is above .05, at the expense of having little power. The sketch below shows what the adjustment actually does.
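As a minimal illustration (the raw p-values here are hypothetical, chosen just to show the mechanics), base R's p.adjust function applies the same corrections that pairwise.t.test uses internally. Bonferroni multiplies each p-value by the number of comparisons, capping the result at 1:
p_raw <- c(0.001, 0.02, 0.04) # hypothetical raw p-values for three comparisons
p.adjust(p_raw, method = "bonferroni") # each p-value multiplied by 3 (capped at 1)
Here's another common one - Benjamini-Hochberg.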
pairwise.t.test(dta_rw$writing,
                dta_rw$level,
                p.adjust.method = "BH")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dta_rw$writing and dta_rw$level
##
## 1 2
## 2 <2e-16 -
## 3 <2e-16 <2e-16
##
## P value adjustment method: BH
The same. All pairwise comparisons are significant. Benjamini-Hochberg controls the false discovery rate (FDR) rather than the FWER, which strikes a better balance between error control and power than LSD or Bonferroni.
In my experience, if you had to pick one post-hoc test that seems to do well under the most conditions, you would pick the Tukey test. But it can't be done with pairwise.t.test. Thankfully, there's a nice package - multcomp - that does Tukey and a lot more. The syntax is weird, so if you do a Tukey post-hoc test, just copy and paste this code and change output1 to whatever you named your output from the aov function.
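As a quick cross-check, base R's TukeyHSD function also accepts an aov fit directly and should report the same Tukey-adjusted comparisons along with family-wise confidence intervals:
TukeyHSD(output1) # base R's Tukey HSD, applied directly to the aov fit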
library(multcomp)
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
ph <- glht(output1, linfct = mcp(level = "Tukey"))
summary(ph)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = writing ~ level, data = dta_rw)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 2 - 1 == 0 2.9356 0.1905 15.41 <2e-16 ***
## 3 - 1 == 0 6.1186 0.1768 34.60 <2e-16 ***
## 3 - 2 == 0 3.1830 0.1545 20.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
The first column is which levels are being compared (with null hypothesis that the difference in means is 0). The second column is the observed difference in the sample means. The third column is the estimated SE of the differences in means. The fourth column is the t-value for the mean difference. The fifth column is p-value after the Tukey adjustment.
We can calculate the confidence intervals for the differences in each comparison. Here is how you get them from the multcomp package.
confint(ph)
##
## Simultaneous Confidence Intervals
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = writing ~ level, data = dta_rw)
##
## Quantile = 2.3424
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## 2 - 1 == 0 2.9356 2.4894 3.3817
## 3 - 1 == 0 6.1186 5.7044 6.5328
## 3 - 2 == 0 3.1830 2.8212 3.5448
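If you prefer a picture, multcomp also provides a plot method for these simultaneous intervals; a minimal sketch:
plot(confint(ph)) # plot the simultaneous 95% confidence intervals for each contrast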
When homogeneity of variance is violated, research suggests that Dunnett's T3 is the best post-hoc test. It is an uncommon post-hoc test and requires another package, DescTools. One caveat: the DunnettTest function in DescTools actually runs Dunnett's many-to-one test, which compares each group against a single control group (the first factor level by default). That is why the output below reports only the comparisons with level 1, not all pairwise comparisons; see the sketch after the output for an all-pairwise alternative.
library(DescTools)
##
## Attaching package: 'DescTools'
## The following object is masked from 'package:car':
##
## Recode
DunnettTest(x = dta_rw$writing, g = dta_rw$level)
##
## Dunnett's test for comparing several treatments with a control :
## 95% family-wise confidence level
##
## $`1`
## diff lwr.ci upr.ci pval
## 2-1 2.935569 2.517813 3.353325 2.2e-15 ***
## 3-1 6.118608 5.730791 6.506424 < 2e-16 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both comparisons with the control group (level 1) are significant, as evidenced by the fact that neither 95% confidence interval includes zero.
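If you need Dunnett's T3 proper, with all pairwise comparisons and no equal-variance assumption, one option - an assumption on my part, so verify it suits your data - is the dunnettT3Test function in the PMCMRplus package:
# Sketch, assuming the PMCMRplus package is installed (not used elsewhere in this tutorial)
PMCMRplus::dunnettT3Test(writing ~ level, data = dta_rw) # all pairwise comparisons, unequal variances allowed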
Welch’s one-way test is the most common version of ANOVA without an assumption of homogeneity of variance.
oneway.test(writing ~ level, data = dta_rw)
##
## One-way analysis of means (not assuming equal variances)
##
## data: writing and level
## F = 654.56, num df = 2.00, denom df = 580.57, p-value < 2.2e-16
The interpretation of the one-way test works exactly the same as the regular ANOVA. As with Welch's t-test versus the independent-samples t-test, the price of not assuming homogeneity of variance is lower power.
The nonparametric version of ANOVA is the Kruskal-Wallis test. Because it operates on the ranks of the scores rather than the raw values, it makes no assumption that the residuals are normally distributed. The price is lower power, sometimes substantially lower. Make your choice.
kruskal.test(writing ~ level, data = dta_rw)
##
## Kruskal-Wallis rank sum test
##
## data: writing by level
## Kruskal-Wallis chi-squared = 597.47, df = 2, p-value < 2.2e-16
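If you also want an effect size to report alongside the Kruskal-Wallis test, one option - assuming the rstatix package, which is not used elsewhere in this tutorial - is its kruskal_effsize function, which computes an eta-squared estimate from the H statistic:
# Sketch, assuming the rstatix package is installed
rstatix::kruskal_effsize(dta_rw, writing ~ level) # eta-squared based on H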
There are versions of post-hoc tests following the Kruskal-Wallis test. The most common and, as far as I can tell from my limited review of the research, the best performing is the Dunn test. It is done in the FSA package. Here, I'm doing the Dunn test with a Benjamini-Hochberg correction.
library(FSA)
## Registered S3 methods overwritten by 'FSA':
## method from
## confint.boot car
## hist.boot car
## ## FSA v0.9.6. See citation('FSA') if used in publication.
## ## Run fishR() for related website and fishR('IFAR') for related book.
##
## Attaching package: 'FSA'
## The following object is masked from 'package:car':
##
## bootCase
## The following object is masked from 'package:sjstats':
##
## se
dunnTest(writing ~ level, data = dta_rw, method="bh")
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Benjamini-Hochberg method.
## Comparison Z P.unadj P.adj
## 1 1 - 2 -9.423624 4.357803e-21 4.357803e-21
## 2 1 - 3 -23.278454 7.328991e-120 2.198697e-119
## 3 2 - 3 -15.028200 4.798695e-51 7.198042e-51