Introduction

A one-way ANOVA is a statistical method for comparing means across three or more independent groups to determine whether at least one group mean differs significantly from the others. It applies when you have a single categorical independent variable and a continuous dependent variable.

Descriptives important for ANOVA

For our ANOVA example, we’ll employ ‘proficiency level’ as the categorical predictor variable and ‘writing scores’ as the response variable. To simplify our analysis, we’ll focus on three out of the five proficiency levels present in our dataset.

# Load the libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

# Load the dataset
dta_rw <- read.csv("readwrite2.csv")

It is essential to ensure that the ‘proficiency level’ variable is treated as categorical within R.

# Check the data type for the predictor
class(dta_rw$level)
## [1] "integer"

Uh oh, it’s not a factor. Let’s fix that by converting it to one.

dta_rw$level <- factor(dta_rw$level)
class(dta_rw$level)
## [1] "factor"
# Filter the data we need: keep only proficiency levels 1-3
dta_rw <- dta_rw %>%
  filter(level %in% c(1, 2, 3))
  
# Verify the data is what we want. 
# Display the unique values of the 'level' variable from the dataset 'dta_rw'
unique(dta_rw$level)
## [1] 3 2 1
## Levels: 1 2 3 4 5

The unique() call on the level column of dta_rw shows that only the values “1”, “2”, and “3” are present. The output also shows that the factor still has five defined levels (1 to 5), of which only the first three will be used in our one-way ANOVA. This confirms that our predictor variable level is indeed a categorical factor with multiple levels. In R, factors are used to represent categorical variables and are essential for correctly specifying a model in ANOVA.
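
If you want the factor itself to carry only the three retained levels, base R’s droplevels() strips the empty ones. This is optional and not part of the original pipeline (R’s modeling functions drop unused factor levels on their own), so treat the sketch below as a side note:

# Optional tidying: drop the now-empty levels 4 and 5 left over after filtering
levels(droplevels(dta_rw$level)) # "1" "2" "3"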

Let’s now compute the descriptives for the predictor!

dta_rw %>%
  group_by(level) %>%
  summarize(count = n(),
            mean = mean(writing),
            sd = sd(writing))
## # A tibble: 3 × 4
##   level count  mean    sd
##   <fct> <int> <dbl> <dbl>
## 1 1       226  10.3  2.13
## 2 2       342  13.3  2.19
## 3 3       524  16.5  2.28

Numbers are great, but a graph is often clearer and more informative. To this end, we’ll build one with the ggplot2 package. First, though, we will generate a new variable with more descriptive labels to make the graph easier to read. Second, we will make sure the factor levels are logically ordered: ‘Novice’ before ‘Intermediate Low’, which in turn comes before ‘Intermediate High’.

dta_rw <- dta_rw %>%
  mutate(level_rename = factor(case_when(
    level == 1 ~ "Novice",
    level == 2 ~ "Intermediate Low",
    level == 3 ~ "Intermediate High"),
    levels = c("Novice", "Intermediate Low", "Intermediate High")))

library(ggplot2)

ggplot(data = dta_rw, aes(x = level_rename, 
                          y = writing)) +
  geom_boxplot() + # add the boxplot
  geom_point(position = position_jitter(width = .1), alpha = .08, size = 2) + # add individual data points
  stat_summary(fun = mean, geom = "point", color = "darkred") + # add the group mean as a point
  stat_summary(fun = mean, geom = "line", aes(group = 1), color = "darkred") + # connect the group means with a line (group = 1 treats all means as one series)
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.3, color = "darkred") + # add bootstrapped 95% CI error bars (mean_cl_boot needs the Hmisc package installed)
  labs(x = "Proficiency level",
       y = "Writing scores") # rename x- and y-axis

Please ensure that the descriptive statistics presented align with the graphical representations and the results from the forthcoming one-way ANOVA. Consistency across these elements is key for accurate interpretation and discussion of our findings.

Running the ANOVA

The basic ANOVA function is aov. It’s actually built on top of lm: it uses lm to fit the model, then presents the results as a standard ANOVA table.

output1 <- aov(writing ~ level, data = dta_rw)
summary(output1)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## level          2   6333    3166   641.3 <2e-16 ***
## Residuals   1089   5376       5                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
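
Since aov() hands the computation to lm(), you can check the equivalence yourself; this quick sketch (not in the original) reproduces the same F table by fitting the model with lm() directly:

# Equivalent fit via lm(); anova() prints the same df, sums of squares, F, and p-value
fit_lm <- lm(writing ~ level, data = dta_rw)
anova(fit_lm)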

Let’s go through the logic of null hypothesis testing for the omnibus test for proficiency level.

  1. The test statistic is an F-ratio.
  2. Assume the null hypothesis is true: all group means are the same.

\(H_0: \mu_1 = \mu_2 = \mu_3\)

  3. The sampling distribution requires some assumptions:
  • random sampling
  • independence of observations
  • normality
  • absence of outliers
  • homogeneity of variance

Under those assumptions, and assuming the null hypothesis is true, the sampling distribution of the F-ratio is an F distribution with

  • numerator df = # groups - 1 = 3 - 1 = 2
  • denominator df = N - # groups = 1092 - 3 = 1089
  4. The observed F-ratio is 641.3. The p-value is <2e-16. That is, if the null hypothesis were true, the probability of getting an F this big or bigger would be less than 2e-16, which is vanishingly small.
  5. The p-value is < α = .05, so we reject the null hypothesis. There is evidence that the (population) means are not all equal; at least one group mean differs from the others.
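
To make the arithmetic in steps 4 and 5 concrete, both the tail probability and the α = .05 critical value come straight from the F distribution in base R; a small sketch (not in the original):

# Probability of an F-ratio of 641.3 or larger under F(2, 1089)
pf(641.3, df1 = 2, df2 = 1089, lower.tail = FALSE) # essentially 0

# Critical value the observed F is compared against at alpha = .05
qf(0.95, df1 = 2, df2 = 1089) # about 3.00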

Effect size

The most convenient one-stop option for ANOVA effect sizes is the anova_stats function in the sjstats package.

library(sjstats)

anova_stats(output1)
## etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f
## ------------------------------------------------------------------------
## 0.541 |         0.541 |   0.540 |           0.540 |     0.540 |    1.085
##       |               |         |                 |           |         
## 
## etasq |      term |    sumsq |   df |   meansq | statistic | p.value | power
## ----------------------------------------------------------------------------
## 0.541 |     level | 6332.636 |    2 | 3166.318 |   641.339 |  < .001 |     1
##       | Residuals | 5376.443 | 1089 |    4.937 |           |         |

The output can be a little ugly because of line breaks. We can use the kable function from knitr and the kable_styling function from kableExtra to fix that.

knitr::kable(anova_stats(output1)) %>%
  kableExtra::kable_styling()
| etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f | term | sumsq | df | meansq | statistic | p.value | power |
|-------|---------------|---------|-----------------|-----------|----------|-----------|----------|------|----------|-----------|---------|-------|
| 0.541 | 0.541 | 0.54 | 0.54 | 0.54 | 1.085 | level | 6332.636 | 2 | 3166.318 | 641.339 | 0 | 1 |
| NA | NA | NA | NA | NA | NA | Residuals | 5376.443 | 1089 | 4.937 | NA | NA | NA |

The most important effect sizes are \(\eta^2\), \(\omega^2\), and \(\epsilon^2\). You can get these as specific output from anova_stats.

anova_stats(output1)$etasq[1]
## [1] 0.541
anova_stats(output1)$omegasq[1]
## [1] 0.54
anova_stats(output1)$epsilonsq[1]
## [1] 0.54

Here, \(\eta^2\) is 0.541. Partial \(\eta^2\) is also 0.541 because there is only one predictor, so there is no partialling out any other variables.

\(\omega^2\) is 0.54.

\(\epsilon^2\) is 0.54.
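
For reference, \(\eta^2\) is simply the between-groups sum of squares over the total, and \(\omega^2\) applies a small bias correction; these are the standard formulas, and plugging in the values from the ANOVA table above reproduces the anova_stats output:

\[
\eta^2 = \frac{SS_{\text{level}}}{SS_{\text{level}} + SS_{\text{residuals}}} = \frac{6332.636}{6332.636 + 5376.443} \approx 0.541
\]

\[
\omega^2 = \frac{SS_{\text{level}} - df_{\text{level}} \cdot MS_{\text{residuals}}}{SS_{\text{total}} + MS_{\text{residuals}}} = \frac{6332.636 - 2 \times 4.937}{11709.079 + 4.937} \approx 0.540
\]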

According to the generally accepted guidelines for interpretation:

  • <.01 = negligible
  • .01 - .06 = small
  • .06 - .14 = medium
  • > .14 = large

All of the effect size estimates (about .54) sit far above the .14 cutoff, indicating a very large effect.

Homogeneity of variance

There are several tests of homogeneity of variance. The most common are the Bartlett, Levene, and Fligner tests. Levene is the most widely used, but Bartlett tends to perform best. All three have homogeneity of variance as the null hypothesis. Levene’s test comes from the car package; Bartlett’s and Fligner’s tests are built into base R.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
bartlett.test(writing ~ level, data = dta_rw)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  writing by level
## Bartlett's K-squared = 1.6871, df = 2, p-value = 0.4302
leveneTest(writing ~ level, data = dta_rw)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    2  0.3894 0.6776
##       1089
fligner.test(writing ~ level, data = dta_rw)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  writing by level
## Fligner-Killeen:med chi-squared = 0.90732, df = 2, p-value = 0.6353

The p-values are >.05 for all three. There is no evidence for heterogeneity of variance.
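
Alongside the formal tests, a quick visual check is to plot the residuals against the fitted values, where the vertical spread should look similar across the three groups. A one-line sketch (not part of the original analysis), using base R’s plot method for fitted models:

# Residuals vs. fitted values: similar spread across the three groups
# is consistent with homogeneity of variance
plot(output1, which = 1)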

Post-hoc tests

Post-hoc tests can be done with the pairwise.t.test function along with a choice for the p.adjust.method. LSD (Fisher’s Least Significant Difference) uses no p-value adjustment and is considered overly liberal.

pairwise.t.test(dta_rw$writing,
                dta_rw$level,
                p.adjust.method = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  dta_rw$writing and dta_rw$level 
## 
##   1      2     
## 2 <2e-16 -     
## 3 <2e-16 <2e-16
## 
## P value adjustment method: none

The output is kind of ugly but simple and succinct. Each element in the matrix output is the p-value for a pairwise comparison. For example, the p-value for comparing level 1 and level 2 is <2e-16. The post-hoc test with LSD says all three levels differ from one another on the (population) mean of writing scores. But LSD makes no adjustment for multiple comparisons.

Now let’s do a very conservative Bonferroni post-hoc test.

pairwise.t.test(dta_rw$writing,
                dta_rw$level,
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  dta_rw$writing and dta_rw$level 
## 
##   1      2     
## 2 <2e-16 -     
## 3 <2e-16 <2e-16
## 
## P value adjustment method: bonferroni

Now, all pairwise comparisons are still significant, which makes sense because the p-values we observed earlier are very small. But note that Bonferroni is very conservative: it multiplies each p-value by the number of comparisons, guaranteeing that the FWER (familywise error rate) stays at or below .05, at the expense of power. Here’s another common one, Benjamini-Hochberg.
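
Before running it on our data, here is a toy sketch of what these adjustments do to a set of hypothetical p-values, using base R’s p.adjust:

# Three made-up p-values for illustration
p <- c(0.001, 0.012, 0.030)
p.adjust(p, method = "bonferroni") # each p multiplied by 3 (capped at 1)
p.adjust(p, method = "BH")         # step-up adjustment controlling the false discovery rate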

pairwise.t.test(dta_rw$writing,
                dta_rw$level,
                p.adjust.method = "BH")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  dta_rw$writing and dta_rw$level 
## 
##   1      2     
## 2 <2e-16 -     
## 3 <2e-16 <2e-16
## 
## P value adjustment method: BH

The same: all pairwise comparisons are significant. Benjamini-Hochberg controls the false discovery rate rather than the FWER, which gives it a better balance of error control and power than LSD or Bonferroni.

In my experience, if you had to pick one post-hoc test that seems to do well under the most conditions, you would pick the Tukey test. But it can’t be done with pairwise.t.test. Thankfully, there’s a nice package, multcomp, that does Tukey and a lot more. The syntax is weird, so if you do a Tukey post-hoc test, just copy and paste this code and change output1 to whatever you named the output from your aov call.

library(multcomp)
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
ph <- glht(output1, linfct = mcp(level = "Tukey"))

summary(ph)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = writing ~ level, data = dta_rw)
## 
## Linear Hypotheses:
##            Estimate Std. Error t value Pr(>|t|)    
## 2 - 1 == 0   2.9356     0.1905   15.41   <2e-16 ***
## 3 - 1 == 0   6.1186     0.1768   34.60   <2e-16 ***
## 3 - 2 == 0   3.1830     0.1545   20.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

The first column shows which levels are being compared (with the null hypothesis that the difference in means is 0). The second column is the observed difference in the sample means. The third column is the estimated SE of that difference. The fourth column is the t-value for the mean difference. The fifth column is the p-value after the Tukey adjustment.

We can also calculate confidence intervals for the difference in means in each comparison. Here is how you get them from the multcomp package.

confint(ph)
## 
##   Simultaneous Confidence Intervals
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = writing ~ level, data = dta_rw)
## 
## Quantile = 2.3424
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##            Estimate lwr    upr   
## 2 - 1 == 0 2.9356   2.4894 3.3817
## 3 - 1 == 0 6.1186   5.7044 6.5328
## 3 - 2 == 0 3.1830   2.8212 3.5448
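
As an aside, base R can produce Tukey comparisons and family-wise intervals directly from the aov object via TukeyHSD, with no extra package; the numbers may differ very slightly from multcomp’s, since glht uses a multivariate-t single-step method:

# Base-R alternative: Tukey HSD differences with 95% family-wise CIs
TukeyHSD(output1, conf.level = 0.95)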

When homogeneity of variance is violated, research suggests that Dunnett’s T3 is the best post-hoc test. It is an uncommon post-hoc test and requires another package, DescTools. One caveat, visible in the output below: DescTools’s DunnettTest runs Dunnett’s many-to-one test, which compares each group against a control (here, level 1) rather than making all pairwise comparisons; the T3 procedure itself is implemented elsewhere (for example, in the PMCMRplus package).

library(DescTools)
## 
## Attaching package: 'DescTools'
## The following object is masked from 'package:car':
## 
##     Recode
DunnettTest(x = dta_rw$writing, g = dta_rw$level)
## 
##   Dunnett's test for comparing several treatments with a control :  
##     95% family-wise confidence level
## 
## $`1`
##         diff   lwr.ci   upr.ci    pval    
## 2-1 2.935569 2.517813 3.353325 2.2e-15 ***
## 3-1 6.118608 5.730791 6.506424 < 2e-16 ***
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both comparisons against the control group (level 1) are significant, as evidenced by the fact that neither 95% confidence interval includes zero.

Alternatives to ANOVA when homogeneity of variance cannot be assumed

Welch’s one-way test is the most common version of ANOVA without an assumption of homogeneity of variance.

oneway.test(writing ~ level, data = dta_rw)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  writing and level
## F = 654.56, num df = 2.00, denom df = 580.57, p-value < 2.2e-16

The interpretation of the one-way test works exactly the same as for the regular ANOVA. As with Welch’s t-test vs. the independent-samples t-test, the price of not assuming homogeneity of variance is lower power.

The nonparametric version of ANOVA is the Kruskal-Wallis test. It makes:

  • no assumption of homogeneity of variance
  • no assumption of normality
  • no assumption of interval measurement, only ordinal

The price is lower power, sometimes substantially lower. Make your choice.

kruskal.test(writing ~ level, data = dta_rw)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  writing by level
## Kruskal-Wallis chi-squared = 597.47, df = 2, p-value < 2.2e-16

There are versions of post-hoc tests following the Kruskal-Wallis test. The most common (and, as far as I can tell from my limited review of the research, the best performing) is the Dunn test. It is available in the FSA package. Here, I’m doing the Dunn test with a Benjamini-Hochberg correction.

library(FSA)
## Registered S3 methods overwritten by 'FSA':
##   method       from
##   confint.boot car 
##   hist.boot    car
## ## FSA v0.9.6. See citation('FSA') if used in publication.
## ## Run fishR() for related website and fishR('IFAR') for related book.
## 
## Attaching package: 'FSA'
## The following object is masked from 'package:car':
## 
##     bootCase
## The following object is masked from 'package:sjstats':
## 
##     se
dunnTest(writing ~ level, data = dta_rw, method = "bh")
## Dunn (1964) Kruskal-Wallis multiple comparison
##   p-values adjusted with the Benjamini-Hochberg method.
##   Comparison          Z       P.unadj         P.adj
## 1      1 - 2  -9.423624  4.357803e-21  4.357803e-21
## 2      1 - 3 -23.278454 7.328991e-120 2.198697e-119
## 3      2 - 3 -15.028200  4.798695e-51  7.198042e-51