One-way ANOVA – comparing more than 2 means

This document was composed from Dr. Snopkowski’s ANTH 504 Week 8 lecture and Danielle Navarro’s 2021 Learning statistics with R Chapter 14.

• When we want to compare means we can use a t-test. This test has limitations:

• You can compare only 2 means: often we would like to compare means from 3 or more groups.

• Running many t-tests inflates the error rate.

1 vs 2, 1 vs 3, 2 vs 3

• ANOVA

• Compares several means.

• It is an extension of regression (the general linear model).

inflate type 1 error

Why Not Use Lots of t-Tests?

• If we want to compare several means why don’t we compare pairs of means with t-tests?

• Can’t look at several independent variables.

• Inflates the Type I error rate.

What Does ANOVA Tell Us?

• Null hyothesis:

• Like a t-test, ANOVA tests the null hypothesis that the means are the same.

mean of group 1 is equal to the mean of group 2 which is also equal to the mean of group 3.

• Experimental hypothesis:

• The means differ.

• ANOVA is an omnibus test: Any of the means are different.

• It tests for an overall difference between groups.

• It tells us that the group means are different.

• It doesn’t tell us exactly which means differ.

Theory of ANOVA

• We calculate how much variability there is between scores.

• Total Sum of Squares (SST).

• We then calculate how much of this variability can be explained by the model we fit to the data

• How much variability is due to the experimental manipulation, model sum of squares (SSM or SSb) between-group sum of squares

• . . . and how much cannot be explained

• How much variability is due to individual differences in performance, residual sum of squares (SSR or SSW).

Within-group sum of squares: how different each individual person is from their own group mean.

• Instead of comparing individuals to the average of all people in the experiment, we’re only comparing them to those people in the the same group.

Within Group Variation

“If the between-group variation is SSb is large relative to the within-group variation SS w then we have reason to suspect that the population means for the different groups aren’t identical to each other.”

• We compare the amount of variability explained by the model (experiment), to the error in the model (individual differences)

• SSW + SSb = SST or SSM + SSR = SST

This ratio is called the F-ratio.

• If the model explains a lot more variability than it can’t explain, then the experimental manipulation has had a significant effect on the outcome (DV). (Reject the null)

Theory of ANOVA

• If the experiment is successful, then the model will explain more variance than it can’t

• SSM will be greater than SSR

ANOVA by Hand

• Testing the effects of puppy therapy on happiness using three groups:

• Control (no puppy)

• 15 minutes

• 30 minutes

• The outcome/dependent variable (DV) was an objective measure of happiness

The Data

“how much variation we have among this group”

Top left SST: How far away is each point away from the grand mean?

Each group has it’s own mean. How much of of the total variation does that explain? How much of the total variation that exists can be represented by these different means?

Residual: how much does each data point vary from it’s group mean? That is what is left over after we include these group means.

If the group means are a little bit better at representing this data than how much variation still exists, we should get a reasonable p-value and that will say that our group means are not equal. There is at least one mean that is not equal.

Calculate the Mean Squared Error

the variation explained by the model compared to the variation explained by the unsystematic factors.

Calculate the F-Ratio

Standard ANOVA table

Assumptions of ANOVA

  1. Homogeneity of variance across groups:
  • The variance of the groups are approximately the same.

  • Levene’s test leveneTest() with ~

  1. Observations should be independent

  2. The dependent variable needs to be measured on at least interval scale.

*Nothing to test here. You just can’t do it unless you have continuous data.

  1. Within groups, the distributions are normal.
  • Shapiro-Wilk Test

Let’s enter the data into R

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
ind = rep(c("control", "min15", "min30"), each = 5)
dep = c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
data <- bind_rows("puppy" = factor(ind), "happy" = dep)
data
## # A tibble: 15 × 2
##    puppy   happy
##    <fct>   <dbl>
##  1 control     3
##  2 control     2
##  3 control     1
##  4 control     1
##  5 control     4
##  6 min15       5
##  7 min15       2
##  8 min15       4
##  9 min15       2
## 10 min15       3
## 11 min30       7
## 12 min30       4
## 13 min30       5
## 14 min30       3
## 15 min30       6
sapply(data, class)
##     puppy     happy 
##  "factor" "numeric"

First step is to do calculate some descriptive statistics or visualize that data. Can you create boxplots?

Boxplots

data %>% ggplot(aes(puppy, happy)) + 
  geom_boxplot()

We could calculate the means for each as well.

Test Assumptions

There are three key assumptions: normality, homogeneity of variance and independence.

  1. Homogeneity of variance across groups
  • There is one value for the population standard deviation (i.e.,σ), rather than allowing each group to have it’s own value (i.e.,σk). This is referred to as the homogeneity of variance (sometimes called homoscedasticity) assumption. ANOVA assumes that the population standard deviation is the same for all groups. Remember to do dep ~ ind.
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
leveneTest(data$happy ~ as.factor(data$puppy))
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  0.1176   0.89
##       12

The p-value is greater than 0.05 (p-value = 0.89), therefore we do not have evidence to conclude that the variances are different.

  1. Observations should be independent
  • The independence assumption means is that, knowing one residual tells you nothing about any other residual. “All of the ϵ~ik~ values are assumed to have been generated without any “regard for” or “relationship to” any of the other ones. There’s not an obvious or simple way to test for this, but there are some situations that are clear violations of this: for instance, if you have a repeated-measures design, where each participant in your study appears in more than one condition, then independence doesn’t hold; there’s a special relationship between some observations… namely those that correspond to the same person! When that happens, you need to use something like repeated measures ANOVA. I don’t currently talk about repeated measures ANOVA in this book, but it will be included in later versions.”
  1. The dependent variable needs to be measured on at least interval scale.

  2. Within groups, the distributions are normal.

  • The residuals are assumed to be normally distributed. We assess this by looking at QQ plots or running a Shapiro-Wilk test.
control <- data %>% filter(puppy == "control")
shapiro.test(control$happy)
## 
##  Shapiro-Wilk normality test
## 
## data:  control$happy
## W = 0.90202, p-value = 0.4211
hist(control$happy)

The p-value is greater than 0.05 (p-value = 0.4211), therefore we do not have evidence to conclude that the data is not normal. We fail to reject the null that the data is normal.

min_15 <- data %>% filter(puppy == "min15")
shapiro.test(min_15$happy)
## 
##  Shapiro-Wilk normality test
## 
## data:  min_15$happy
## W = 0.90202, p-value = 0.4211
hist(min_15$happy)

min_30 <- data %>% filter(puppy == "min30")
shapiro.test(min_30$happy)
## 
##  Shapiro-Wilk normality test
## 
## data:  min_30$happy
## W = 0.98676, p-value = 0.9672
hist(min_30$happy)

Or, you can use a for loop like this:

# Loop through each type of the "puppy" variable and create a histogram of the "happy" column
for (pup_play_time in levels(data$puppy)) {
  puppy_subset <- subset(data, puppy == pup_play_time)
  hist(puppy_subset$happy)
}

# for loop through each level of the "puppy" variable and run shapiro.test() on the "happy" column
for (pup_play_time in levels(data$puppy)) {
  puppy_subset <- subset(data, puppy == pup_play_time)
  test_result <- shapiro.test(puppy_subset$happy)
  cat(paste("Puppy_time:", pup_play_time, "- W = ", round(test_result$statistic, 3), ", p-value = ", round(test_result$p.value, 3), "\n"))
}
## Puppy_time: control - W =  0.902 , p-value =  0.421 
## Puppy_time: min15 - W =  0.902 , p-value =  0.421 
## Puppy_time: min30 - W =  0.987 , p-value =  0.967

ANOVA in R

Used aov() to calculate the ANOVA. There are several arguments to the aov() function, but the only two that we’re interested in are formula and data.It’s generally a good idea to create a variable like my.anova that stores the output of the aov() function… because later on, you can use my.anova as an input to lots of other functions

anovaTest <- aov(happy ~ puppy, data=data)
summary(anovaTest)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## puppy        2  20.13  10.067   5.119 0.0247 *
## Residuals   12  23.60   1.967                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value = 0.0247. The * tells you that the p-value is significant. The F-value is 5.119. The degrees of freedom is 2.

We run the ANOVA using the funcition aov() and store the results as anovaTest, then further test the assumptions using these results. The following is a method of testing the normality assumption after running the aov() function.

# extract the residuals
my.anova.residuals <- residuals( object = anovaTest )

# run Shapiro-Wilk test
shapiro.test( x = my.anova.residuals ) 
## 
##  Shapiro-Wilk normality test
## 
## data:  my.anova.residuals
## W = 0.91669, p-value = 0.1715
# plot a histogram
hist( x = my.anova.residuals )

# draw a QQ plot 
qqnorm( y = my.anova.residuals )         

The histogram is not very normal looking but let’s just pretend for this example. The QQ plot looks normal. This is supported by the results of our Shapiro-Wilk test (W= 0.91669 , p-value = 0.1715 ) which finds no indication that normality is violated.

The Kruskal-Wallis rank sum test is a non-parametric test for three of more groups. This is described in a later section of these notes.

Why Use Follow-Up Tests?

We need to know which group is different.

• The F-ratio tells us only that the experiment was successful

• i.e. group means were different

• It does not tell us specifically which group means differ from which.

• We need additional tests to find out where the group differences lie.

How?

• Multiple t-tests

• We saw earlier that this is a bad idea

• Orthogonal contrasts/comparisons

• Hypothesis driven

• Planned a priori

• Post hoc tests

• Not planned (no hypothesis)

• Compare all pairs of means

• Trend analysis

Usually the type of follow-up test I use!

Post Hoc Tests

• Compare each mean against all others.

• In general terms they use a stricter criterion to accept an effect as significant.

• Hence, control the familywise error rate.

• Simplest example is the Bonferroni method:

Correction for Multiple Comparisons

What corrections do you know?

Bonferroni correction: divide alpha by number of comparisons: .05/3 = 0.0167

pairwise.t.test(data$happy, data$puppy, p.adj="bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$happy and data$puppy 
## 
##       control min15
## min15 0.845   -    
## min30 0.025   0.196
## 
## P value adjustment method: bonferroni
library(lsr)
posthocPairwiseT(anovaTest, p.adjust.method="bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  happy and puppy 
## 
##       control min15
## min15 0.845   -    
## min30 0.025   0.196
## 
## P value adjustment method: bonferroni

Bonferroni is a very conservative test – generally, people like to use other tests.

Holm correction: pretend that you’re doing the tests sequentially; starting with the smallest p-value and moving onto the largest one. See details in your book. It has the same Type 1 error rate, but it has more power than Bonferroni.

pairwise.t.test(data$happy, data$puppy, p.adj="holm")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$happy and data$puppy 
## 
##       control min15
## min15 0.282   -    
## min30 0.025   0.130
## 
## P value adjustment method: holm
posthocPairwiseT(anovaTest, p.adjust.method="holm")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  happy and puppy 
## 
##       control min15
## min15 0.282   -    
## min30 0.025   0.130
## 
## P value adjustment method: holm

Holm correction is the default one used by pairwise.t.test() and posthocPairwiseT() and adding p.adjust.method="holm" is not necessary.

Post hoc tests using a Holm correction for multiple comparisons were conducted to analyze the pairwise differences between the “happy” and “puppy” groups. The results revealed that there was no significant difference in the mood change between the “happy” and “puppy” conditions when comparing 15 minutes and control(p = 0.282), or when comparing min30 and min15 (p = 0.130). There was a significant difference inthe mood change between the “happy” and “puppy” conditions when comparing min30 and control (p = 0.025).

Kruskal–Wallis test

Differences between several independent groups

• The Kruskal–Wallis test (Kruskal & Wallis, 1952) is the non-parametric counterpart of the one-way independent ANOVA.

• The theory is very similar to that of the Mann–Whitney (and Wilcoxon rank-sum) test:

• It is based on ranked data.

• The sum of ranks for each group is denoted by Ri (where i is used to denote the particular group).

Example

• Does eating soya affect your sperm count?

• Variables

• Outcome: sperm (millions)

• IV: Number of soya meals per week

• No Soya meals

• 1 Soya meal

• 4 soya meals

• 7 soya meals

• Participants

• 80 males (20 in each group)

Data for the Soya Example with Ranks

Kruskal-Wallis test Theory

• Once the sum of ranks has been calculated for each group. The test statistic, H, is calculated as:

• Ri is the sum of ranks for each group.

• N is the total sample size.

• ni is the sample size of a particular group.

Non-parametric ANOVA: Kruskal -Wallis

If the assumptions of ANOVA are not met, run the Kruskal-Wallis test

kruskal.test(happy ~ puppy, data=data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  happy by puppy
## Kruskal-Wallis chi-squared = 6.2, df = 2, p-value = 0.04505
pairwise.wilcox.test(data$happy, data$puppy, p.adjust.method="holm", exact=FALSE)
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  data$happy and data$puppy 
## 
##       control min15
## min15 0.29    -    
## min30 0.11    0.23 
## 
## P value adjustment method: holm

Communication

In an experiment to examine the effect of exposure to puppies on happiness, 15 people were assigned to 3 conditions: no exposure to puppies (control), 15 minutes of puppies, and 30 minutes of puppies. The mean happiness was lowest for the control group (M=3.2, SD=1.3), higher for the 15-minute condition (M=3.2, SD=1.3) and highest for the 30-minute condition (M=5, SD=1.58). An ANOVA test to compare means across groups was conducted. There was a significant effect of puppy therapy on levels of happiness, F(2, 12)=5.12, p=0.025.Post hoc comparisons using the Holm test indicated that the mean score for the control condition was significantly different from the 30-minute Condition. However, the 15-minute condition did not significantly differ from the control condition and the 30-minute condition.

or

In a study to examine ATV rider (ATV_type) and the number of years the rider has been riding (Yrs_riding), 77 people where classified into one of three categories: Active, Casual, and Dedicated. The mean years riding was lowest for the Casual classification (M = 8.32, SD = 4.10), higher for Active classification (M = 15.16, SD = 4.15) and the highest for the Dedicated classification (M = 21.25, SD = 3.72). A Kruskal-Wallis rank sum test was conducted to examine the effect these drugs effect on mood gain. The main effect of the ATV type of rider was significant, K(2) = 51.014, p-value = 8.36e-12. I reject the null that the medians are equal in favor of the alternative. At least one ATV type of rider has a median difference from the other classifications. In other words, I have evidence to conclude that there are differences between the medians of the groups. Post-hoc pairwise comparisons using Wilcoxon rank sum test with continuity correction revealed that the median years riding for Dedicated was significantly higher than that for Casual (p = 5.5e-09) and Active (p = 6.5e-06). Also the median years riding for Active was significantly higher than that for Casual (p = 6.5e-06). These results suggest that there are significant differences in years riding among the three types of ATV riders.