This document was composed from Dr. Snopkowski's ANTH 504 Week 8 lecture and Chapter 14 of Danielle Navarro's 2021 Learning Statistics with R.
• When we want to compare means we can use a t-test, but this test has limitations:
• You can compare only 2 means; often we would like to compare means from 3 or more groups.
• Running many t-tests inflates the Type I error rate (e.g., with three groups: 1 vs 2, 1 vs 3, 2 vs 3).
• ANOVA
• Compares several means.
• It is an extension of regression (the general linear model).
• If we want to compare several means, why don't we just compare pairs of means with t-tests?
• We can't look at several independent variables.
• It inflates the Type I error rate.
What Does ANOVA Tell Us?
• Null hypothesis:
• Like a t-test, ANOVA tests the null hypothesis that the means are the same.
The mean of group 1 is equal to the mean of group 2, which is also equal to the mean of group 3: μ~1~ = μ~2~ = μ~3~.
• Experimental hypothesis:
• The means differ.
• ANOVA is an omnibus test: it tells us that at least one of the means is different.
• It tests for an overall difference between groups.
• It tells us that the group means are different.
• It doesn’t tell us exactly which means differ.
• We calculate how much variability there is between scores.
• Total Sum of Squares (SST).
• We then calculate how much of this variability can be explained by the model we fit to the data . . .
• How much variability is due to the experimental manipulation: the model sum of squares (SSM or SSb), also called the between-group sum of squares.
• . . . and how much cannot be explained.
• How much variability is due to individual differences in performance: the residual sum of squares (SSR or SSW).
• Within-group sum of squares: how different each individual person is from their own group mean.
• Instead of comparing individuals to the average of all people in the experiment, we're only comparing them to the people in the same group.
Within Group Variation
“If the between-group variation SS~b~ is large relative to the within-group variation SS~w~, then we have reason to suspect that the population means for the different groups aren’t identical to each other.”
• We compare the amount of variability explained by the model (experiment) to the error in the model (individual differences).
• SSb + SSW = SST (equivalently, SSM + SSR = SST).
• This ratio of explained to unexplained variability is called the F-ratio (written out below).
• If the model explains a lot more variability than it leaves unexplained, then the experimental manipulation has had a significant effect on the outcome (DV), and we reject the null.
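As formulas, this is (a sketch using the SS~b~/SS~w~ notation from the quote above, where Y~ik~ is the score of person i in group k, Ȳ~k~ is a group mean, Ȳ is the grand mean, n~k~ is a group's size, G is the number of groups, and N is the total sample size):

$$SS_T = \sum_{k=1}^{G} \sum_{i} (Y_{ik} - \bar{Y})^2 \qquad SS_w = \sum_{k=1}^{G} \sum_{i} (Y_{ik} - \bar{Y}_k)^2 \qquad SS_b = \sum_{k=1}^{G} n_k (\bar{Y}_k - \bar{Y})^2$$

$$F = \frac{SS_b / (G - 1)}{SS_w / (N - G)} = \frac{MS_b}{MS_w}$$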
Theory of ANOVA
• If the experiment is successful, then the model will explain more variance than it leaves unexplained.
• SSM will be greater than SSR
• Testing the effects of puppy therapy on happiness using three groups:
• Control (no puppy)
• 15 minutes
• 30 minutes
• The outcome/dependent variable (DV) was an objective measure of happiness
The Data
[Figure: partitioning the puppy data into total, model, and residual variation.]
Total (top left, SST): how far away is each point from the grand mean? This is how much variation we have overall.
Model: each group has its own mean. How much of the total variation do those means explain? How much of the total variation that exists can be represented by these different group means?
Residual: how much does each data point vary from its own group mean? That is what is left over after we include the group means.
If the group means represent the data better than the grand mean, relative to how much variation still exists within groups, we should get a small p-value, telling us that the group means are not all equal: there is at least one mean that differs. The F-ratio compares the variation explained by the model to the variation due to unsystematic factors.
Standard ANOVA table
Homogeneity of variance: the variances of the groups are approximately the same. Test with Levene's test, leveneTest(), using a formula of the form dep ~ ind.
Observations should be independent
The dependent variable needs to be measured on at least interval scale.
*Nothing to test here. You just can't run an ANOVA unless you have continuous data.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
ind <- rep(c("control", "min15", "min30"), each = 5)
dep <- c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
data <- tibble(puppy = factor(ind), happy = dep)
data
## # A tibble: 15 × 2
## puppy happy
## <fct> <dbl>
## 1 control 3
## 2 control 2
## 3 control 1
## 4 control 1
## 5 control 4
## 6 min15 5
## 7 min15 2
## 8 min15 4
## 9 min15 2
## 10 min15 3
## 11 min30 7
## 12 min30 4
## 13 min30 5
## 14 min30 3
## 15 min30 6
sapply(data, class)
## puppy happy
## "factor" "numeric"
The first step is to calculate some descriptive statistics or visualize the data. Can you create boxplots?
data %>% ggplot(aes(puppy, happy)) +
geom_boxplot()
We could calculate the means for each as well.
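For example, a quick dplyr summary (a sketch using the data object created above):
data %>%
  group_by(puppy) %>%
  summarise(mean = mean(happy), sd = sd(happy))
# means: control 2.2, min15 3.2, min30 5.0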
There are three key assumptions: normality, homogeneity of variance and independence.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
leveneTest(data$happy ~ as.factor(data$puppy))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.1176 0.89
## 12
The p-value is greater than 0.05 (p-value = 0.89), therefore we do not have evidence to conclude that the variances are different.
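You can also eyeball the group variances directly (a quick sketch with dplyr):
data %>%
  group_by(puppy) %>%
  summarise(variance = var(happy))
# 1.7, 1.7, 2.5: quite similar, consistent with Levene's test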
“…the ϵ~ik~ values are assumed to have been generated without any
“regard for” or “relationship to” any of the other ones. There’s not an
obvious or simple way to test for this, but there are some situations
that are clear violations of this: for instance, if you have a
repeated-measures design, where each participant in your study appears
in more than one condition, then independence doesn’t hold; there’s a
special relationship between some observations… namely those that
correspond to the same person! When that happens, you need to use
something like repeated measures ANOVA. I don’t currently talk about
repeated measures ANOVA in this book, but it will be included in later
versions.”
Normality: within groups, the distributions should be normal. We can test each group with a Shapiro-Wilk test.
control <- data %>% filter(puppy == "control")
shapiro.test(control$happy)
##
## Shapiro-Wilk normality test
##
## data: control$happy
## W = 0.90202, p-value = 0.4211
hist(control$happy)
The p-value is greater than 0.05 (p-value = 0.4211), therefore we do not have evidence to conclude that the data is not normal. We fail to reject the null that the data is normal.
min_15 <- data %>% filter(puppy == "min15")
shapiro.test(min_15$happy)
##
## Shapiro-Wilk normality test
##
## data: min_15$happy
## W = 0.90202, p-value = 0.4211
hist(min_15$happy)
min_30 <- data %>% filter(puppy == "min30")
shapiro.test(min_30$happy)
##
## Shapiro-Wilk normality test
##
## data: min_30$happy
## W = 0.98676, p-value = 0.9672
hist(min_30$happy)
Or, you can use a for loop like this:
# Loop through each level of the "puppy" variable and create a histogram of the "happy" column
for (pup_play_time in levels(data$puppy)) {
  puppy_subset <- subset(data, puppy == pup_play_time)
  hist(puppy_subset$happy)
}
# for loop through each level of the "puppy" variable and run shapiro.test() on the "happy" column
for (pup_play_time in levels(data$puppy)) {
  puppy_subset <- subset(data, puppy == pup_play_time)
  test_result <- shapiro.test(puppy_subset$happy)
  cat(paste("Puppy_time:", pup_play_time, "- W = ", round(test_result$statistic, 3), ", p-value = ", round(test_result$p.value, 3), "\n"))
}
## Puppy_time: control - W = 0.902 , p-value = 0.421
## Puppy_time: min15 - W = 0.902 , p-value = 0.421
## Puppy_time: min30 - W = 0.987 , p-value = 0.967
Use aov() to calculate the ANOVA. There are several arguments to the aov() function, but the only two that we're interested in are formula and data. It's generally a good idea to create a variable like my.anova that stores the output of the aov() function, because later on you can use my.anova as an input to lots of other functions.
anovaTest <- aov(happy ~ puppy, data=data)
summary(anovaTest)
## Df Sum Sq Mean Sq F value Pr(>F)
## puppy 2 20.13 10.067 5.119 0.0247 *
## Residuals 12 23.60 1.967
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value = 0.0247. The * tells you that the p-value is significant at the 0.05 level. The F-value is 5.119. The degrees of freedom are 2 (puppy) and 12 (residuals).
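To connect this table back to the theory above, you can reproduce the sums of squares by hand; a quick sketch using the same data object:
grand_mean <- mean(data$happy)
sst <- sum((data$happy - grand_mean)^2)      # total sum of squares
group_means <- ave(data$happy, data$puppy)   # each observation's own group mean
ssr <- sum((data$happy - group_means)^2)     # residual (within-group) SS: 23.6
ssm <- sst - ssr                             # model (between-group) SS: 20.13
(ssm / 2) / (ssr / 12)                       # F = (SSM/dfM) / (SSR/dfR) = 5.119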
We run the ANOVA using the function aov() and store the results as anovaTest, then further test the assumptions using these results. The following is a method of testing the normality assumption after running the aov() function.
# extract the residuals
my.anova.residuals <- residuals( object = anovaTest )
# run Shapiro-Wilk test
shapiro.test( x = my.anova.residuals )
##
## Shapiro-Wilk normality test
##
## data: my.anova.residuals
## W = 0.91669, p-value = 0.1715
# plot a histogram
hist( x = my.anova.residuals )
# draw a QQ plot
qqnorm( y = my.anova.residuals )
The histogram is not very normal looking, but let's just pretend for this example. The QQ plot looks normal. This is supported by the results of our Shapiro-Wilk test (W = 0.91669, p-value = 0.1715), which finds no indication that normality is violated.
The Kruskal-Wallis rank sum test is a non-parametric test for three or more groups. This is described in a later section of these notes.
We need to know which group is different.
• The F-ratio tells us only that the experiment was successful
• i.e. group means were different
• It does not tell us specifically which group means differ from which.
• We need additional tests to find out where the group differences lie.
• Multiple t-tests
• We saw earlier that this is a bad idea
• Orthogonal contrasts/comparisons
• Hypothesis driven
• Planned a priori
• Post hoc tests
• Not planned (no hypothesis)
• Compare all pairs of means
• Trend analysis
(Post hoc tests are usually the type of follow-up test I use!)
• Compare each mean against all others.
• In general terms they use a stricter criterion to accept an effect as significant.
• Hence, control the familywise error rate.
• The simplest example is the Bonferroni method:
What corrections do you know?
Bonferroni correction: divide alpha by number of comparisons:
.05/3 = 0.0167
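Equivalently, R can scale the p-values up instead of scaling alpha down, via the base function p.adjust(); a small sketch with made-up p-values:
p <- c(0.008, 0.013, 0.040)          # hypothetical raw p-values
p.adjust(p, method = "bonferroni")   # each p multiplied by 3, capped at 1: 0.024 0.039 0.120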
pairwise.t.test(data$happy, data$puppy, p.adj="bonf")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$happy and data$puppy
##
## control min15
## min15 0.845 -
## min30 0.025 0.196
##
## P value adjustment method: bonferroni
library(lsr)
posthocPairwiseT(anovaTest, p.adjust.method="bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: happy and puppy
##
## control min15
## min15 0.845 -
## min30 0.025 0.196
##
## P value adjustment method: bonferroni
Bonferroni is a very conservative correction; generally, people like to use other tests.
Holm correction: pretend that you're doing the tests sequentially, starting with the smallest p-value and moving on to the largest one. See details in your book. It has the same Type I error rate, but it has more power than Bonferroni.
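Using p.adjust() again with the same made-up p-values shows the sequential multipliers (3, 2, 1 from the smallest p-value up):
p <- c(0.008, 0.013, 0.040)    # hypothetical raw p-values
p.adjust(p, method = "holm")   # 0.008*3, 0.013*2, 0.040*1 -> 0.024 0.026 0.040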
pairwise.t.test(data$happy, data$puppy, p.adj="holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$happy and data$puppy
##
## control min15
## min15 0.282 -
## min30 0.025 0.130
##
## P value adjustment method: holm
posthocPairwiseT(anovaTest, p.adjust.method="holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: happy and puppy
##
## control min15
## min15 0.282 -
## min30 0.025 0.130
##
## P value adjustment method: holm
The Holm correction is the default used by pairwise.t.test() and posthocPairwiseT(), so adding p.adjust.method="holm" is not necessary.
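So, for example, this call reproduces the Holm-adjusted table above:
pairwise.t.test(data$happy, data$puppy)   # defaults to the Holm correction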
Post hoc tests using a Holm correction for multiple comparisons were conducted to analyze the pairwise differences in happiness between the puppy-time groups. The results revealed no significant difference in happiness between the 15-minute and control conditions (p = 0.282), or between the 30-minute and 15-minute conditions (p = 0.130). There was a significant difference in happiness between the 30-minute and control conditions (p = 0.025).
Differences between several independent groups
• The Kruskal–Wallis test (Kruskal & Wallis, 1952) is the non-parametric counterpart of the one-way independent ANOVA.
• The theory is very similar to that of the Mann–Whitney (and Wilcoxon rank-sum) test:
• It is based on ranked data.
• The sum of ranks for each group is denoted by Ri (where i is used to denote the particular group).
• Does eating soya affect your sperm count?
• Variables
• Outcome: sperm (millions)
• IV: Number of soya meals per week
• No soya meals
• 1 soya meal
• 4 soya meals
• 7 soya meals
• Participants
• 80 males (20 in each group)
• Once the sum of ranks has been calculated for each group, the test statistic, H, is calculated as:

$$H = \frac{12}{N(N+1)} \sum_{i} \frac{R_i^2}{n_i} - 3(N+1)$$

• R~i~ is the sum of ranks for each group.
• N is the total sample size.
• n~i~ is the sample size of a particular group.
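As a sanity check, H can be computed by hand in R for the puppy data (a sketch; note that kruskal.test() below also applies a tie-correction factor, so its statistic comes out slightly larger than this raw H when there are tied scores):
r  <- rank(data$happy)               # rank all scores, pooling the groups
N  <- length(r)
Ri <- tapply(r, data$puppy, sum)     # sum of ranks per group
ni <- tapply(r, data$puppy, length)  # group sizes
12 / (N * (N + 1)) * sum(Ri^2 / ni) - 3 * (N + 1)   # raw H; about 6.05 here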
If the assumptions of ANOVA are not met, run the Kruskal-Wallis test:
kruskal.test(happy ~ puppy, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: happy by puppy
## Kruskal-Wallis chi-squared = 6.2, df = 2, p-value = 0.04505
pairwise.wilcox.test(data$happy, data$puppy, p.adjust.method="holm", exact=FALSE)
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: data$happy and data$puppy
##
## control min15
## min15 0.29 -
## min30 0.11 0.23
##
## P value adjustment method: holm
In an experiment to examine the effect of exposure to puppies on happiness, 15 people were assigned to 3 conditions: no exposure to puppies (control), 15 minutes of puppies, and 30 minutes of puppies. The mean happiness was lowest for the control group (M=2.2, SD=1.3), higher for the 15-minute condition (M=3.2, SD=1.3), and highest for the 30-minute condition (M=5, SD=1.58). An ANOVA test to compare means across groups was conducted. There was a significant effect of puppy therapy on levels of happiness, F(2, 12)=5.12, p=0.025. Post hoc comparisons using the Holm test indicated that the mean score for the control condition was significantly different from the 30-minute condition. However, the 15-minute condition did not significantly differ from the control condition or the 30-minute condition.
or
In a study to examine ATV rider type (ATV_type) and the number of years the rider has been riding (Yrs_riding), 77 people were classified into one of three categories: Active, Casual, and Dedicated. The mean years riding was lowest for the Casual classification (M = 8.32, SD = 4.10), higher for the Active classification (M = 15.16, SD = 4.15), and highest for the Dedicated classification (M = 21.25, SD = 3.72). A Kruskal-Wallis rank sum test was conducted to examine the effect of ATV rider type on years riding. The main effect of ATV rider type was significant, K(2) = 51.014, p-value = 8.36e-12. I reject the null that the medians are equal in favor of the alternative: at least one ATV rider type has a median that differs from the other classifications. In other words, I have evidence to conclude that there are differences between the medians of the groups. Post-hoc pairwise comparisons using the Wilcoxon rank sum test with continuity correction revealed that the median years riding for Dedicated was significantly higher than that for Casual (p = 5.5e-09) and Active (p = 6.5e-06). Also, the median years riding for Active was significantly higher than that for Casual (p = 6.5e-06). These results suggest that there are significant differences in years riding among the three types of ATV riders.