Hi readers, Greetings!
This R Markdown Notebook journals my learning and understanding about two-way ANOVA. The reading material are listed in reference list.
1. About two-way ANOVA
Two-way ANOVA (“analysis of variance”) is used to determine if there is
a statistically significant mean difference between groups that have
been split on two independent variables (called factors).
Two-way ANOVA test has three hypotheses:
Null hypotheses:
that the means of observations grouped by one factor are the same;
that the means of observations grouped by the other factor are the same;
and that there is no interaction between the two factors.
Before employing two-way ANOVA in analysis, we need to know:
When should we use a two-way ANOVA?
What are the assumptions to be met to perform a two-way ANOVA?
2. When should we use two-way
ANOVA?
We use two-way ANOVA to understand how two factors affect a response
variable and whether there is an interaction effect between the two
factors on the response variable, which tells whether the effects of one
factor depend on the other factor.
Practically, we make an interaction-plot (using interaction.plot())
to exploratory visualise if interaction exists among the interested
factors.
* if the lines are parallel >> no interaction.
* if the lines are intersecting >> interaction effect
exists.
Example: There is interaction between Supp(factor 1), Dose(factor 2) on len response.
For the consequence of overlooking interaction effect, see here: https://statisticsbyjim.com/regression/interaction-effects/
3. What are the assumptions for two-way
ANOVA?
To use a two-way ANOVA, our data should meet certain assumptions:
Homogeneity of variance (a.k.a. homoscedasticity)
The variation around the mean be should be
similar among all groups. Use a non-parametric alternative, like the
Kruskal-Wallis test, if data could not meet this
assumption.
Independence of observations
Each factors should independent on one
another (i.e. one should not cause the other). Use blocking variable and/or use a
repeated-measures ANOVA if data doesn’t meet this assumption
(i.e. set up experimental treatments within blocks)
Normally-distributed response variable
The response variable of model should
approximately follow a Gaussian distribution (bell curve) . Try a
data transformation if data doesn’t meet this assumption.
Reference:
1:
www.sthda.com
2:
www.r-bloggers.com
3:
www.statology.org
4:
www.scribbr.com
5:
www.guru99.com
6.
https://statistics.laerd.com/spss-tutorials/two-way-anova-using-spss-statistics.php
Case Study: ToothGrowth
Data
4. Data Preparation R
4.1 Read built-in data named ToothGrowth in R For more
info about the data, click: ToothGrowth
# read data
my_data <- ToothGrowth
Now that ToothGrowth data is imported and named as my_data.
4.2 Next, let’s display randomly selected 10 observation from my_data
set.seed(1234) #make it repeatable
# install.packages("dplyr")
# show a random sample
dplyr::sample_n(my_data, 10)
4.3 Let’s check data type
# Check the structure
str(my_data) # or using dplyr::glimpse(my_data)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
Note from above output, dose is not a factor yet.
4.4 Check no of levels in dose
table(my_data$dose)
##
## 0.5 1 2
## 20 20 20
there are three level: 0.5, 1, 2 in dose factor.
4.5 Convert dose to three factors
my_data$dose <- factor(my_data$dose,
levels = c(0.5, 1, 2),
labels = c("D0.5", "D1", "D2"))
Now that three levels in dose factor are labelled as “D0.5”, “D1”, “D2”.
4.6 Generate frequency table
table(my_data$supp, my_data$dose)
##
## D0.5 D1 D2
## OJ 10 10 10
## VC 10 10 10
We can see that the samples are equally distributed in each group. So, this will only require equal sample size two-way ANOVA. (unequal sample size will require Two-way ANOVA for unbalanced design which is not covered here.)
4.7 Let’s visualise data with interaction plot
#install.packages("ggpubr")
library("ggpubr")
# Line plots with multiple groups
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
# Add error bars: mean_se (other values include: mean_sd, mean_ci, median_iqr, ....)
ggline(my_data, x = "dose", y = "len", color = "supp",
add = c("mean_se", "dotplot"),
palette = c("#00AFBB", "#E7B800"))
5. Two-way ANOVA in R
The question to answer by Two-way ANOVA will be:
Based on the p-values and a significance level of 0.05 from the following ANOVA results, we can conclude that:
the p-value of supp is 0.000429 (significant), which implies that the levels of supp are associated with significant different tooth length.
the p-value of dose is < 2e-16 (significant), which indicates that the levels of dose are associated with significant different tooth length.
res.aov2 <- aov(len ~ supp + dose, data = my_data) # additive model
summary(res.aov2)
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205.4 205.4 14.02 0.000429 ***
## dose 2 2426.4 1213.2 82.81 < 2e-16 ***
## Residuals 56 820.4 14.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the p-values and a significance level of 0.05 from the following ANOVA results, we can conclude that:
# Two-way ANOVA with interaction effect
# These two calls are equivalent
res.aov3 <- aov(len ~ supp * dose, data = my_data)
res.aov3 <- aov(len ~ supp + dose + supp:dose, data = my_data) #interactive model: with synergistic effect
summary(res.aov3)
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205.4 205.4 15.572 0.000231 ***
## dose 2 2426.4 1213.2 92.000 < 2e-16 ***
## supp:dose 2 108.3 54.2 4.107 0.021860 *
## Residuals 54 712.1 13.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
6. Check of Assumption
Recall the major assumptions for Two-way ANOVA model:
Homogeneity of variance (a.k.a. homoscedasticity)
Independence of observations
Normally-distributed response variable
# a. Homogeneity of variances
plot(res.aov3, 1)
Points 32 and 23 are detected as outliers, which can affect normality and homogeneity of variance. Removing outliers can meet the test assumptions.
Next, we will do a visual check of residuals normality check using QQ-plot. QQ-plot is not an air-tight proof, it only provides a visual sense of normality. If all the points fall approximately along the reference line (straight line), we can assume normality.
# a. Normality check using QQ-plot
plot(res.aov3, 2)
Other than QQ-plot, it is encouraged to perform Shapiro Wilk test. If the p-value of Shapiro-Wilk test is greater than 0.05, it is failed to reject the null that data is normally distributed.
# a. Normality check using Shaporo Wilk Test
aov_residuals <- residuals(object = res.aov3)
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )
##
## Shapiro-Wilk normality test
##
## data: aov_residuals
## W = 0.98499, p-value = 0.6694
# c. Normallity of response variable
# Run Shapiro-Wilk test
shapiro.test(x = my_data$len )
##
## Shapiro-Wilk normality test
##
## data: my_data$len
## W = 0.96743, p-value = 0.1091
P-value of Shapiro Wilk Test are greater than 0.05 for both residuals (p-value = 0.6694) and len (p-value = 0.1091), hence, it is failed to reject the null that data is normally distributed.
7. Wrap up
In this notebook, we get to learn about:
When to use Two-way ANOVA to examine main effect and interaction effect between two factors.
How to perform Two-way ANOVA in R.
Assumption check using visual check and statistical check.
With this, we can for example, use a two-way ANOVA to ask whether there is an interaction between gender and educational level on quality of life (measured by questionairre eq-5d vs sf-36) amongst community-dwelling stroke survivor, where gender (males/females) and education level (undergraduate/postgraduate) are the independent variables (factors), and quality of life is the dependent variable.