MANOVA stands for Multivariate ANOVA, or Multivariate Analysis of
Variance. It's an extension of regular ANOVA. The general idea is the
same, but a MANOVA test includes at least two dependent variables and
analyzes how they differ across the groups (levels) of the independent
variable.
If you only have a single dependent variable, then there's no point
in using MANOVA. Regular ANOVA will suit you fine. For example, if you
want to see whether petal length differs across Iris species, there's
no point in using MANOVA, as you only have a single dependent variable
(petal length). On the other hand, if you have data on both petal
length and petal width, then using MANOVA would be a wise thing to
do.
MANOVA in R uses Pillai's Trace test statistic by default, which is
converted to an approximate F-statistic when we want to check the
significance of the group mean differences. You can use other
statistics, such as Wilks' Lambda, Roy's Largest Root, or the
Hotelling-Lawley trace (we'll show how after running the test), but
Pillai's Trace is generally considered the most robust of the four.
Running a separate ANOVA test for each dependent variable inflates
the type I error rate – every additional test adds another chance of a
false positive. MANOVA provides a solution, as it captures group
differences based on the combined information of the multiple dependent
variables in a single test.
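Here's a quick back-of-the-envelope calculation (ours, not part of the test itself) showing the inflation with four dependent variables tested separately at the usual 0.05 significance level:
alpha <- 0.05
n_tests <- 4
# Probability of at least one false positive across 4 independent tests
1 - (1 - alpha)^n_tests
# [1] 0.1854938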
Because MANOVA uses more than one dependent variable, the null and
the alternative hypotheses are slightly changed:
H0: Group mean vectors are the same for all groups or
they don’t differ significantly.
H1: At least one of the group mean vectors is different
from the rest.
MANOVA in R won’t tell you which group differs from the rest, but
that’s easy to determine via a post-hoc test. We’ll use Linear
Discriminant Analysis (LDA) to answer this question later.
As you would expect, the MANOVA statistical test has many strict
assumptions. The ones from ANOVA carry over – independence of
observations and homogeneity of variances – and some new ones are
introduced:
Multivariate normality – The dependent variables should jointly
follow a multivariate normal distribution within each group. Use a
multivariate Shapiro-Wilk test to verify, as shown in the sketch after
this list.
Linearity – Each pair of dependent variables should have a linear
relationship within each group (factor) of the independent variable.
No multicollinearity – The dependent variables shouldn't be too
highly correlated with each other.
No outliers – There shouldn't be any outliers in the dependent
variables.
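If you want to check the multivariate normality assumption in practice, here's a minimal sketch using the mshapiro.test() function from the mvnormtest package (the package choice is our assumption – other packages offer equivalent tests). It runs a multivariate Shapiro-Wilk test per species; P-values above 0.05 indicate no evidence against normality:
library(mvnormtest)
# mshapiro.test() expects variables in rows and observations in columns
for (species in unique(iris$Species)) {
  group <- t(as.matrix(iris[iris$Species == species, 1:4]))
  print(mshapiro.test(group))
}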
As with most things in R, performing a MANOVA statistical test
boils down to a single function call. But we'll need a dataset first.
The Iris dataset is well-known among the data science crowd, and it is
built into R:
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
As MANOVA is concerned with differences in group means, let's
visualize a boxplot for every dependent variable. There will be 4
plots in total arranged in a 2×2 grid – one per dependent variable,
each showing a dedicated boxplot for every flower species:
library(ggplot2)
library(gridExtra)
box_sl <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
theme(legend.position = "top")
box_sw <- ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
geom_boxplot() +
theme(legend.position = "top")
box_pl <- ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
geom_boxplot() +
theme(legend.position = "top")
box_pw <- ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) +
geom_boxplot() +
theme(legend.position = "top")
grid.arrange(box_sl, box_sw, box_pl, box_pw, ncol = 2, nrow = 2)
We can now perform a one-way MANOVA in R. The best practice is to
separate the dependent variables from the independent variable before
calling the manova() function. Once the test is done, you can print its
summary:
dependent_vars <- cbind(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
independent_var <- iris$Species
manova_model <- manova(dependent_vars ~ independent_var, data = iris)
summary(manova_model)
Df Pillai approx F num Df den Df Pr(>F)
independent_var 2 1.1919 53.466 8 290 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
By default, MANOVA in R uses Pillai's Trace test statistic. The
P-value is practically zero, which means we can safely reject the null
hypothesis in favor of the alternative one – at least one group mean
vector differs from the rest.
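Pillai isn't your only option, either. The summary() function for MANOVA models accepts a test argument with values "Pillai", "Wilks", "Hotelling-Lawley", and "Roy", so rerunning the check with a different statistic is a one-liner:
summary(manova_model, test = "Wilks")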
While we’re here, we could also measure the effect size. One metric
often used with MANOVA is Partial Eta Squared. It measures the effect
the independent variable has on the dependent variables. If the value is
0.14 or greater, we can say the effect size is large. Here’s how to
calculate it in R:
library(effectsize)
eta_squared(manova_model)
# Effect Size for ANOVA (Type I)
Parameter | Eta2 (partial) | 95% CI
-----------------------------------------------
independent_var | 0.60 | [0.54, 1.00]
- One-sided CIs: upper bound fixed at [1.00].
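That 0.60 is no accident – for a one-way MANOVA, the partial eta squared equals Pillai's Trace divided by the number of discriminant dimensions s = min(number of dependent variables, number of groups - 1) = 2, which gives 1.1919 / 2 ≈ 0.60.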
The P-Value is practically zero, and the Partial Eta Squared suggests
a large effect size – but which group or groups are different from the
rest? There’s no way to tell without a post-hoc test. We’ll use
Linear Discriminant Analysis (LDA), which finds a
linear combination of features that best separates two or more
groups.
By doing so, we'll be able to visualize a scatter plot showing the
two linear discriminants on the X and Y axes, with the points
color-coded to match our independent variable – the flower species.
You can implement Linear Discriminant Analysis in R using the lda()
function from the MASS package:
library(MASS)
iris_lda <- lda(independent_var ~ dependent_vars, CV = FALSE)
iris_lda
Call:
lda(independent_var ~ dependent_vars, CV = FALSE)
Prior probabilities of groups:
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
Group means:
dependent_vars1 dependent_vars2 dependent_vars3 dependent_vars4
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
Coefficients of linear discriminants:
LD1 LD2
dependent_vars1 0.8293776 0.02410215
dependent_vars2 1.5344731 2.16452123
dependent_vars3 -2.2012117 -0.93192121
dependent_vars4 -2.8104603 2.83918785
Proportion of trace:
LD1 LD2
0.9912 0.0088
Take a look at the coefficients to see how the dependent variables
are used to form the LDA decision rule. Rounded to two decimal
points, LD1 is calculated as:
LD1 = 0.83 * Sepal.Length + 1.53 * Sepal.Width - 2.20 * Petal.Length - 2.81 * Petal.Width
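As a quick sanity check – a sketch of ours, relying on the fact that iris is balanced and the priors are equal, so lda() centers the data at the overall column means – you can reproduce the discriminant scores manually from the coefficient matrix stored in iris_lda$scaling:
# Center the dependent variables and apply the LD coefficients
manual_scores <- scale(dependent_vars, center = TRUE, scale = FALSE) %*% iris_lda$scaling
# Should match the scores computed by predict()
all.equal(unname(manual_scores), unname(predict(iris_lda)$x))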
The snippet below uses the predict() function to get the linear
discriminants and combines them with our independent variable:
library(rmarkdown)
lda_df <- data.frame(
species = iris[, "Species"],
lda = predict(iris_lda)$x
)
paged_table(lda_df)
The final step in this post-hoc test is to visualize the above lda_df
as a scatter plot. Ideally, we should see one or multiple groups stand
out:
ggplot(lda_df) +
geom_point(aes(x = lda.LD1, y = lda.LD2, color = species), size = 4) +
theme_classic()
The setosa species is clearly separated from both virginica and
versicolor, while those two overlap much more. To summarize – the
group mean vector of the setosa class is significantly different from
the other group means, so it's safe to assume setosa was the main
driver behind rejecting the null hypothesis.