Welcome to this R demo session! Here, I will demonstrate how to use R to
library(ggplot2)
library(tidyverse) # data manipulation and visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggpubr) # reating easily publication ready plots
library(rstatix) # for easy pipe-friendly statistical analyses
##
## Attaching package: 'rstatix'
##
## The following object is masked from 'package:stats':
##
## filter
library(car) # for MANOVA analysis
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(broom) # for printing a nice summary of statistical tests as data frames
We’ll use the built-in R dataset iris
. Select columns of
interest:
iris2 <- iris %>%
select(Sepal.Length, Petal.Length, Species) %>%
add_column(id = 1:nrow(iris), .before = 1)
head(iris2)
## id Sepal.Length Petal.Length Species
## 1 1 5.1 1.4 setosa
## 2 2 4.9 1.4 setosa
## 3 3 4.7 1.3 setosa
## 4 4 4.6 1.5 setosa
## 5 5 5.0 1.4 setosa
## 6 6 5.4 1.7 setosa
The R code below creates a merged box plots of
Sepal.Length
and Petal.Length
by
Species
groups.
iris2_long <- iris2 %>%
# Use the 'pivot_longer' function to reshape the dataframe, selecting specific columns to pivot
pivot_longer(cols = c(Sepal.Length, Petal.Length), # Specify the columns to be pivoted into long format
names_to = "Variable", # Rename the new column holding the original column names as "Variable"
values_to = "Value") # Rename the new column holding the values from the original columns as "Value"
ggplot(data = iris2_long, aes(x = as.factor(Species),
y = Value)) +
geom_boxplot(aes(color = Variable),
position = position_dodge(width = 0)) + # add the boxplot
geom_point(aes(color = Variable),position = position_jitter(width = .1), alpha = .08, size = 2) + # add individual data points
stat_summary(fun = mean,
geom = "point",
aes(color = Variable)) + # add the mean as a point
stat_summary(fun = mean,
geom = "line",
aes(color = Variable)) + # add the line between groups
stat_summary(fun.data = mean_cl_boot,
geom = "errorbar",
width = 0.3,
aes(color = Variable)) + # add error bars
labs(x = "Proficiency level",
y = "Writing scores") + # rename x- and y-axis
scale_color_brewer(palette = "Set1")
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
Compute summary statistics (mean, SD) by groups for each outcome variable:
iris2 %>%
group_by(Species) %>%
get_summary_stats(Sepal.Length, Petal.Length, type = "mean_sd")
## # A tibble: 6 × 5
## Species variable n mean sd
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 setosa Sepal.Length 50 5.01 0.352
## 2 setosa Petal.Length 50 1.46 0.174
## 3 versicolor Sepal.Length 50 5.94 0.516
## 4 versicolor Petal.Length 50 4.26 0.47
## 5 virginica Sepal.Length 50 6.59 0.636
## 6 virginica Petal.Length 50 5.55 0.552
Rule of thumb: the n in each cell > the number of outcome variables.
iris2 %>%
group_by(Species) %>%
summarise(N = n())
## # A tibble: 3 × 2
## Species N
## <fct> <int>
## 1 setosa 50
## 2 versicolor 50
## 3 virginica 50
As the table above shows 50 observations per group, the assumption of adequate sample size is satisfied.
Each subject should belong to only one group. There is no relationship between the observations in each group. Having repeated measures for the same participants is not allowed. The selection of the sample should be completely random.
It’s hard to evaluate whether the assumption is met as we are not sure about how the data was collected.
Univariate outliers can be easily identified using box plot methods,
implemented in the r function identify_outliers()
[in the
rstatix
package].
Group the data by Species
and then, identify outliers in
the Sepal.Length
variable:
iris2 %>%
group_by(Species) %>%
identify_outliers(Sepal.Length)
## # A tibble: 1 × 6
## Species id Sepal.Length Petal.Length is.outlier is.extreme
## <fct> <int> <dbl> <dbl> <lgl> <lgl>
## 1 virginica 107 4.9 4.5 TRUE FALSE
Group the data by Species
and then, identify outliers in
the Petal.Length
variable:
iris2 %>%
group_by(Species) %>%
identify_outliers(Petal.Length)
## # A tibble: 5 × 6
## Species id Sepal.Length Petal.Length is.outlier is.extreme
## <fct> <int> <dbl> <dbl> <lgl> <lgl>
## 1 setosa 14 4.3 1.1 TRUE FALSE
## 2 setosa 23 4.6 1 TRUE FALSE
## 3 setosa 25 4.8 1.9 TRUE FALSE
## 4 setosa 45 5.1 1.9 TRUE FALSE
## 5 versicolor 99 5.1 3 TRUE FALSE
There were no univariate extreme outliers in the
Sepal.Length
and Petal.length
variable, as
assessed by box plot methods.
Note that, in the situation where you have extreme outliers, this can be due to: data entry errors, measurement errors or unusual values.
Yo can include the outlier in the analysis anyway if you do not believe the result will be substantially affected. This can be evaluated by comparing the result of the MANOVA with and without the outlier. Remember to report in your written results section any decisions you make regarding any outliers you find.
Multivariate outliers are data points that have an unusual combination of values on the outcome variables.
In MANOVA setting, the Mahalanobis distance is generally used to detect multivariate outliers. The distance tells us how far an observation is from the center of the cloud, taking into account the shape (covariance) of the cloud as well.
The function mahalanobis_distance()
[in the
rstatix
package] can be easily used to compute the
Mahalanobis distance and to flag multivariate outliers. Read more in the
documentation of the function.
This metric needs to be calculated by groups:
# Compute distance by groups and filter outliers
# Use -id to omit the id column in the computation
iris2 %>%
group_by(Species) %>%
mahalanobis_distance(-id) %>%
filter(is.outlier == TRUE) %>%
as.data.frame()
## [1] id Sepal.Length Petal.Length mahal.dist is.outlier
## <0 rows> (or 0-length row.names)
There were no multivariate outliers in the data, as assessed by Mahalanobis distance (p > 0.001).
If you have multivariate outliers, you could consider running MANOVA before and after removing the outlier to check whether or not their presence alter the results. You should report your final decision.
The normality assumption can be checked by computing Shapiro-Wilk test for each outcome variable at each level of the grouping variable. If the data is normally distributed, the p-value should be greater than 0.05.
iris2 %>%
group_by(Species) %>%
shapiro_test(Sepal.Length, Petal.Length) %>%
arrange(variable)
## # A tibble: 6 × 4
## Species variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 setosa Petal.Length 0.955 0.0548
## 2 versicolor Petal.Length 0.966 0.158
## 3 virginica Petal.Length 0.962 0.110
## 4 setosa Sepal.Length 0.978 0.460
## 5 versicolor Sepal.Length 0.978 0.465
## 6 virginica Sepal.Length 0.971 0.258
Sepal.Length and Petal.length were normally distributed for each Species groups, as assessed by Shapiro-Wilk’s test (p > 0.05).
You can also create QQ plot for each group. QQ plot draws the correlation between a given data and the normal distribution.
# QQ plot of Sepal.Length
ggqqplot(iris2, "Sepal.Length", facet.by = "Species",
ylab = "Sepal Length", ggtheme = theme_bw())
# QQ plot of Petal.Length
ggqqplot(iris2, "Petal.Length", facet.by = "Species",
ylab = "Petal Length", ggtheme = theme_bw())
All the points fall approximately along the reference line, for each group. So we can assume normality of the data.
Note that, if your sample size is greater than 50, the normal QQ plot is preferred because at larger sample sizes the Shapiro-Wilk test becomes very sensitive even to a minor deviation from normality.
In the situation where the assumptions are not met, you could consider running MANOVA on the data after transforming the outcome variables. You can also perform the test regardless as MANOVA is fairly robust to deviations from normality as long as you have adequate sample size.
The R function mshapiro_test()
[in the
rstatix
package] can be used to perform the Shapiro-Wilk
test for multivariate normality.
iris2 %>%
select(Sepal.Length, Petal.Length) %>%
mshapiro_test()
## # A tibble: 1 × 2
## statistic p.value
## <dbl> <dbl>
## 1 0.995 0.855
The test is not significant (p > 0.05), so we can assume multivariate normality.
The dependent (outcome) variables cannot be too correlated to each other. Ideally the correlation between the outcome variables should be moderate, not too high. A correlation above 0.9 is an indication of multicollinearity, which is problematic for MANOVA.
In other hand, if the correlation is too low, you should consider running separate one-way ANOVA for each outcome variable.
Compute pairwise Pearson correlation coefficients between the outcome
variable. In the following R code, we’ll use the function
cor_test()
[rstatix package]. If you have more than two
outcome variables, consider using the function
cor_mat()
:
iris2 %>% cor_test(Sepal.Length, Petal.Length)
## # A tibble: 1 × 8
## var1 var2 cor statistic p conf.low conf.high method
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Sepal.Length Petal.Length 0.87 21.6 1.04e-47 0.827 0.906 Pearson
There was no multicollinearity, as assessed by Pearson correlation (r = 0.87, p < 0.0001).
In the situation, where you have multicollinearity, you could consider removing one of the outcome variables that is highly correlated.
The pairwise relationship between the outcome variables should be
linear for each group. This can be checked visually by creating a
scatter plot matrix using the R function ggpairs()
[GGally
package]. In our example, we have only one pair:
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
results <- iris2 %>%
select(Sepal.Length, Petal.Length, Species) %>%
group_by(Species) %>%
doo(~ggpairs(.) + theme_bw(), result = "plots")
results
## # A tibble: 3 × 2
## Species plots
## <fct> <list>
## 1 setosa <gg>
## 2 versicolor <gg>
## 3 virginica <gg>
# Show the plots
results$plots
## [[1]]
##
## [[2]]
##
## [[3]]
There was a linear relationship between Sepal.Length
and
Petal.Length
in each Species
group, as
assessed by scatter plot.
In the situation, where you detect non-linear relationships, You can:
The Levene’s test can be used to test the equality of variances between groups. Non-significant values of Levene’s test indicate equal variance between groups.
For each of the outcome variables, the one-way MANOVA assumes that
there are equal variances between groups. This can be checked using the
Levene’s test of equality of variances. Key R function:
levene_test()
[rstatix
package].
Procedure:
iris2 %>%
gather(key = "variable", value = "value", Sepal.Length, Petal.Length) %>%
group_by(variable) %>%
levene_test(value ~ Species)
## # A tibble: 2 × 5
## variable df1 df2 statistic p
## <chr> <int> <int> <dbl> <dbl>
## 1 Petal.Length 2 147 19.5 0.0000000313
## 2 Sepal.Length 2 147 6.35 0.00226
The Levene’s test is significant (p < 0.05), so there was no homogeneity of variances.
Note that, if you do not have homogeneity of variances, you can try to transform the outcome (dependent) variable to correct for the unequal variances. Alternatively, you can continue, but accept a lower level of statistical significance (alpha level) for your MANOVA result. Additionally, any follow-up univariate ANOVAs will need to be corrected for this violation (i.e., you will need to use different post-hoc tests).
The Box’s M Test can be used to check the equality of covariance between the groups. This is the equivalent of a multivariate homogeneity of variance. This test is considered as highly sensitive. Therefore, significance for this test is determined at alpha = 0.001.
This can be evaluated using the Box’s M-test implemented in the
rstatix
package.
box_m(iris2[, c("Sepal.Length", "Petal.Length")], iris2$Species)
## # A tibble: 1 × 4
## statistic p.value parameter method
## <dbl> <dbl> <dbl> <chr>
## 1 58.4 9.62e-11 6 Box's M-test for Homogeneity of Covariance Matri…
The test is statistically significant (i.e., p < 0.001), so the data have violated the assumption of homogeneity of variance-covariance matrices.
Note that, if you have balanced design (i.e., groups with similar sizes), you don’t need to worry too much about violation of the homogeneity of variances-covariance matrices and you can continue your analysis.
However, having an unbalanced design is problematic. Possible solutions include:
There are four different types of multivariate statistics that can be
used for computing MANOVA. These are: Pillai
(Pillai
Trace), Wilks
(Wilk’s Lambda),
Hotelling-Lawley
(Hotelling-Lawley Trace), or
Roy
(Roy’s Maximum Root).
The most commonly recommended multivariate statistic to use is Wilks’ Lambda. However, Pillai’s Trace is more robust and is recommended when you have unbalanced design and also have a statistically significant Box’s M result (as in our example, see previous section).
Note that, Pillai
is the default in the R
Manova()
function [car
package].
model <- lm(cbind(Sepal.Length, Petal.Length) ~ Species, iris2)
Manova(model, test.statistic = "Pillai")
##
## Type II MANOVA Tests: Pillai test statistic
## Df test stat approx F num Df den Df Pr(>F)
## Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There was a statistically significant difference between the Species on the combined dependent variables (Sepal.Length and Petal.Length), F(4, 294) = 71.829, p < 0.0001.
To compute the effect size, we may use the eta_squared()
function.
effectsize::eta_squared(model, partial = FALSE)
## # Effect Size for ANOVA (Type I)
##
## Response | Parameter | Eta2 | 95% CI
## ----------------------------------------------
## Sepal.Length | Species | 0.62 | [0.54, 1.00]
## Petal.Length | Species | 0.94 | [0.93, 1.00]
Note that for one-way between subjects designs (one-way MANOVA),
partial eta squared is equivalent to eta squared. According to the
values above, the effects are considered large for both
Sepal.Length
and Petal.Length
.
A statistically significant one-way MANOVA can be followed up by univariate one-way ANOVA examining, separately, each dependent variable. The goal is to identify the specific dependent variables that contributed to the significant global effect.
Procedure:
Note that, there are different R function to compute one-way ANOVA depending whether the assumptions are met or not:
anova_test()
[rstatix
]: can be used when
normality and homogeneity of variance assumptions are metwelch_anova_test()
[rstatix
]: can be used
when the homogeneity of variance assumption is violated, as in our
example.kruskal_test()
[rstatix
]: Kruskal-Wallis
test, a non parametric alternative of one-way ANOVA testThe following R codes shows how to use each of these functions:
# Group the data by variable
grouped.data <- iris2 %>%
gather(key = "variable", value = "value", Sepal.Length, Petal.Length) %>%
group_by(variable)
# Do welch one way anova test
grouped.data %>% welch_anova_test(value ~ Species)
## # A tibble: 2 × 8
## variable .y. n statistic DFn DFd p method
## * <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Petal.Length value 150 1828. 2 78.1 2.69e-66 Welch ANOVA
## 2 Sepal.Length value 150 139. 2 92.2 1.51e-28 Welch ANOVA
# or do Kruskal-Wallis test
grouped.data %>% kruskal_test(value ~ Species)
## # A tibble: 2 × 7
## variable .y. n statistic df p method
## * <chr> <chr> <int> <dbl> <int> <dbl> <chr>
## 1 Petal.Length value 150 130. 2 4.80e-29 Kruskal-Wallis
## 2 Sepal.Length value 150 96.9 2 8.92e-22 Kruskal-Wallis
# or use aov()
grouped.data %>% anova_test(value ~ Species)
## # A tibble: 2 × 8
## variable Effect DFn DFd F p `p<.05` ges
## * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Petal.Length Species 2 147 1180. 2.86e-91 * 0.941
## 2 Sepal.Length Species 2 147 119. 1.67e-31 * 0.619
Let’s just focus on the results of anova_test()
There was a statistically significant difference in Sepal.Length (F(2, 147) = 119, p < 0.0001 ) and Petal.Length (F(2, 147) = 1180, p < 0.0001 ) between iris Species.
Note that, as we have two dependent variables, we need to apply one of the methods for significance level correction, such as Bonferroni multiple testing correction by decreasing the he level we declare statistical significance. This is done by dividing classic alpha level (0.05) by the number of tests (or dependent variables, here 2). This leads to a significance acceptance criteria of p < 0.025 rather than p < 0.05 because there are two dependent variables.
A statistically significant univariate ANOVA can be followed up by multiple pairwise comparisons to determine which groups are different.
The R functions tukey_hsd()
[rstatix
package] can be used to compute Tukey post-hoc tests if the homogeneity
of variance assumption is met.
If you had violated the assumption of homogeneity of variances, as in
our example, you might prefer to run a Games-Howell post-hoc test. It’s
also possible to use the function pairwise_t_test()
[rstatix
] with the option pool.sd = FALSE
and
var.equal = FALSE
.
pwc <- iris2 %>%
gather(key = "variables", value = "value", Sepal.Length, Petal.Length) %>%
group_by(variables) %>%
games_howell_test(value ~ Species)
pwc
## # A tibble: 6 × 9
## variables .y. group1 group2 estimate conf.low conf.high p.adj
## * <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Petal.Length value setosa versicolor 2.80 2.63 2.97 1.85e-11
## 2 Petal.Length value setosa virginica 4.09 3.89 4.29 1.68e-11
## 3 Petal.Length value versicolor virginica 1.29 1.05 1.54 4.45e-10
## 4 Sepal.Length value setosa versicolor 0.93 0.719 1.14 2.86e-10
## 5 Sepal.Length value setosa virginica 1.58 1.34 1.83 0
## 6 Sepal.Length value versicolor virginica 0.652 0.376 0.928 5.58e- 7
## # ℹ 1 more variable: p.adj.signif <chr>
All pairwise comparisons were significant for each of the outcome
variable (Sepal.Length
and Petal.Length
).
A one-way multivariate analysis of variance was performed to determine the effect of iris Species on Sepal.Length and Petal.Length. There are three different species: setosa, versicolor and virginica.
There was a statistically significant difference between the Species on the combined dependent variables (Sepal.Length and Petal.Length), F(4, 294) = 71.829, p < 0.0001.
Follow-up univariate ANOVAs, employing a Bonferroni-adjusted significance level of 0.025, revealed statistically significant differences in Sepal.Length (F(2, 147) = 119, p < 0.0001) and Petal.Length (F(2, 147) = 1180, p < 0.0001) across the iris species.
For Sepal.Length, the mean values indicate that the virginica species exhibits the highest average length at 6.59, followed by versicolor with a mean of 5.94, and setosa with the shortest average length of 5.01. Similarly, in the case of Petal.Length, virginica again presents the highest mean value at 5.55, followed by versicolor at 4.26, and setosa with the lowest average length of 1.46.
All pairwise comparisons between the species groups were significant for both outcome variables, Sepal.Length and Petal.Length, indicating distinct differences in these measurements among the iris species.