\[ \def\vec#1{\mathbf{#1}} \def\mat#1{\mathbf{#1}} \def\H{\mathbf{H}} \def\E{\mathbf{E}} \def\trans{^\mathsf{T}} % transpose \def\period{\:\: .} \def\comma{\:\: ,} \def\inv#1{\mat{#1}^{-1}} \def\half#1{\mat{#1}^{-1/2}} \]
This article illustrates some graphical methods, developed over the last ten years, that aid in understanding and communicating the results of multivariate linear models (Friendly 2007; Friendly, Monette, and Fox 2013). Just as a boxplot is a visual summary of the mean and variance of a 1D sample, these methods rely on data ellipsoids as simple, minimally sufficient visualizations of means and (co-)variance that can be shown in 2D and 3D plots.
I don’t describe the theory behind these methods here. Instead, I illustrate the use of the R packages car, heplots, and candisc applied to a classic data set often used as an example of multivariate analysis of variance (MANOVA), discriminant analysis, and principal component analysis. More details and a wide range of examples are given in Friendly and Sigal (2017).
Imagine you are a plant biologist with measurements on four characteristics of several species of a kind of flower. Can you devise a rule or test to distinguish one species from another? If so, what are the weights for the variables that best distinguish among the species? These were the problems that Edgar Anderson (Anderson 1935) faced in 1935 when he collected data on three species of iris flowers found on the Gaspé Peninsula of Quebec, Canada. The species are “Setosa”, “Versicolor”, and “Virginica”, and he carefully measured the length and width of two parts of each specimen: the flower petals and sepals (the green leaves that enclose the flower).
The iris dataset became famous in the history of statistics because it was soon used by R. A. Fisher (Fisher 1936) to introduce the method of discriminant analysis, which answered Anderson’s questions. Implicit here was the more basic question: Are the differences among the means for the species large enough (relative to within-species variability) to conclude that they differ significantly on this collection of measurements? This is the question now addressed under the topic of multivariate analysis of variance (MANOVA).
Three species of irises in the Anderson/Fisher data set: setosa (left), versicolor (center), and virginica (right).
For such a dataset, perhaps the easiest thing to do is to prepare univariate plots (boxplots), bivariate plots (a scatterplot matrix), and analyses (ANOVAs) for each response variable separately. This gives some information, but it is usually not comprehensive enough to understand the multivariate responses: how they are related and how they jointly distinguish among the groups.
First, define nice colors to distinguish the iris species. We use the equivalent of the default colors for a factor in ggplot2 (based on equally spaced hues around the color wheel, starting from a hue of 15 degrees).
gg_color_hue <- function(n) {
  # n equally spaced hues around the HCL color wheel, as ggplot2 does
  hues <- seq(15, 375, length.out = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}
(col <- gg_color_hue(3))
## [1] "#F8766D" "#00BA38" "#619CFF"
We use the car::Boxplot function to show the distributions of the iris size measurements, with a separate panel for each measure. All the measures are in cm, but it is more useful to plot them on separate scales to compare the species on each size variable. From this, we can see that the species means are ordered setosa < versicolor < virginica on all variables except Sepal.Width.
op <- par(mfrow = c(1, 4))
for (response in names(iris)[1:4]) {
  Boxplot(iris[, response] ~ Species, data = iris,
          ylab = NULL,
          axes = FALSE,
          col = col,
          main = response,
          cex.lab = 1.5)
  box()
  axis(2)
  axis(1, at = 1:3, labels = c("setosa", "vers.", "virginica"))
}
par(op)
A scatterplot matrix shows all pairwise scatterplots. The base R function is pairs, but car::scatterplotMatrix provides data ellipses for each group as well as other simple enhancements. Again, we note that the pattern of group means is different for Sepal.Width than for the other variables.
scatterplotMatrix(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width | Species,
                  data = iris,
                  smooth = FALSE, regLine = FALSE,
                  ellipse = list(levels = 0.68,
                                 fill = TRUE,
                                 fill.alpha = 0.2),
                  by.groups = TRUE,
                  diagonal = FALSE,
                  label.pos = 0.5,
                  col = col,
                  pch = 15:17,
                  cex = 0.9,
                  legend = list(coords = "bottomleft",
                                cex = 2,
                                title = "Species"))
The data ellipses show something new: the bivariate scatter (variances and covariances, hence correlations) also differs among the species. Generally, the variances for Setosa are smaller than those of the other two species, and the within-group correlations for Setosa also seem different. This is a topic we explore elsewhere (Friendly and Sigal 2018), in visualization methods for the assumption of homogeneity of covariance matrices in MANOVA models.
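This visual impression can also be checked numerically. As a minimal sketch (assuming the heplots package is loaded, whose boxM() function implements Box’s M test), we can test equality of the three within-species covariance matrices:

library(heplots)
# Box's M test of the null hypothesis that the within-species
# covariance matrices are all equal
boxM(iris[, 1:4], iris$Species)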
What we cannot see from these univariate or bivariate views is how the size variables contribute to distinguishing among the species and how they relate collectively to an overall MANOVA test.
A standard MANOVA uses the lm() function, where the left-hand side (Y) of the model formula Y ~ X comprises a matrix of response variables, and the right-hand side (X) is specified exactly as in model formulas for univariate problems.
iris.mod <- lm(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~
                 Species, data = iris)
Anova(iris.mod)
##
## Type II MANOVA Tests: Pillai test statistic
## Df test stat approx F num Df den Df Pr(>F)
## Species 2 1.19 53.5 8 290 <2e-16
In this case, the evidence for significant differences among the means of the iris species is overwhelming, but even in such a clear case it is hard to specify *exactly* what that amounts to. HE plots give a visual summary of the relations of the response variables and how they contribute to tests of group differences.
The idea of the HE plot is to take a bivariate or multivariate view of this problem and focus visual attention on the important aspects: the variation among the group means (the \(\mat{H}\) ellipse) relative to the variation within groups (the \(\mat{E}\) ellipse).
To illustrate this, the figure below shows a scatterplot of the data for the relationship between sepal width and sepal length next to the analogous HE plot. We can show these views for (a) a pair of variables, (b) all pairs in a scatterplot matrix format, or (c) a low-D “canonical” view showing the dimensions that most discriminate among the groups.
# (a) Scatterplot with data ellipses
car::scatterplot(Sepal.Width ~ Sepal.Length | Species, data = iris,
                 smooth = FALSE, regLine = FALSE,
                 ellipse = list(levels = 0.68, fill = TRUE, fill.alpha = 0.2),
                 legend = list(coords = "topright", cex = 1.2, title = "Species"),
                 col = col, pch = 15:17, grid = FALSE,
                 xlim = c(2, 9), ylim = c(1, 5))
# (b) HE plot
heplot(iris.mod, fill=c(TRUE, FALSE))
In the HE plot on the right:

- The differences among the species means are shown by what we call the \(\mat{H}\) ellipsoid, the data ellipsoid of the fitted (predicted) values under the model. In 2D, with a two degree-of-freedom test, \(\mat{H}\) has two dimensions and appears as a simple ellipse.

- The variation within species is reflected in the \(\mat{E}\) ellipse, which is just the pooled data ellipses of the groups, translated to the grand means, or equivalently, the data ellipse of the residuals. By default, the size of the ellipse is set to cover 68% of the observations in a bivariate normal sample, an analog of a \(\bar{y} \pm 1 s\) univariate interval. With this 68% interval, you can “read” the residual standard deviation as the half-length of the shadow of the \(\mat{E}\) ellipse on any axis. Translating this ellipse to the grand means allows us to show the group centroids on the same scale, facilitating interpretation.

- The orientation of the \(\mat{H}\) ellipse reflects the negative correlation of the species means: in general, species with larger sepal length have smaller sepal width.

But the overall size of the \(\mat{H}\) ellipse relative to that of \(\mat{E}\) is crucial:
- One simple choice, effect-size scaling, uses \(\mat{H}/df_e\) to put \(\mat{H}\) on the same scale as the \(\mat{E}\) ellipse. This is analogous to an effect-size index used in univariate models, e.g., \(ES = (\bar{y}_1 - \bar{y}_2) / s\) in a two-group design. (A small computational sketch of these matrices follows this list.)

- The default used in heplot() is what we call significance (or “evidence”) scaling: the relative size of the \(\mat{H}\) ellipse is set so that, in the full space of all \(p\) response variables, the \(\mat{H}\) ellipse will protrude somewhere outside the \(\mat{E}\) ellipse if and only if the test for the effect is significant (by Roy’s largest root test) at a prescribed \(\alpha\)-level (0.05 by default). This scaling uses \(\mat{H}/(\lambda_\alpha \, df_e)\), where \(\lambda_\alpha\) is the critical value of Roy’s test statistic at level \(\alpha\).
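To make the scalings concrete, here is a minimal sketch of one way to compute the \(\mat{H}\) and \(\mat{E}\) sum-of-squares-and-products (SSP) matrices by hand from the fitted model (an illustration, not a claim about how heplot() computes them internally):

# E: SSP matrix of the residuals
E <- crossprod(residuals(iris.mod))
# H: SSP matrix of the fitted values about the grand means
H <- crossprod(scale(fitted(iris.mod), scale = FALSE))
dfe <- nrow(iris) - 3   # error df: n minus the 3 species
H / dfe                 # effect-size scaling, comparable to E / dfe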
The plots below compare the two scalings using the same (x, y) limits in both plots. Considering the relative areas of the ellipses, a verbal interpretation might be:
- The differences among the iris species in this view are overwhelmingly significant, and the greatest differences are between setosa and the other two species.

- To a rough approximation, the visual effect size is about 1.0: the differences among the group means are about the same size as the within-group standard deviation projected on any axis. In conventional terms (e.g., Cohen’s \(d\)), this is considered a large effect.
# (a) Significance scaling
op <- par(mar = c(4, 4, 1, 1) + 0.1)
res <- heplot(iris.mod,
              fill = TRUE, fill.alpha = c(0.3, 0.1),
              cex = 1.25, cex.lab = 1.5)
label <- expression(paste("Significance scaling: ", H / lambda[alpha], df[e]))
text(7.5, 4.5, label, cex = 1.5)

# (b) Effect size scaling
heplot(iris.mod, size = "effect",
       fill = TRUE, fill.alpha = c(0.3, 0.1),
       cex = 1.25, cex.lab = 1.5,
       xlim = res$xlim, ylim = res$ylim)
label <- expression(paste("Effect size scaling: ", H / df[e]))
text(7.5, 4.5, label, cex = 1.5)
par(op)
One lovely virtue of the standard linear model fit with lm() is that any ANOVA effect with \(df_h \ge 2\) degrees of freedom can be partitioned into \(df_h\) separate 1-df tests of “linear hypotheses” using orthogonal contrasts.
This is lovely both mathematically and visually. Each contrast \(i\) corresponds to a rank-1 matrix \(\mat{H}_i\); together these additively decompose the overall rank-\(df_h\) hypothesis SSP matrix \(\mat{H}\) as
\[\begin{equation*} \mat{H} = \mat{H}_1 + \mat{H}_2 + \cdots + \mat{H}_{\textrm{df}_h} \comma \end{equation*}\] exactly as the univariate \(SS_H\) may be decomposed in an ANOVA. Each rank-1 \(\mat{H}_i\) matrix plots as a degenerate ellipse, that is, a line, in an HE plot. Their collection provides a visual summary of the overall test, as partitioned by these orthogonal contrasts, but more importantly, an “explanation” of the overall test in terms of answers to \(\textrm{df}_h\) independent questions.
Here, we choose one contrast comparing Versicolor to Virginica, and a second (orthogonal) contrast comparing Setosa to the average of the other two species. Contrasts are assigned to a factor using the contrasts() function on the left-hand side of an assignment.
# HE plots: testing linear hypotheses for contrasts
C <- matrix(c(0, -1,  1,
              2, -1, -1), nrow = 3, ncol = 2)
contrasts(iris$Species) <- C
contrasts(iris$Species)
## [,1] [,2]
## setosa 0 2
## versicolor -1 -1
## virginica 1 -1
iris.mod <- lm(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~
                 Species, data = iris)
In the heplots package, contrasts are specified as a list of tests evaluated by car::linearHypothesis. Each of these generates an \(\mat{H}\) matrix of rank 1, so each appears as a line in an HE plot.
# add tests for contrasts
hyp <- list("V:V"  = "Species1",  # Versicolor vs. Virginica
            "S:VV" = "Species2")  # Setosa vs. the average of the other two
heplot(iris.mod, hypotheses = hyp)
When the contrasts are orthogonal (as here), the corresponding \(\mat{H}_i\) lines form conjugate axes of the \(\mat{H}\) ellipse for the overall Species effect. Their relative lengths show the portion of the overall effect each accounts for. Here, the largest portion is attributable to the difference of Setosa from the other two species. Many more details of the connections between the geometry of ellipses and statistics are described by Friendly, Monette, and Fox (2013).
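We can also verify the additive decomposition numerically. A minimal sketch, assuming the SSPH component of the object returned by car::linearHypothesis() for a multivariate model:

lh1 <- linearHypothesis(iris.mod, "Species1")                # Versicolor vs. Virginica
lh2 <- linearHypothesis(iris.mod, "Species2")                # Setosa vs. the others
lhS <- linearHypothesis(iris.mod, c("Species1", "Species2")) # overall Species effect
# with orthogonal contrasts and balanced groups, the rank-1 H matrices
# should sum exactly to the overall H matrix
all.equal(lh1$SSPH + lh2$SSPH, lhS$SSPH)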
A pairs() plot for an mlm object invokes the pairs.mlm() function which constructs a matrix of pairwise HE plots for a multivariate linear model. You can think of this as a high-level summary of the multivariate response variables, showing the contributions of all model terms to multivariate tests.
pairs(iris.mod, hypotheses = hyp, hyp.labels = FALSE,
      fill = TRUE, fill.alpha = 0.1)
There is something entirely new shown here: for most pairs of variables, a larger mean on the horizontal variable for a species goes with a larger mean on the vertical one, and the group means are nearly perfectly correlated. This holds for all pairs except those involving Sepal.Width, where the direction is reversed.
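A quick numeric check of this impression: compute the species means and correlate them across the variables (with only three means per variable these correlations are crude, but they show the pattern):

# means of the four size variables for each species
means <- aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
# correlations of the species means across pairs of variables
cor(means[, -1])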
The heplot(), pairs.mlm(), and heplot3d() functions provide 2D and 3D views of all effects in an MLM in variable space. Alternatively, canonical discriminant analysis (CDA) provides a low-D visualization of between-group variation, together with vectors reflecting the weights of the response variables on these dimensions in canonical space.
You can imagine this as rotating the 4D space of the four size variables and projecting the observations onto the plane that shows the largest variation among the species means. This is analogous to the biplot often used in conjunction with a principal component analysis.
The printed output of the result of candisc() shows that over 99% of the between-species mean differences can be accounted for by one linear combination of the size variables.
# canonical discriminant analysis for the Species effect
iris.can <- candisc(iris.mod, data=iris)
iris.can
##
## Canonical Discriminant Analysis for Species:
##
## CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.970 32.192 31.9 99.121 99.1
## 2 0.222 0.285 31.9 0.879 100.0
##
## Test of H0: The canonical correlations in the
## current row and all that follow are zero
##
## LR test stat approx F numDF denDF Pr(> F)
## 1 0.023 199.1 8 288 < 2e-16
## 2 0.778 13.8 3 145 5.8e-08
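The quantities used in the plots below are carried as components of the candisc object; a sketch, assuming the scores and structure components documented for candisc objects:

head(iris.can$scores)   # Species plus canonical scores Can1, Can2
iris.can$structure      # correlations of the size variables with Can1, Can2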
The plot() method for candisc objects plots the canonical discriminant scores in 2D space, with the data ellipses for each group. The variable vectors reflect the correlations of each size variable with the two canonical dimensions.
plot(iris.can, col = col, pch = 15:17, rev.axes = c(TRUE, FALSE),
     ellipse = TRUE, scale = 4,
     var.lwd = 2, var.col = "black")
The interpretation here is quite simple: the first canonical dimension (Can1), accounting for 99.1% of the mean differences, largely reflects the overall size of each iris flower. The second dimension (Can2) is statistically significant but practically unimportant, as it accounts for only 0.9% of the mean differences. It is most associated with Sepal.Width, the variable that goes in the opposite direction from the other three on Can1.
Because this canonical structure is essentially one-dimensional, a 1D version might be a sufficient visual summary. In this case, plot.candisc() using which=1 displays boxplots for the Can1 scores together with the structure coefficients (correlations) of the size variables with this dimension.
plot(iris.can, which = 1,
     points.1d = TRUE, pch = 15:17, rev.axes = TRUE,
     var.lwd = 2, var.col = "black")
This focuses attention on the main story here: the iris species differ mainly in overall size, and Sepal.Length and the two petal variables are positively correlated with this dimension, while Sepal.Width has a negative correlation.
Finally, in cases where there are more groups and/or more response variables, an HE plot in canonical discriminant space is a useful and compact visual summary. This is provided by the heplot method for candisc objects.
heplot(iris.can, rev.axes = c(TRUE, FALSE),
       fill = TRUE, fill.alpha = 0.2,
       scale = 30)
Graphical methods for univariate linear models are relatively well developed and widely used, but comparable methods for multivariate response models are relatively recent. This article illustrates a number of extensions of these methods to the case of multivariate responses in a simple one-way MANOVA design. The HE plot framework illustrated here provides new methods for graphical insight into multivariate data in the context of linear models.
Anderson, E. 1935. “The Irises of the Gaspé Peninsula.” Bulletin of the American Iris Society 35: 2–5.
Fisher, R. A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88.
Friendly, Michael. 2007. “HE Plots for Multivariate General Linear Models.” Journal of Computational and Graphical Statistics 16 (2): 421–44. https://doi.org/10.1198/106186007X208407.
Friendly, Michael, Georges Monette, and John Fox. 2013. “Elliptical Insights: Understanding Statistical Methods Through Elliptical Geometry.” Statistical Science 28 (1): 1–39. https://doi.org/10.1214/12-STS402.
Friendly, Michael, and Matthew Sigal. 2017. “Graphical Methods for Multivariate Linear Models in Psychological Research: An R Tutorial.” The Quantitative Methods for Psychology 13 (1): 20–45. https://doi.org/10.20982/tqmp.13.1.p020.
———. 2018. “Visualizing Tests for Equality of Covariance Matrices.” The American Statistician 72 (4). https://doi.org/10.1080/00031305.2018.1497537.