The multivariate EDA of the penguins data brought some simplicity to understanding the quantitative variables of body mass, flipper length and the two culmen variables. One main virtue of the graphical methods used there was that species could be represented visually in the plot and annotated with separate data ellipses and regression lines, even though the principal component analysis and biplot did not include species in the analysis.
An obvious next step, and the one that brought the iris data to prominence, is to ask two related multivariate model questions:
MANOVA: To what extent can we be confident that the species means on these variables differ reliably in multivariate space? A multivariate analysis of variance (MANOVA) tests group mean differences simultaneously, without appeal to univariate tests or correction for multiple testing.
Discriminant analysis: What weighted sums of the variables best discriminate among the penguin species? The answer is that the discriminant weights are found in the MANOVA, and are the weights that maximize the multivariate test statistics.
These ideas are implemented in the heplots (Fox and Friendly 2014) and candisc (Friendly and Fox 2013) packages and described in more detail in other articles (Friendly 2007; Friendly, Monette, and Fox 2013).
But there is more:
HE plots: Hypothesis-Error plots use ellipses to show the strength of evidence for any multivariate hypothesis.
Canonical views: Project the data and ellipses into the low-dimensional space that shows the largest multivariate effects.
data(penguins, package="palmerpenguins")
As before, we clean the data: abbreviating variable names, making some character variables factors, and removing missing data.
library(dplyr)   # for the pipe and the rename/mutate/filter verbs

peng <- penguins %>%
  rename(
    culmen_length  = culmen_length_mm,
    culmen_depth   = culmen_depth_mm,
    flipper_length = flipper_length_mm,
    body_mass      = body_mass_g
  ) %>%
  mutate(species = as.factor(species),
         island  = as.factor(island),
         sex     = as.factor(substr(sex, 1, 1))) %>%
  filter(!is.na(culmen_depth),
         !is.na(sex))
library(car)
library(heplots)
library(candisc)
Let’s assume that the goal of an analysis is to determine whether and how the penguin species differ in terms of the four quantitative variables. A naive multivariate linear model is the following, with species as the only predictor.
peng.mod0 <- lm(cbind(culmen_length, culmen_depth, flipper_length, body_mass) ~
                  species, data=peng)
Anova(peng.mod0)
##
## Type II MANOVA Tests: Pillai test statistic
## Df test stat approx F num Df den Df Pr(>F)
## species 2 1.64 371 8 656 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In a univariate ANOVA, the \(F\)-test is based on the variability between groups relative to the variability within groups \[ F = \frac{SS_H/df_h}{SS_E / df_e} \] The analogous multivariate tests are based on the same idea, applied to the between-group sum of squares and cross-products (SSP) matrix, \(\mathbf{H}\), relative to the within-group SSP matrix, \(\mathbf{E}\).
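These \(\mathbf{H}\) and \(\mathbf{E}\) matrices can be inspected directly. Here is a minimal sketch, assuming (as in car) that the object returned by Anova() for a multivariate linear model stores them in its SSP and SSPE components:
# SSP matrices underlying the MANOVA tests for peng.mod0
peng.aov <- Anova(peng.mod0)
peng.aov$SSP$species   # H: between-species SSP matrix
peng.aov$SSPE          # E: within-species (residual) SSP matrix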
To see this visually, consider a plot of the two culmen beak variables where we show only the data ellipses for each species and the pooled data ellipse of the residuals.
cols <- c(scales::hue_pal()(3), "black")
covEllipses(peng[, 3:4], peng$species,
            col=cols,
            fill=c(rep(FALSE, 3), TRUE), fill.alpha=.1)
In the HE plot below:

The differences between species means are shown by what we call the \(\mathbf{H}\) ellipsoid, the data ellipsoid of the fitted (predicted) values under the model. In 2D, with a two degree-of-freedom test, \(\mathbf{H}\) has two dimensions and appears as a simple ellipse.

The variation within species is reflected in the \(\mathbf{E}\) ellipse, which is just the pooled data ellipse of the groups, translated to the grand means, or equivalently, the data ellipse of the residuals. By default, the size of the ellipse is set to cover 68% of the observations in a bivariate normal sample, an analog of a \(\bar{y} \pm 1 s\) univariate interval. With this 68% interval, you can “read” the residual standard deviation as the half-length of the shadow of the \(\mathbf{E}\) ellipse on any axis. Translating this ellipse to the grand means allows us to show the group centroids on the same scale, facilitating interpretation.

The orientation of the \(\mathbf{H}\) ellipse reflects the negative correlation of the species means: in general, species with larger culmen depth have smaller culmen length.
heplot(peng.mod0, fill=TRUE, fill.alpha=0.1,
size="effect",
xlim=c(35,52), ylim=c(14,20))
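As a check on this “shadow” interpretation, the residual standard deviations for each response can be computed directly from the model; a minimal sketch:
# residual SDs of each response: the values the E ellipse shadows represent
sqrt(diag(crossprod(residuals(peng.mod0)) / peng.mod0$df.residual))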
But the overall size of the \(\mathbf{H}\) ellipse relative to that of \(\mathbf{E}\) is crucial, and heplot() provides two ways to scale it:

With effect-size scaling (size="effect", used in the plot above), \(\mathbf{H}\) is scaled as \(\mathbf{H}/df_e\) so that it is directly comparable to \(\mathbf{E}\). A visual estimate from the plot is that the variation of the species means is about ES = 1.5 times the within-species (residual) variation. This is a large effect!

The default for heplot() is what we call significance (or “evidence”) scaling. That is, the relative size of the \(\mathbf{H}\) ellipse is set so that, in the full \(p\)-dimensional space of all \(p\) response variables, the \(\mathbf{H}\) ellipse will protrude somewhere outside the \(\mathbf{E}\) ellipse if and only if the test for the effect is significant (by Roy’s largest root test) at a prescribed \(\alpha\) level (0.05 by default). This scaling uses \(\mathbf{H} / (\lambda_\alpha \, df_e)\), where \(\lambda_\alpha\) is the critical value of Roy’s test statistic.

heplot(peng.mod0, fill=TRUE, fill.alpha=0.1)
The MANOVA test is so overwhelmingly significant that I will stick to effect-size scaling to preserve resolution.
One lovely virtue of the standard linear model fit with lm() is that any ANOVA effect of \(df_h \ge 2\) degrees of freedom can be partitioned into \(df_h\) separate 1 df tests of “linear hypotheses”. If these use orthogonal contrasts, then the separate \(\mathbf{H}\) matrices are additive.
\[\begin{equation*} \mathbf{H} = \mathbf{H}_1 + \mathbf{H}_2 + \cdots + \mathbf{H}_{\textrm{df}_h} \end{equation*}\]
Let’s also assume that, in addition to the overall question, the researchers are interested in two sub-hypotheses: (a) Do Adelie and Chinstrap penguins differ in their means on these variables? (b) Does Gentoo differ from the average of Adelie and Chinstrap?

These questions can be formulated in terms of two orthogonal contrasts for the species factor. The columns give weights specifying differences among the group means.
contrasts(peng$species) <- matrix(c(1, -1, 0,   -1, -1, 2), 3, 2)
contrasts(peng$species)
##           [,1] [,2]
## Adelie       1   -1
## Chinstrap   -1   -1
## Gentoo       0    2
In the heplots package, such contrasts are specified as a list of tests evaluated by car::linearHypothesis(). Each of these generates an \(\mathbf{H}\) matrix of rank 1, so these will appear as lines in an HE plot.
# refit the model so that it uses the new contrasts for species
peng.mod0 <- lm(cbind(culmen_length, culmen_depth, flipper_length, body_mass) ~ species, data=peng)
hyp <- list("A:C"="species1","AC:G"="species2")
heplot(peng.mod0, fill=TRUE, fill.alpha=0.2,
hypotheses=hyp, size="effect")
This is quite a nice result. It is clear that Adelie and Chinstrap differ only in the length of the culmen, while Gentoo penguins differ from the other two in having a longer, but less deep, culmen.
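As a quick check on the rank-1 claim above, the \(\mathbf{H}\) matrix for one of these tests can be extracted; a minimal sketch, assuming linearHypothesis() returns the hypothesis SSP matrix in an SSPH component:
# H matrix for the Adelie vs. Chinstrap contrast, and its rank
H1 <- linearHypothesis(peng.mod0, "species1")$SSPH
qr(H1)$rank   # 1: a rank-1 H ellipsoid degenerates to a line in the HE plot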
heplot() gives 2D plots for any pair of response variables. For flipper_length and body_mass we get the plot below. It is readily seen that these two variables are highly correlated in their species means. We can interpret the major axis of this \(\mathbf{H}\) ellipse as a measure of penguin size, with Gentoo being the largest.
heplot(peng.mod0, variables=3:4,
size="effect",
fill=TRUE, fill.alpha=0.2)
The pairs.mlm() method for mlm objects gives all pairwise HE plots in a scatterplot matrix format.
pairs(peng.mod0, size="effect", fill=c(TRUE, FALSE))
In these pairwise views, we see some new things:
The within-group residuals are positively correlated for all pairs. This makes sense to me.
As we saw above, group means for flipper length and body mass are positively correlated, but
There are negative correlations of culmen depth with the two size variables, and
Between species, culmen depth is almost perfectly correlated with body size. Time to call in the biologists!
Finally, there is an heplot3d() function that shows the \(\mathbf{H}\) and \(\mathbf{E}\) ellipsoids in 3D space. This plot should be interactive in an HTML document (drag to rotate, zoom in/out with the mouse wheel).
next3d()
heplot3d(peng.mod0, fill=TRUE, fill.alpha=0.2, size="effect")
If you rotate the plot, you will see that the H ellipsoid is actually 2-dimensional, because there are \(g-1 = 2\) df.
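This degeneracy can also be checked numerically; a minimal sketch, again taking the species \(\mathbf{H}\) matrix from car::Anova():
# the 4 x 4 H matrix for species has rank g - 1 = 2,
# so its ellipsoid is flat (2-dimensional) in the 4D response space
H <- Anova(peng.mod0)$SSP$species
qr(H)$rank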
A careful researcher would also wonder, from a modeling perspective, whether differences among species could be accounted for by possible confounders: island and sex.
peng.mod1 <- lm(cbind(culmen_length, culmen_depth, flipper_length, body_mass) ~
                  species + island + sex, data=peng)
Anova(peng.mod1)
##
## Type II MANOVA Tests: Pillai test statistic
## Df test stat approx F num Df den Df Pr(>F)
## species 2 1.493 239.3 8 650 <2e-16 ***
## island 2 0.029 1.2 8 650 0.31
## sex 1 0.642 145.1 4 324 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
island seems not to be important, so we drop it from the model.
peng.mod2 <- lm(cbind(culmen_length, culmen_depth, flipper_length, body_mass) ~
                  species + sex, data=peng)
Anova(peng.mod2)
##
## Type II MANOVA Tests: Pillai test statistic
## Df test stat approx F num Df den Df Pr(>F)
## species 2 1.66 392 8 654 <2e-16 ***
## sex 1 0.64 145 4 326 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
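As a cross-check on dropping island, the two nested multivariate models can also be compared directly with a multivariate test; a minimal sketch using the anova() method for mlm objects (Pillai test by default):
# test of the island terms: the model without island vs. the model with it
anova(peng.mod2, peng.mod1)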
heplot(peng.mod2, size="effect",
fill=TRUE, fill.alpha=0.2)
So, in addition to the differences among species means, we also see that male penguins have both larger and deeper culmens. We could go further and test for an interaction of species with sex, using cbind(...) ~ species * sex for the model and examining the species:sex term.
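A minimal sketch of that follow-up model (the name peng.mod3 is assumed; the species:sex line of the Anova output tests the interaction):
# add the species-by-sex interaction to the two-factor model
peng.mod3 <- lm(cbind(culmen_length, culmen_depth, flipper_length, body_mass) ~
                  species * sex, data=peng)
Anova(peng.mod3)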
The heplot(), pairs.mlm(), and heplot3d() functions provide 2D and 3D views of all effects in a MLM in variable space. Alternatively, canonical discriminant analysis (CDA) provides a low-D visualization of between-group variation and vectors reflecting the weights of the response variables on these dimensions in canonical space.
You can imagine this as rotating the 4D space of all four variables, and projecting the observations on the plane that shows the largest variation in the species means.
In this space, the response variables can be shown as vectors, whose angles relative to the canonical dimensions reflect their correlations. The relative lengths of these vectors indicate their contribution to discrimination among species.
This is analogous to a biplot often used in conjunction with a principal component analysis. See: penguins_biplots
The printed output of the result of candisc() shows that 86.5% of the between-species mean differences can be accounted for by one linear combination of the variables, an overall measure of penguin size. The second canonical variable accounts for the remaining 13.5%.
(peng.can <- candisc(peng.mod0))
##
## Canonical Discriminant Analysis for species:
##
## CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.938 15.03 12.7 86.5 86.5
## 2 0.700 2.34 12.7 13.5 100.0
##
## Test of H0: The canonical correlations in the
## current row and all that follow are zero
##
## LR test stat approx F numDF denDF Pr(> F)
## 1 0.0187 516 8 654 <2e-16 ***
## 2 0.2997 255 3 328 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
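The variable vectors in the plot below are based on the correlations of each response with the canonical scores; a minimal sketch, assuming the candisc object stores these in its structure component and the standardized weights in coeffs.std:
# correlations of the responses with Can1 and Can2 (the variable vectors)
peng.can$structure
# standardized canonical (discriminant) coefficients
peng.can$coeffs.std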
The plot() method for candisc objects plots the canonical discriminant scores in 2D space, with the data ellipses for each group. The variable vectors reflect the correlations of each variable with the two canonical dimensions.
col <- scales::hue_pal()(3)
plot(peng.can, col=col, pch=15:17,
ellipse=TRUE, scale=9,
var.lwd = 2, var.col = "black")
From this we can see:
The first dimension (Can1) separates Gentoo from the other two species. Can2 largely separates Adelie from Chinstrap.
Can1 is positively correlated with flipper length and body mass, our penguin “size” variables. But culmen depth also separates Gentoo from the others, in the opposite direction.
Can2 here largely reflects culmen length, with Chinstrap having longer beaks than the other two.
Fox, John, and Michael Friendly. 2014. Heplots: Visualizing Hypothesis Tests in Multivariate Linear Models. http://CRAN.R-project.org/package=heplots.
Friendly, Michael. 2007. “HE Plots for Multivariate General Linear Models.” Journal of Computational and Graphical Statistics 16 (2): 421–44. https://doi.org/10.1198/106186007X208407.
Friendly, Michael, and John Fox. 2013. Candisc: Visualizing Generalized Canonical Discriminant and Canonical Correlation Analysis. http://CRAN.R-project.org/package=candisc.
Friendly, Michael, Georges Monette, and John Fox. 2013. “Elliptical Insights: Understanding Statistical Methods Through Elliptical Geometry.” Statistical Science 28 (1): 1–39. https://doi.org/10.1214/12-STS402.