Introduction

The multivariate EDA of the penguins data brought some simplicity to understanding the quantitative variables of body mass, flipper length and the two culmen variables. One main virtue of the graphical methods used there was that species could be represented visually in the plot and annotated with separate data ellipses and regression lines, even though the principal component analysis and biplot did not include species in the analysis.

An obvious next step, and the one that brought the iris data to prominence is to ask two related multivariate model questions:

These ideas are implemented in the heplots (Fox and Friendly 2014) and candisc (Friendly and Fox 2013) packages and described in more detail in other articles (Friendly 2007; Friendly, Monette, and Fox 2013).

Multivariate visualization

But there is more:

  • HE plots: Hypothesis-Error plots use ellipses to show the strength of evidence for any multivariate hypothesis

  • Canonical views: Project the data and ellipses into the low-dimensional space that shows the largest multivariate effects.

Getting started

data(penguins, package="palmerpenguins")

As before, we clean the data: abbreviating variable names, making some character variables factors, and removing missing data.

peng <- penguins %>%
    rename(
         culmen_length = culmen_length_mm, 
         culmen_depth = culmen_depth_mm, 
         flipper_length = flipper_length_mm, 
         body_mass = body_mass_g
         ) %>%
  mutate(species = as.factor(species),
         island = as.factor(island),
         sex = as.factor(substr(sex,1,1))) %>%
  filter(!is.na(culmen_depth),
         !is.na(sex))

Load our packages

library(car)
library(heplots)
library(candisc)

MANOVA

Let’s assume that the goal of an analysis is to determine whether and how the penguin species differ in terms of the four quantitative variables. A naive multivariate linear model is the following, with species as the only predictor.

peng.mod0 <-lm(cbind(culmen_length, culmen_depth, flipper_length, body_mass) ~
                 species, data=peng)
Anova(peng.mod0)
## 
## Type II MANOVA Tests: Pillai test statistic
##         Df test stat approx F num Df den Df Pr(>F)    
## species  2      1.64      371      8    656 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

HE plots

In a univariate ANOVA, the \(F\)-test is based on the variability between groups relative to the variability within groups \[ F = \frac{SS_H/df_h}{SS_E / df_e} \] The analogous multivariate tests are based on the same idea, applied to the between-group sum of squares and cross-products (SSP) matrix, \(\mathbf{H}\), relative to the within-group SSP matrix, \(\mathbf{E}\).

To see this visually, consider a plot of the two culmen beak variables where we show only the data ellipses for each species and the pooled data ellipse of the residuals.

cols = c(scales::hue_pal()(3), "black")
covEllipses(peng[,3:4], peng$species, 
            col=cols,
            fill=c(rep(FALSE,3), TRUE), fill.alpha=.1)

In the HE plot below,

The orientation of the \(\mathbf{H}\) ellipse reflects the negative correlation of the species means: in general species with larger culmen depth have smaller culmen length.

heplot(peng.mod0, fill=TRUE, fill.alpha=0.1, 
       size="effect",
       xlim=c(35,52), ylim=c(14,20))

But the overall size of the \(\mathbf{H}\) ellipse relative to that of \(\mathbf{E}\) is crucial:

In the plot above, a visual estimate is that variation of the species means is about ES = 1.5 times that of within-species (residual) variance. This is a large effect!

heplot(peng.mod0, fill=TRUE, fill.alpha=0.1)

The MANOVA test is so overwhelmingly significant that I will stick to effect-size scaling to preserve resolution.

Contrasts

One lovely virtue of the standard linear model fit with lm() is that any ANOVA effect of \(df_h \ge 2\) degrees of freedom can be partitioned into \(df_h\) separate 1 df tests of “linear hypotheses”. If these use orthogonal contrasts, then the separate \(\mathbf{H}\) matrices are additive.

\[\begin{equation*} \mathbf{H} = \mathbf{H}_1 + \mathbf{H}_2 + \cdots + \mathbf{H}_{\textrm{df}_h} \end{equation*}\]

Let’s also assume that in addition to the overall question, the researchers are interested in two sub-hypotheses:

  • Do Gentoo penguins differ from the other two?
  • Do Adelie differ from Chinstrap?

These questions can be formulated in terms of two orthogonal contrasts for the species factor. The columns give weights indicating differences between group means.

contrasts(peng$species)<-matrix(c(1,-1,0, -1, -1, -2), 3,2)
contrasts(peng$species)
##           [,1] [,2]
## Adelie       1   -1
## Chinstrap   -1   -1
## Gentoo       0   -2

In the heplots package, contrasts are specified as a list of tests evaluated by car::linearHypothesis. Each of these generate an \(\mathbf{H}\) matrix of rank 1, so these will appear as lines in an HE plot.

peng.mod0 <-lm(cbind(culmen_length, culmen_depth, flipper_length, body_mass) ~ species, data=peng)

hyp <- list("A:C"="species1","AC:G"="species2")
heplot(peng.mod0, fill=TRUE, fill.alpha=0.2, 
       hypotheses=hyp, size="effect")

This is quite a nice result. It is clear that Adelie & Chinstrap differ only in the length of the culmen. Gentoo penguins differ from the other two in having longer, but less deep culmen.

Other plots

heplot gives 2D plots for any pair of response variables. For flipper_length and body_mass we get the plot below. It is readily seen that these two variables are highly correlated in their species means. We can interpret the major axis of this H ellipse as a measure of pengiun size, with Gentoo being the largest.

heplot(peng.mod0, variables=3:4, 
       size="effect",
       fill=TRUE, fill.alpha=0.2)

Pairs plots

The pairs.mlm method for mlm objects gives a all pairwise heplots in a scatterplot matrix format.

pairs(peng.mod0, size="effect", fill=c(TRUE, FALSE))

In these pairwise views, we see some new things:

  • The within-group residuals are positively correlated for all pairs. This makes sense to me.

  • As we saw above, group means for flipper length and body mass are positively correlated, but

  • There are negative correlations of culmen depth with the two size variables, and

  • Between species culmen depth has a correlation \(\approx 1\) with body size. Time to call in the biologists!

3D plots

Finally, there is an heplot3d() fundtion that shows the H and E ellipsoids in 3D space. This plot should be interactive in an HTML document (drag to rotate, zoom in/out with mouse wheel).

next3d()
heplot3d(peng.mod0, fill=TRUE, fill.alpha=0.2, size="effect")

You must enable Javascript to view this page properly.

If you rotate the plot, you will see that the H ellipsoid is actually 2-dimensional, because there are \(g-1 = 2\) df.

Other models

A careful researcher would also wonder, from a modeling perspective, whether differences among species could be accounted for by possible confounders: island and sex.

peng.mod1 <-lm(cbind(culmen_length, culmen_depth, flipper_length, body_mass) ~
                 species + island + sex, data=peng)
Anova(peng.mod1)
## 
## Type II MANOVA Tests: Pillai test statistic
##         Df test stat approx F num Df den Df Pr(>F)    
## species  2     1.493    239.3      8    650 <2e-16 ***
## island   2     0.029      1.2      8    650   0.31    
## sex      1     0.642    145.1      4    324 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

island seems not to be important, so we delete this

peng.mod2 <-lm(cbind(culmen_length, culmen_depth, flipper_length, body_mass) ~
                 species + sex, data=peng)
Anova(peng.mod2)
## 
## Type II MANOVA Tests: Pillai test statistic
##         Df test stat approx F num Df den Df Pr(>F)    
## species  2      1.66      392      8    654 <2e-16 ***
## sex      1      0.64      145      4    326 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
heplot(peng.mod2, size="effect", 
       fill=TRUE, fill.alpha=0.2)

So, in addition to the means among species, we also see that male penguins have both larger and deeper culmens. We could go further to test for an interaction of species with sex, using cbind( ...) ~ species:sex for the model.

Canonical views

The heplot(), pairs.mlm(), and heplot3d() functions provide 2D and 3D views of all effects in a MLM in variable space. Alternatively, canonical discriminant analysis (CDA) provides a low-D visualization of between-group variation and vectors reflecting the weights of the response variables on these dimensions in canonical space.

You can imagine this as rotating the 4D space of all four variables, and projecting the observations on the plane that shows the largest variation in the species means.

In this space, the response variables can be shown as vectors, whose angles relative to the canonical dimensions reflect their correlations. The relative lengths of these vectors indicates their contribution to discrimination among species.

This is analogous to a biplot often used in conjunction with a principal component analysis. See: penguins_biplots

The printed output of the result of candisc() shows that 70% of the between-species mean differences can be accounted for by one linear combination of the size variables. The second canonical variable accounts for the remaining 13.4%.

(peng.can <- candisc(peng.mod0))
## 
## Canonical Discriminant Analysis for species:
## 
##   CanRsq Eigenvalue Difference Percent Cumulative
## 1  0.938      15.03       12.7    86.5       86.5
## 2  0.700       2.34       12.7    13.5      100.0
## 
## Test of H0: The canonical correlations in the 
## current row and all that follow are zero
## 
##   LR test stat approx F numDF denDF Pr(> F)    
## 1       0.0187      516     8   654  <2e-16 ***
## 2       0.2997      255     3   328  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The plot() method for candisc objects plots the canonical discriminant scores in 2D space, with the data ellipses for each group. The variable vectors reflect the correlations of each variable with the two canonical dimensions.

col <- scales::hue_pal()(3)
plot(peng.can, col=col, pch=15:17, 
     ellipse=TRUE, scale=9, 
     var.lwd = 2, var.col = "black")

From this we can see:

References

Fox, John, and Michael Friendly. 2014. Heplots: Visualizing Hypothesis Tests in Multivariate Linear Models. http://CRAN.R-project.org/package=heplots.

Friendly, Michael. 2007. “HE Plots for Multivariate General Linear Models.” Journal of Computational and Graphical Statistics 16 (2): 421–44. https://doi.org/10.1198/106186007X208407.

Friendly, Michael, and John Fox. 2013. Candisc: Visualizing Generalized Canonical Discriminant and Canonical Correlation Analysis. http://CRAN.R-project.org/package=candisc.

Friendly, Michael, Georges Monette, and John Fox. 2013. “Elliptical Insights: Understanding Statistical Methods Through Elliptical Geometry.” Statistical Science 28 (1): 1–39. https://doi.org/http://dx.doi.org/10.1214/12-STS402.