Penguin data: Multivariate EDA

Penguins are multivariate creatures! What can we learn from this dataset without specialized biological knowledge?

This analysis gives a new perspective on this data set. We will see that there are simple multivariate explanations to the variation of pengiuns, both between species and within a given species. You can learn a lot about penguins from the comfort of your PC with a few simple techniques.

Load the `penguins` data and do some clean up

These examples use the basic penguins dataset.

data(penguins, package="palmerpenguins")

For this example:

shorten variable names (remove units) to simplify variable labels,
create factors for character variables (needed for MANOVA), and
remove NA observations (causes problems with PCA)

peng <- penguins %>%
    rename(
         culmen_length = culmen_length_mm, 
         culmen_depth = culmen_depth_mm, 
         flipper_length = flipper_length_mm, 
         body_mass = body_mass_g
         ) %>%
  mutate(species = as.factor(species),
         island = as.factor(island),
         sex = as.factor(substr(sex,1,1))) %>%
  filter(!is.na(culmen_depth))

Take a peek at the tibble:

peng

## # A tibble: 342 x 7
##    species island    culmen_length culmen_depth flipper_length body_mass sex  
##    <fct>   <fct>             <dbl>        <dbl>          <int>     <int> <fct>
##  1 Adelie  Torgersen          39.1         18.7            181      3750 M    
##  2 Adelie  Torgersen          39.5         17.4            186      3800 F    
##  3 Adelie  Torgersen          40.3         18              195      3250 F    
##  4 Adelie  Torgersen          36.7         19.3            193      3450 F    
##  5 Adelie  Torgersen          39.3         20.6            190      3650 M    
##  6 Adelie  Torgersen          38.9         17.8            181      3625 F    
##  7 Adelie  Torgersen          39.2         19.6            195      4675 M    
##  8 Adelie  Torgersen          34.1         18.1            193      3475 <NA> 
##  9 Adelie  Torgersen          42           20.2            190      4250 <NA> 
## 10 Adelie  Torgersen          37.8         17.1            186      3300 <NA> 
## # ... with 332 more rows

Load our packages

library(car)
library(ggbiplot)
library(GGally)

Scatterplot matrix

For multivariate data, rather than making separate bivariate scatterplots a more more comprehensive view is given by a scatterplot matrix – all pairwise plots of a set of quantitative varibles.

In base R, this would be easily done with the pairs() function.

But there are better alternatives:

car::scatterplotMatrix() provides many options for enhancing the basic pairs() plot.
- in particular, in the model formula, x1 + x2 + ... | group, the different groups (species) are annotated and colored separately.
- the diagonal panels show the univariate distributions for each group. This shows a smoothed density estimate for each group, and you can easily see the difference among species in terms of means, variablilty and shape.
- each pairwise panel shows a data ellipse covering ~ 68% of the observations in each group.
GGally::ggpairs extends the idea of pairs plots to include categorical variables along with quantitative ones.

scatterplotMatrix(~ culmen_length + culmen_depth + flipper_length + body_mass | species,
                  data=peng,
                  ellipse=list(levels=0.68),
                  col = scales::hue_pal()(3),
                  legend=list(coords="bottomright"))

From this, we can see some interesting penguin features:

Most of the distributions are approximately unimodal and symmetric within species. One exception is culmen_length for Chinstrap pengiuns.
withi species, all pairs of variables are positively correlated, and the slopes of regression lines are approximately the same within each off-diagonal panel.

ggpairs

GGally::ggpairs is designed to display mixtures of numeric and discrete variables with flexible choices of how variables are displayed. See the vignette [https://ggobi.github.io/ggally/articles/ggpairs.html] for some examples.

Briefly, by default, ggpairs :

displays pairs of quantitative variables in the lower triangle as scatterplots, and the same pairs in the upper triangle with correlation coefficients, both overall and within group.
a pair of one quantitative and one discrete variable are shown either as color-coded histograms (lower triangle) or boxplots (upper triangle).
pairs of two discrete variables are shown with stacked bar charts.

ggpairs(peng, mapping = aes(color = species), 
        columns = c("culmen_length", "culmen_depth", 
                    "flipper_length", "body_mass",
                    "island", "sex"))

This plot is somewhat messy, and could be tweaked further. I’ll just mention one other feature revealed here:

The upper triangle for the quantitative variable shows the overall correlations (ignoring species), and the within-group correlation for each species. There are three cases that illustrate Simpson’s Paradox: where the overall correlation has a different sign from the within-group correlations.

PCA

We can think of principal component analysis as a multivariate juicer: it tries to squeeze as much flavor out of a multivariate sample, when the criterion for “flavor” is the greatest amount of variance accounted for in 1, 2, 3, … dimensions (weighted sums of the variables.)

We run the PCA with prcomp(). Because the variable scales are not all commensurable, it is important to scale them to equal variance.

peng.pca <- prcomp (~ culmen_length + culmen_depth + flipper_length + body_mass,
                    data=peng,
                    na.action=na.omit,  # not actually necessary: we removed NA
                    scale. = TRUE)

peng.pca

## Standard deviations (1, .., p=4):
## [1] 1.65944 0.87893 0.60435 0.32938
## 
## Rotation (n x k) = (4 x 4):
##                     PC1        PC2      PC3      PC4
## culmen_length   0.45525 -0.5970311 -0.64430  0.14552
## culmen_depth   -0.40033 -0.7977666  0.41843 -0.16799
## flipper_length  0.57601 -0.0022822  0.23208 -0.78380
## body_mass       0.54835 -0.0843629  0.59660  0.57988

How many dimensions are useful and necessary to portray the variation among penquins on these variables? It looks like two are sufficient. (The usual simple criterion is to look for the “elbow” in a screeplot.)

screeplot(peng.pca, type = "line", lwd=3, cex=3, 
        main="Variances of PCA Components")

Biplot

The results of a PCA can best be viewed as a biplot. This shows

the observations in the space of PC1 and PC2. Data ellipses for each color show the within-species variation.
the pengiun size variables are represented as vectors from the origin. The angles they make with the PC1 and PC2 axes reflect their correlations with the principal components.

In the call to ggbiplot,

groups = peng$species provides separate colors and point styles for the species.
ellipse = TRUE causes a data ellipse to be drawn for each species. This shows the within-species correlations of the observation scores on PC1 & PC2.
circle = TRUE draws a correlation circle, reflecting the fact that for all species combined, PC1 & PC2 are uncorrelated.

ggbiplot(peng.pca, obs.scale = 1, var.scale = 1,
         groups = peng$species, 
         ellipse = TRUE, circle = TRUE) +
  scale_color_discrete(name = 'Penguin Species') +
  theme_minimal() +
  theme(legend.direction = 'horizontal', legend.position = 'top')

From this, we can see:

These two principal components account for 68.8 + 19.3 = 88.1 % of the total variance of these four size variables.
PC1 is largely determined by flipper length and body mass. We can interpret this as an overall measure of penguin size.
On this dimension, Gentoos are the largest, by quite a lot, compared with Adelie and Chinstrap.
PC2 is mainly determined by variation in the two beak variables: culmen length and depth. Chinstrap are lower than the other two species on culmen length and depth, but culmen length further distinguishes the Gentoos. A penguin biologist could almost certainly provide an explanation, but I’ll call this beak shape.