Penguin data: Multivariate EDA

Michael Friendly

Penguins are multivariate creatures! What can we learn from this dataset without specialized biological knowledge?

Here we examine a basic set of multivariate views of these pengiuns.

This analysis gives a new perspective on this data set. We will see that there are simple multivariate explanations to the variation of pengiuns, both between species and within a given species. You can learn a lot about penguins from the comfort of your PC with a few simple techniques.

Load the penguins data and do some clean up

These examples use the basic penguins dataset.

data(penguins, package="palmerpenguins")

For this example:

peng <- penguins %>%
    rename(
         culmen_length = culmen_length_mm, 
         culmen_depth = culmen_depth_mm, 
         flipper_length = flipper_length_mm, 
         body_mass = body_mass_g
         ) %>%
  mutate(species = as.factor(species),
         island = as.factor(island),
         sex = as.factor(substr(sex,1,1))) %>%
  filter(!is.na(culmen_depth))

Take a peek at the tibble:

peng
## # A tibble: 342 x 7
##    species island    culmen_length culmen_depth flipper_length body_mass sex  
##    <fct>   <fct>             <dbl>        <dbl>          <int>     <int> <fct>
##  1 Adelie  Torgersen          39.1         18.7            181      3750 M    
##  2 Adelie  Torgersen          39.5         17.4            186      3800 F    
##  3 Adelie  Torgersen          40.3         18              195      3250 F    
##  4 Adelie  Torgersen          36.7         19.3            193      3450 F    
##  5 Adelie  Torgersen          39.3         20.6            190      3650 M    
##  6 Adelie  Torgersen          38.9         17.8            181      3625 F    
##  7 Adelie  Torgersen          39.2         19.6            195      4675 M    
##  8 Adelie  Torgersen          34.1         18.1            193      3475 <NA> 
##  9 Adelie  Torgersen          42           20.2            190      4250 <NA> 
## 10 Adelie  Torgersen          37.8         17.1            186      3300 <NA> 
## # ... with 332 more rows

Load our packages

library(car)
library(ggbiplot)
library(GGally)

Scatterplot matrix

For multivariate data, rather than making separate bivariate scatterplots a more more comprehensive view is given by a scatterplot matrix – all pairwise plots of a set of quantitative varibles.

In base R, this would be easily done with the pairs() function.

But there are better alternatives:

scatterplotMatrix(~ culmen_length + culmen_depth + flipper_length + body_mass | species,
                  data=peng,
                  ellipse=list(levels=0.68),
                  col = scales::hue_pal()(3),
                  legend=list(coords="bottomright"))

From this, we can see some interesting penguin features:

ggpairs

GGally::ggpairs is designed to display mixtures of numeric and discrete variables with flexible choices of how variables are displayed. See the vignette [https://ggobi.github.io/ggally/articles/ggpairs.html] for some examples.

Briefly, by default, ggpairs :

This plot is somewhat messy, and could be tweaked further. I’ll just mention one other feature revealed here:

The upper triangle for the quantitative variable shows the overall correlations (ignoring species), and the within-group correlation for each species. There are three cases that illustrate Simpson’s Paradox: where the overall correlation has a different sign from the within-group correlations.

PCA

We can think of principal component analysis as a multivariate juicer: it tries to squeeze as much flavor out of a multivariate sample, when the criterion for “flavor” is the greatest amount of variance accounted for in 1, 2, 3, … dimensions (weighted sums of the variables.)

We run the PCA with prcomp(). Because the variable scales are not all commensurable, it is important to scale them to equal variance.

peng.pca <- prcomp (~ culmen_length + culmen_depth + flipper_length + body_mass,
                    data=peng,
                    na.action=na.omit,  # not actually necessary: we removed NA
                    scale. = TRUE)

peng.pca
## Standard deviations (1, .., p=4):
## [1] 1.65944 0.87893 0.60435 0.32938
## 
## Rotation (n x k) = (4 x 4):
##                     PC1        PC2      PC3      PC4
## culmen_length   0.45525 -0.5970311 -0.64430  0.14552
## culmen_depth   -0.40033 -0.7977666  0.41843 -0.16799
## flipper_length  0.57601 -0.0022822  0.23208 -0.78380
## body_mass       0.54835 -0.0843629  0.59660  0.57988

How many dimensions are useful and necessary to portray the variation among penquins on these variables? It looks like two are sufficient. (The usual simple criterion is to look for the “elbow” in a screeplot.)

screeplot(peng.pca, type = "line", lwd=3, cex=3, 
        main="Variances of PCA Components")

Biplot

The results of a PCA can best be viewed as a biplot. This shows

In the call to ggbiplot,

From this, we can see: