Penguins are multivariate creatures! What can we learn from this dataset without specialized biological knowledge?
Here we examine a basic set of multivariate views of these pengiuns.
car::scatterplotMatrix
, GGally::ggpairs
)princomp()
) followed byggbiplot
), showing the observations in the space of the first two principal components (PC1, PC2) and variable vectors showing the correlations of the penguin quantitative variables with PC1 & PC2.This analysis gives a new perspective on this data set. We will see that there are simple multivariate explanations to the variation of pengiuns, both between species and within a given species. You can learn a lot about penguins from the comfort of your PC with a few simple techniques.
penguins
data and do some clean upThese examples use the basic penguins
dataset.
For this example:
peng <- penguins %>%
rename(
culmen_length = culmen_length_mm,
culmen_depth = culmen_depth_mm,
flipper_length = flipper_length_mm,
body_mass = body_mass_g
) %>%
mutate(species = as.factor(species),
island = as.factor(island),
sex = as.factor(substr(sex,1,1))) %>%
filter(!is.na(culmen_depth))
Take a peek at the tibble:
## # A tibble: 342 x 7
## species island culmen_length culmen_depth flipper_length body_mass sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torgersen 39.1 18.7 181 3750 M
## 2 Adelie Torgersen 39.5 17.4 186 3800 F
## 3 Adelie Torgersen 40.3 18 195 3250 F
## 4 Adelie Torgersen 36.7 19.3 193 3450 F
## 5 Adelie Torgersen 39.3 20.6 190 3650 M
## 6 Adelie Torgersen 38.9 17.8 181 3625 F
## 7 Adelie Torgersen 39.2 19.6 195 4675 M
## 8 Adelie Torgersen 34.1 18.1 193 3475 <NA>
## 9 Adelie Torgersen 42 20.2 190 4250 <NA>
## 10 Adelie Torgersen 37.8 17.1 186 3300 <NA>
## # ... with 332 more rows
For multivariate data, rather than making separate bivariate scatterplots a more more comprehensive view is given by a scatterplot matrix – all pairwise plots of a set of quantitative varibles.
In base R, this would be easily done with the pairs()
function.
But there are better alternatives:
car::scatterplotMatrix()
provides many options for enhancing the basic pairs()
plot.
in particular, in the model formula, x1 + x2 + ... | group
, the different groups (species
) are annotated and colored separately.
the diagonal panels show the univariate distributions for each group. This shows a smoothed density estimate for each group, and you can easily see the difference among species in terms of means, variablilty and shape.
each pairwise panel shows a data ellipse covering ~ 68% of the observations in each group.
GGally::ggpairs
extends the idea of pairs plots to include categorical variables along with quantitative ones.
scatterplotMatrix(~ culmen_length + culmen_depth + flipper_length + body_mass | species,
data=peng,
ellipse=list(levels=0.68),
col = scales::hue_pal()(3),
legend=list(coords="bottomright"))
From this, we can see some interesting penguin features:
Most of the distributions are approximately unimodal and symmetric within species. One exception is culmen_length
for Chinstrap pengiuns.
withi species, all pairs of variables are positively correlated, and the slopes of regression lines are approximately the same within each off-diagonal panel.
GGally::ggpairs
is designed to display mixtures of numeric and discrete variables with flexible choices of how variables are displayed. See the vignette [https://ggobi.github.io/ggally/articles/ggpairs.html] for some examples.
Briefly, by default, ggpairs
:
displays pairs of quantitative variables in the lower triangle as scatterplots, and the same pairs in the upper triangle with correlation coefficients, both overall and within group.
a pair of one quantitative and one discrete variable are shown either as color-coded histograms (lower triangle) or boxplots (upper triangle).
pairs of two discrete variables are shown with stacked bar charts.
ggpairs(peng, mapping = aes(color = species),
columns = c("culmen_length", "culmen_depth",
"flipper_length", "body_mass",
"island", "sex"))
This plot is somewhat messy, and could be tweaked further. I’ll just mention one other feature revealed here:
The upper triangle for the quantitative variable shows the overall correlations (ignoring species), and the within-group correlation for each species. There are three cases that illustrate Simpson’s Paradox: where the overall correlation has a different sign from the within-group correlations.
We can think of principal component analysis as a multivariate juicer: it tries to squeeze as much flavor out of a multivariate sample, when the criterion for “flavor” is the greatest amount of variance accounted for in 1, 2, 3, … dimensions (weighted sums of the variables.)
We run the PCA with prcomp()
. Because the variable scales are not all commensurable, it is important to scale them to equal variance.
peng.pca <- prcomp (~ culmen_length + culmen_depth + flipper_length + body_mass,
data=peng,
na.action=na.omit, # not actually necessary: we removed NA
scale. = TRUE)
peng.pca
## Standard deviations (1, .., p=4):
## [1] 1.65944 0.87893 0.60435 0.32938
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## culmen_length 0.45525 -0.5970311 -0.64430 0.14552
## culmen_depth -0.40033 -0.7977666 0.41843 -0.16799
## flipper_length 0.57601 -0.0022822 0.23208 -0.78380
## body_mass 0.54835 -0.0843629 0.59660 0.57988
How many dimensions are useful and necessary to portray the variation among penquins on these variables? It looks like two are sufficient. (The usual simple criterion is to look for the “elbow” in a screeplot.)
The results of a PCA can best be viewed as a biplot. This shows
In the call to ggbiplot
,
groups = peng$species
provides separate colors and point styles for the species.ellipse = TRUE
causes a data ellipse to be drawn for each species. This shows the within-species correlations of the observation scores on PC1 & PC2.circle = TRUE
draws a correlation circle, reflecting the fact that for all species combined, PC1 & PC2 are uncorrelated.ggbiplot(peng.pca, obs.scale = 1, var.scale = 1,
groups = peng$species,
ellipse = TRUE, circle = TRUE) +
scale_color_discrete(name = 'Penguin Species') +
theme_minimal() +
theme(legend.direction = 'horizontal', legend.position = 'top')
From this, we can see:
These two principal components account for 68.8 + 19.3 = 88.1 % of the total variance of these four size variables.
PC1 is largely determined by flipper length and body mass. We can interpret this as an overall measure of penguin size.
On this dimension, Gentoos are the largest, by quite a lot, compared with Adelie and Chinstrap.
PC2 is mainly determined by variation in the two beak variables: culmen length and depth. Chinstrap are lower than the other two species on culmen length and depth, but culmen length further distinguishes the Gentoos. A penguin biologist could almost certainly provide an explanation, but I’ll call this beak shape.