library(tidyverse)
library(vegan)
library(GGally)
library(corrplot)
 #Did not ultimately use corrplot package but kept for reference in future.
library(biotools)

Introduction

The data I am examining contain nine measurements of canine skulls from 5 related groups, separated further by sex (M/F). The group of particular interest to the study was prehistoric dogs from Thailand (whose sexes are unknown). Researchers wanted to investigate if skull shape was more or less similar between prehistoric dogs and modern groups.

Here, I investigate the question of whether canine groups differ in skull shape, and if so, does sex influence group similarities/differences? As the sexes of the prehistoric dogs are unknown, I will examine males and females from other groups to see if prehistoric dogs are more like one or the other.

Methods

Data Summary.

Data Structure.

First I loaded the data and examined the structure of variables.

# Loading data
dogs <- read.csv("Data/mandibles.csv", stringsAsFactors = TRUE)
# Summary statistics - mean, median, min, max, quartiles
summary(dogs)

##       Case            Group          G_name         X1            X2        
##  Min.   : 1.000   Min.   :1.000   cuons :17   Min.   :105   Min.   : 7.200  
##  1st Qu.: 4.000   1st Qu.:2.000   jack  :20   1st Qu.:114   1st Qu.: 8.700  
##  Median : 8.000   Median :3.000   m_dogs:16   Median :125   Median :10.000  
##  Mean   : 8.558   Mean   :2.766   p_dogs:10   Mean   :129   Mean   : 9.961  
##  3rd Qu.:12.000   3rd Qu.:4.000   wolves:14   3rd Qu.:137   3rd Qu.:10.900  
##  Max.   :20.000   Max.   :5.000               Max.   :177   Max.   :13.400  
##        X3              X4              X5              X6             X7       
##  Min.   :17.00   Min.   :15.00   Min.   :17.00   Min.   : 6.0   Min.   :26.00  
##  1st Qu.:19.00   1st Qu.:18.00   1st Qu.:19.00   1st Qu.: 7.1   1st Qu.:30.00  
##  Median :22.00   Median :22.00   Median :20.00   Median : 7.9   Median :31.00  
##  Mean   :21.95   Mean   :21.49   Mean   :20.49   Mean   : 8.0   Mean   :32.52  
##  3rd Qu.:25.00   3rd Qu.:24.00   3rd Qu.:22.00   3rd Qu.: 8.7   3rd Qu.:33.00  
##  Max.   :32.00   Max.   :28.00   Max.   :27.00   Max.   :10.5   Max.   :43.00  
##        X8             X9             Sex       
##  Min.   :31.0   Min.   :4.300   Min.   :0.000  
##  1st Qu.:34.0   1st Qu.:5.300   1st Qu.:1.000  
##  Median :36.0   Median :6.100   Median :1.000  
##  Mean   :37.4   Mean   :6.075   Mean   :1.286  
##  3rd Qu.:39.0   3rd Qu.:6.800   3rd Qu.:2.000  
##  Max.   :50.0   Max.   :8.500   Max.   :2.000

# Saving Sex as a factor instead of number
dogs$Sex <- as.factor(dogs$Sex)
# Structure of dataframe
str(dogs)

## 'data.frame':    77 obs. of  13 variables:
##  $ Case  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ G_name: Factor w/ 5 levels "cuons","jack",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ X1    : int  123 137 121 130 149 125 126 125 121 122 ...
##  $ X2    : num  10.1 9.6 10.2 10.7 12 9.5 9.1 9.7 9.6 8.9 ...
##  $ X3    : int  23 19 18 24 25 23 20 19 22 20 ...
##  $ X4    : int  23 22 21 22 25 20 22 19 20 20 ...
##  $ X5    : int  19 19 21 20 21 20 19 19 18 19 ...
##  $ X6    : num  7.8 7.8 7.9 7.9 8.4 7.8 7.5 7.5 7.6 7.6 ...
##  $ X7    : int  32 32 35 32 35 33 32 32 31 31 ...
##  $ X8    : int  33 40 38 37 43 37 35 37 35 35 ...
##  $ X9    : num  5.6 5.8 6.2 5.9 6.6 6.3 5.5 6.2 5.3 5.7 ...
##  $ Sex   : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 3 3 ...

There are nine response variables of various measurements, three groups for sex, and five animal groups, as follows:

Response Variables:

X1: length of mandible
X2: breadth of mandible below 1st molar
X3: breadth of articular condyle
X4: height of mandible below 1st molar
X5: length of 1st molar
X6: breadth of 1st molar
X7: length of 1st to 3rd molar inclusive (1st to 2nd for cuon)
X8: length from 1st to 4th premolar inclusive
X9: breadth of lower canine

Groups:

cuons: cuons, n = 17
jack: golden jackals, n = 20
m_dogs: modern Thai dogs, n = 16
p_dogs: prehistoric Thai dogs, n = 10
wolves: Indian wolves, n = 14

Sex:

Male
Female
Unknown (all Prehistoric Thai dog skulls are classified as unknown, as the sex of the animal is not certain).

Data Distribution and Correlation.

Canine Groups.

#dogs.corr <- cor(dogs[,-c(1:3,13)])
#corrplot(dogs.corr, type = "upper")
# Decided not to use corrplot here, as it would require a plot for each group, and instead I kept with the ggpairs default output. Kept here for reference in the future.

# Pairs plot of response variables, grouped by Canine Group
ggpairs(dogs[,-c(1:3,13)], aes(col = dogs$G_name))

Sex.

# Pairs plot of response variables, grouped by Sex
ggpairs(dogs[,-c(1:3,13)], aes(col = dogs$Sex))

The response variables are all roughly normally distributed when grouped by canine group (1) and by sex (2), as seen in the colored distribution plots (diagonals) and the elliptical shape of points on bivariate graphs (lower diagonals). There are a few bimodal exceptions, such as X7, but for the most part there are not any stark divergences from normality. Upper diagonal plots show the Pearson’s correlation coefficients among variables - all are highly correlated, with most r values between 0.3 and 0.9.

Our response variables being correlated is not surprising, as they are all measurements of different parts of canine jaws. The correlation justifies the use of multivariate methods to analyze similarities in jaw shape.

Principle Components Analysis (PCA).

These data have a large number of response variables that each describe similar characteristics (i.e., different measurements of skull shape) and are closely correlated with one another. A Principle Components Analysis (PCA) is a good way to visualize the data in fewer dimensions while still retaining the relevant information to distinguish groups. Since there are 9 variables, there will be 9 Principle Components, but the first 2 or 3 will likely explain the majority of the variation in the data.

I will begin by standardizing the data to z-scores and finding the covariance matrix of the standardized values (i.e., the correlation matrix of the untransformed data). Using the prcomp() function, I will use singular value decomposition to produce the eigenvectors (PCs) and eigenvalues (variance) of the covariance matrix. I will use the eigenvectors with the largest associated eigenvalues, as they explain the greatest proportion of the variance in the data. I will verify that the PCA summarizes the data appropriately by examining the screeplot, which shows the diminishing returns of variance explained by each additional principle component beyond the previous.

To determine how the PCs are loaded, I will examine the eigenvectors and assess the value and direction (positive or negative) of each variable. This provides information on which variables are contributing to each PC, and how they are interacting with other variables.

Finally, I will use the PC scores from the first 2 or 3 PCs to create ordination plots of the data to look for visual patterns among canine groups and sex.

If time permits, I will run a Null Hypothesis Significance Test (NHST) to determine skull shape differs among groups. If the data satisfy the assumptions, I will conduct a MANOVA - otherwise, I will use PERMANOVA.

NHST.

Since the data are approximately normal, I tested for equal dispersion with Box’s M tests for homogeneity of covariance matrices, with group and sex.

# Box's M
dogs.Mgroup <- boxM(data = dogs[,-c(1:3, 13)], group = dogs[,3])
dogs.Mgroup

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  dogs[, -c(1:3, 13)]
## Chi-Sq (approx.) = 263.21, df = 180, p-value = 5.168e-05

# Box's M
dogs.Msex <- boxM(data = dogs[,-c(1:3, 13)], group = dogs[,13])
dogs.Msex

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  dogs[, -c(1:3, 13)]
## Chi-Sq (approx.) = 151, df = 90, p-value = 6.016e-05

Both when grouped by sex and canine group, I rejected the null hypothesis of equal covariance matrices (sex: approx. \(\chi^2\)₁₈₀ = 263, p < 0.001, group: approx. \(\chi^2\)~90 = 151, p < 0.001). Thus, these data do not have equal dispersion and do not satisfy the assumptions of a MANOVA.

Results

PCA

dogs.pca <- prcomp(dogs[,-c(1:3,13)], scale = TRUE)
summary(dogs.pca)

## Importance of components:
##                           PC1     PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.6984 0.85411 0.5825 0.43686 0.39547 0.35487 0.29733
## Proportion of Variance 0.8091 0.08106 0.0377 0.02121 0.01738 0.01399 0.00982
## Cumulative Proportion  0.8091 0.89012 0.9278 0.94903 0.96641 0.98040 0.99022
##                            PC8     PC9
## Standard deviation     0.26061 0.14179
## Proportion of Variance 0.00755 0.00223
## Cumulative Proportion  0.99777 1.00000

dogs.pca$rotation

##          PC1         PC2        PC3         PC4         PC5         PC6
## X1 0.3590984  0.11112340 -0.2766445  0.01545183  0.24634764  0.06545618
## X2 0.3384491 -0.30564465  0.2480607  0.23413574 -0.34199021 -0.40428188
## X3 0.3151604 -0.29805054 -0.7387804 -0.01060306 -0.33142926  0.19600368
## X4 0.3224791 -0.42532066  0.1965448  0.57557569  0.28094851  0.14822172
## X5 0.3489865  0.13631684 -0.0506776 -0.39100267 -0.23240386 -0.32187746
## X6 0.3408897 -0.06638074  0.3896585 -0.39835483 -0.06578077  0.71816954
## X7 0.2767507  0.71556933  0.1048982  0.44845727 -0.37450607  0.13045374
## X8 0.3410511  0.28904061 -0.1740368 -0.02400614  0.65575048 -0.13037345
## X9 0.3496822 -0.07699699  0.2860297 -0.31667337  0.07054050 -0.34423324
##            PC7         PC8         PC9
## X1  0.14749751 -0.03899580  0.83287523
## X2  0.19611815 -0.59577647  0.04320702
## X3  0.16121648  0.11002899 -0.28208318
## X4 -0.37282094  0.32184798 -0.04134049
## X5 -0.73460738  0.06712074  0.04904065
## X6  0.01791935 -0.21401315 -0.05148032
## X7  0.08772005  0.15840767 -0.09587124
## X8  0.02845065 -0.34750789 -0.44799272
## X9  0.47573374  0.57591291 -0.09070875

A summary of the PCA shows the standard deviation and proportion of variance explained by each of the 9 principle components. PC1 explains a large majority of the variance in the dataset - 80.9% - and has roughly equal, positive loading for all 9 variables, with a range of 0.277 (X7) to 0.359 (X1).

The second PC explains a further 8.1% of the variance in skull shape but weights four variables positively (X1, X5, X7, and X8) and the other five negatively (X2, X3, X4, X6, and X9).

# Calculating proportion of variance
foo<- eigenvals(dogs.pca)/sum(eigenvals(dogs.pca))

# Screeplot
plot(foo, type = "l",
     ylab = "Proportion of Variance Explained",
     xlab = "Principle Component")
points(foo, pch = 21, bg = "white")

A scree plot of the proportion of variance explained by each principle component shows the diminishing returns as each principle component is included. The majority of variance can be shown with PC1, with a smaller portion explained by PC2, and the remaining 10% spread across PCs 3 through 9.

biplot(dogs.pca)

The biplot of PCs 1 and 2 visualizes how the loadings are weighted for each variable in the first two principle components. All variables are weighted together in the first PC to similar degrees, while in PC2, more contrast in some variable vs others is represented by the positive and negative loading of different variables.

dogs.pca$rotation[,1:2]

##          PC1         PC2
## X1 0.3590984  0.11112340
## X2 0.3384491 -0.30564465
## X3 0.3151604 -0.29805054
## X4 0.3224791 -0.42532066
## X5 0.3489865  0.13631684
## X6 0.3408897 -0.06638074
## X7 0.2767507  0.71556933
## X8 0.3410511  0.28904061
## X9 0.3496822 -0.07699699

The original datapoints can be translated into scores for the principle components using the eigenvector components as the coefficients for the PC score equations and the z-scores for each variable:

Eq. 2: \[Z_1 = 0.359*X1+0.338X_2+0.315X_3+0.322X_4+0.349X_5+0.341X_6+0.277X_7+0.341X_8+0.350X_9\]

For PC1, all variables are weighted positively and between 0.27 and 0.36. This reflects the strong correlation between our response variables, as one measurement of a jaw is likely to change in accordance with other jaw measurements.

Eq. 2: \[Z_2 = 0.111X_1-0.305X_2-0.298X_3-0.425X_4+0.136X_5-0.066X_6+0.716X_7+0.289X_8-0.077X_9\]

PC2 shows more contrast between our variables, loading four positively (length of mandible, length of 1st molar, length of 1st to 3rd molar, and length from 1st to 4th premolar) and five variables negatively (breadth of mandible, breadth of articular condyle, height of mandible, breadth of 1st molar, and breadth of lower canine). Therefore, PC2 is highlighting the variation between length measurements of the jaw and breadth/height measurements. Unlike PC1, not all variables are weighted equally, but X7 (length of 1st to 3rd premolar) and X4 (height of mandible) are more heavily loaded than other variables.

Ordination Plots

Canine Group + Sex

# Creating dataframe of PC1 and PC2 to plot.
dogs.plot.df <- data.frame(Group = dogs$G_name,
                           Sex = dogs$Sex,
                           PC1 = dogs.pca$x[,1],
                           PC2 = dogs.pca$x[,2])

# Constructing ordination plot.
dogs.plot <- dogs.plot.df |>
  ggplot(aes(x = PC1, y = PC2, color = Group, shape = Sex)) +
  geom_point()
dogs.plot

PCA plot for first two principle components (PC1 = 80.9% of variation, PC2 = 8.1% of variation). Cuons, jackals, and wolves appear quite distinct in skull shape from one another and from Thai dogs, grouping into distinct regions, but there is a fair amount of overlap for modern Thai dogs and prehistoric dogs. Jackals and wolves have the greatest contrast along PC1, which implies from the equation for PC1 scores (Eq. 1) that those two group differ most in overall skull shape, since PC1 loads all variables roughly evenly. Similarly, cuons differ the most from the other four groups long PC2, which highlights the differences between height and breadth measurements (Eq. 2). Within some species, most notably Indian wolves, sex has a slight difference in both PC1 and less so in PC2. A NHST would be necessary to determine if it these apparent differences are significant, as they are only slight patterns.

I created a separate plot colored by sex alone to better visualize any grouping across species.

Sex Only

dogs.plot2 <- dogs.plot.df |>
  ggplot(aes(x = PC1, y = PC2, color = Sex, shape = Sex)) +
  geom_point()
dogs.plot2

Male and female skull shapes do not appear to differ across all dog types nor within types. Prehistoric dogs (sex unknown) have a more localized range, reflecting the canine groups in the figure above, but do not seem more or less like male or female skulls.

Conclusion

Skull shape appears to cluster by animal group for cuons, jackals, Indian wolves, and Thai dogs, but there is strong overlap between modern Thai dogs and prehistoric Thai dogs. This suggests that prehistoric dogs are more like modern Thai dogs than the other groups in skull shape. There appears to be no evidence that prehistoric dogs are more like male or female skulls across groups or within groups.

A null hypothesis significance test will be needed to assess whether these groups are truly different from one another.

Reflection

I still need to beautify my plots and polish up my interpretations, but the process of making and understanding the PCA felt pretty straightforward. I did not have time to run through the NHST with sufficient explanation along the way, though I plan to include that for the final version. Ordination itself makes a lot of sense to me now.

Final: I did not get around to testing the differences between groups, though I saw in the peer reviews that others found significant differences between animal groups and sexes. I would be interested to follow this up with a pairwise test or Tukey HSD on PC1 for sex as a factor - because prehistoric dogs are all classified as a third unknown sex, I wonder if the significant result is from an actual difference between male and female skulls or just the difference between prehistoric dogs and the other four groups.

Honestly, I was also a bit disappointed to see no one peer reviewed my assignment… I had been looking forward to some varied feedback, both on my interpretations but also my assignment structure and appearance.

Assignment 5: PCA

Luke Ghallahorne

2024-02-22