This is a walkthrough of a geometric morphometric (GM) analysis using data from Wyatt et al. (2021). The workflow:
1. Collect specimen images
2. Landmark the images in tpsDig
3. Import the TPS file into R
4. Normalize the landmarks
5. Visualize variation
6. Determine classification accuracy
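Most of the steps below run in R. If you are starting fresh, the four CRAN packages used in this walkthrough can be installed up front (a one-time setup):

install.packages(c("geomorph", "ggplot2", "Morpho", "randomForest")) #all are used later in this walkthrough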
Depending on the scale of your specimens, a standard DSLR camera may or may not be appropriate. For small specimens, a DinoLite microscope camera works well.
There are multiple ways to adjust lighting:
- Use snake lights on the sides to eliminate shadows on large and complex specimens.
- Change the DinoLite light filter to illuminate the specimen without glare.
Using a medium-toned background can help add contrast between the specimen and the background; brown on brown is much harder to photograph than brown on blue.
At a minimum, include the museum specimen number in the file name. I like to include additional identifying information: for this study I named all images as "capitalized first letter of genus - species name - museum number," which looks like "Cbaileyi123456". This is helpful because the file name is preserved as raw data and can aid troubleshooting in R.
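Because the file name travels with the data, you can recover its pieces later in R. A minimal sketch (the example IDs and regular expressions here are illustrative, not from the study files):

ids <- c("Cbaileyi123456", "Pflavus98765") #hypothetical example IDs
genus_initial <- substr(ids, 1, 1) #"C", "P"
species <- gsub("^.|[0-9]+$", "", ids) #strip the genus initial and trailing digits
museum_no <- gsub("[^0-9]", "", ids) #keep only the museum number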
Make sure to organize your images so that each folder of images corresponds to one TPS file.
Open tpsUtil64, go to "Select operation," and choose "Build tps file from images."
Select the folder of images as your input and choose a name for your new output file.
Now the "Setup" tab should be available. Go to Setup -> Include all -> Include path -> Create.
Tip: you can continue to edit that TPS file to randomly order your specimens, which reduces observer bias while landmarking. Select "Randomly order specimens" on the TPS file you just made. After you complete landmarking, select "Restore original order" to return the landmarked specimens to alphabetical order.
Open tpsDig and go to File -> Input source, then select the TPS file you just created.
The bottom of the window will show the image name, number of specimens in this file, number of landmarks and scale.
To place landmarks, click the blue target icon at the top of the screen; you can then click anywhere on the image to place a landmark. Tip: go to Options and enable "Label landmarks" and "Warn if landmark # is not the same."
To place semi-landmarks, click the yellow pencil icon and click to start a curve. Once you have placed enough points to represent the curvature, right-click on the curve and go to Resample curve -> Number of points (whatever you've decided) -> By length.
To set the scale, open the image tools menu (the icon with a wrench, hammer, and screwdriver), select Measure -> Set scale, click two points on your scale bar, enter the reference length (the length you just measured), and select OK. If you did this correctly, an S=(some number) will appear at the bottom of the image to represent the scale.
library(geomorph)
allLandsU <- readland.tps("PCupperLM updated.txt", specID = "ID", warnmsg = F) #upper molars
print(allLandsU[,,1:2]) #print first two specimens
## , , 111249
##
## [,1] [,2]
## [1,] 5.010528 6.749379
## [2,] 4.450140 6.254919
## [3,] 4.664406 5.818146
## [4,] 5.167107 5.661567
## [5,] 5.925279 5.785182
## [6,] 5.389614 6.395016
## [7,] 4.219392 6.864753
## [8,] 3.988644 5.884074
## [9,] 3.988644 6.650487
## [10,] 3.749655 6.922440
## [11,] 3.626040 5.925279
## [12,] 3.205749 6.905958
## [13,] 3.106857 5.941761
## [14,] 2.999724 6.666969
## [15,] 2.834904 6.889476
## [16,] 2.736012 6.007689
## [17,] 2.208588 6.790584
## [18,] 2.389890 6.123063
## [19,] 1.994322 6.683451
## [20,] 2.076732 6.131304
##
## , , 137333
##
## [,1] [,2]
## [1,] 5.464087 6.836198
## [2,] 5.033780 6.438367
## [3,] 5.131208 6.016179
## [4,] 5.772609 5.691419
## [5,] 6.462724 5.983703
## [6,] 6.137964 6.519557
## [7,] 4.863281 6.966102
## [8,] 4.603473 5.999941
## [9,] 4.619711 6.803722
## [10,] 4.303070 7.039173
## [11,] 4.246237 6.097369
## [12,] 3.815930 7.104125
## [13,] 3.710383 6.202916
## [14,] 3.637312 6.876793
## [15,] 3.401861 7.071649
## [16,] 3.409980 6.243511
## [17,] 2.760460 6.966102
## [18,] 3.150172 6.511438
## [19,] 2.581842 6.779365
## [20,] 2.930959 6.292225
In the resulting array, each specimen is indexed by its ID (taken from the TPS file), with one row per landmark and the x,y coordinates as columns.
Note: TPS files are prone to becoming corrupted, so I converted mine into plain text (.txt) files. The TPS formatting is retained, and R reads them the same way.
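It can help to peek at the raw text before parsing to confirm the TPS structure survived the conversion (a quick check; showing 8 lines is arbitrary):

cat(head(readLines("PCupperLM updated.txt"), 8), sep = "\n") #expect LM=, coordinate pairs, IMAGE=, ID=, SCALE= lines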
I recommend also making a separate CSV file that links each TPS specimen ID to its genus and species identities for grouping later.
all.specimens <- read.csv("SpecID.csv")
genusID <- all.specimens$Genus
speciesID <- all.specimens$Species
head(all.specimens)
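Because the grouping vectors are matched to specimens by position, it is worth confirming that the CSV rows line up with the landmark array (a sketch, assuming your CSV has an ID column; trimws() guards against stray spaces in the TPS IDs):

all(trimws(dimnames(allLandsU)[[3]]) == trimws(all.specimens$ID)) #should return TRUE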
We use a Generalized Procrustes Analysis (GPA) to normalize the landmarks. In the plot below, the black dots mark the mean (consensus) position of each landmark and the grey dots show the aligned landmarks of every specimen.
#Procrustes Analysis on Perognathus/Chaetodipus Upper Tooth Row
procAllup <- gpagen(allLandsU, print.progress = F)
plotAllSpecimens(procAllup$coords, label= TRUE)
Note: even if you specified a scale in your TPS file, this step removes scale from each specimen. However, a centroid size is listed as Csize in the output. This is not the original specimen size per se, but rather the square root of the sum of the squared distances of each landmark to the configuration's centroid.
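You can confirm that definition by computing centroid size by hand for one specimen (a sketch; the result should match procAllup$Csize[1] up to rounding):

lm1 <- allLandsU[, , 1] #raw landmarks for the first specimen
ctr <- colMeans(lm1) #the configuration's centroid (mean x, mean y)
sqrt(sum(sweep(lm1, 2, ctr)^2)) #root of the summed squared distances to the centroid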
str(procAllup, list.len = 2) #see the first 2 attributes of the data
## List of 15
## $ coords : num [1:20, 1:2, 1:163] 0.231 0.142 0.192 0.288 0.423 ...
## ..- attr(*, "dimnames")=List of 3
## .. ..$ : chr [1:20] "1" "2" "3" "4" ...
## .. ..$ : chr [1:2] "X" "Y"
## .. .. [list output truncated]
## $ Csize : Named num [1:163] 5.44 5.3 5.39 5.01 5.51 ...
## ..- attr(*, "names")= chr [1:163] "111249 " "137333" "141932 " "147286 " ...
## [list output truncated]
## - attr(*, "class")= chr "gpagen"
#Principal Components Analysis
pca.lands.allUp <- gm.prcomp(procAllup$coords)
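The proportion of variance explained by each axis, which supplies the percentages for the axis labels below, comes from the summary of the PCA object:

summary(pca.lands.allUp) #proportion of variance per component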
It is common to visualize the PCA by plotting PC1 on the x-axis and PC2 on the y-axis.
library(ggplot2)
#dataframe for ggplot
pca.lands.allUp.scores <- as.data.frame(pca.lands.allUp$x)
pca.lands.allUp.scores$genus <- genusID #attach the grouping variable from the CSV
#plot PC1 and PC2 using ggplot, coloring points by genus
ggplot(data = pca.lands.allUp.scores, aes(x = Comp1, y = Comp2, fill = genus)) +
geom_point(shape = 21, colour = "black", size = 3, stroke = 0.5, show.legend = F) +
scale_fill_manual(values = c(Chaetodipus = "blue", Perognathus = "red")) +
xlab("PC1 (17%)") +
ylab("PC2 (15%)") +
theme_classic() +
theme(axis.text=element_text(size=14), axis.title=element_text(size=14, face="bold"),
plot.title=element_text(hjust=0.5, size=18, face="bold"))+
ggtitle("PCA Upper Tooth Row")
As the plot stands, you cannot describe the morphological variation along each axis. We can plot the theoretical shapes at the minimum and maximum of each axis.
par(mfrow=c(2,2))
#print PC min and max for illustrator plotting
ref <- mshape(procAllup$coords)
GP <- gridPar(pt.bg = "black", pt.size = 1, tar.pt.size = 1, grid.col = "gray60")
plotRefToTarget(ref,pca.lands.allUp$shapes$shapes.comp1$min, gridPars=GP)
plotRefToTarget(ref,pca.lands.allUp$shapes$shapes.comp1$max, gridPars=GP)
plotRefToTarget(ref,pca.lands.allUp$shapes$shapes.comp2$min, gridPars=GP)
plotRefToTarget(ref,pca.lands.allUp$shapes$shapes.comp2$max, gridPars=GP)
Then, using either Illustrator or PowerPoint, you can arrange those theoretical shapes along the axes.
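To get editable vector graphics rather than screenshots, you can write the same four panels to a PDF first (a sketch; the file name and page size are placeholders):

pdf("PC_extreme_shapes.pdf", width = 8, height = 8) #hypothetical file name
par(mfrow = c(2, 2))
plotRefToTarget(ref, pca.lands.allUp$shapes$shapes.comp1$min, gridPars = GP)
plotRefToTarget(ref, pca.lands.allUp$shapes$shapes.comp1$max, gridPars = GP)
plotRefToTarget(ref, pca.lands.allUp$shapes$shapes.comp2$min, gridPars = GP)
plotRefToTarget(ref, pca.lands.allUp$shapes$shapes.comp2$max, gridPars = GP)
dev.off()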
If the goal of your study involves assessing whether shape can predict a biological attribute, then a discriminant function analysis (DFA) is the next step. The Canonical Variates Analysis (CVA) is a type of DFA used to determine the classification accuracy of shape.
A PCA produces many axes, but most capture noise and carry little classification signal. One quick criterion (a rough stand-in for a formal broken-stick model) is to retain only the axes that explain more variance than expected if every axis contributed equally; with 20 two-dimensional landmarks, GPA leaves 36 shape dimensions, so the cutoff here is 1/36.
pov <- (pca.lands.allUp$sdev^2)/sum(pca.lands.allUp$sdev^2) #proportion of variance per PC
plot(pov)
abline(h=1/36) #equal-variance cutoff: 36 shape dimensions remain after GPA
SigPC <- 9 #number of PCs above the cutoff; edit for each dataset
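Rather than reading the count off the plot, the same cutoff can be applied programmatically (a sketch using the equal-variance criterion above):

SigPC <- sum(pov > 1/length(pov)) #PCs explaining more than an equal share of variance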
Now we can input the significant principal component axes into the CVA to classify either genus or species. The output gives the classification accuracy.
library(Morpho)
cva.PCupG <- CVA(dataarray = pca.lands.allUp$x[, 1:SigPC], groups = genusID, weighting = T,
plot = F, rounds = 100, cv = TRUE, p.adjust.method = "none", robust = "classical")
cva.PCupG
## cross-validated classification results in frequencies
##
## Chaetodipus Perognathus
## Chaetodipus 70 13
## Perognathus 15 65
##
##
## cross-validated classification result in %
##
## Chaetodipus Perognathus
## Chaetodipus 84.337 15.663
## Perognathus 18.750 81.250
##
##
## overall classification accuracy: 82.82209 %
##
## Kappa statistic: 0.65617
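With only two groups, the CVA produces a single canonical axis, so you can visualize the separation directly (a sketch, assuming the scores are stored in $CVscores as in current versions of Morpho):

cvs <- cva.PCupG$CVscores[, 1] #scores on the single canonical axis
plot(density(cvs[genusID == "Chaetodipus"]), col = "blue", main = "CV1 scores by genus")
lines(density(cvs[genusID == "Perognathus"]), col = "red")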
Centroid size is not included in the PCA or CVA. If you are interested in using size as a covariate for classification, you can create a new dataframe with both shape and size.
#Function from Stephanie Smith (Smith and Wilson 2017)
size.pca <- function(A, Csize) { #A: Procrustes coordinate array; Csize: centroid sizes from gpagen
x <- two.d.array(A) #flatten the p x k x n landmark array to an n x (p*k) matrix
B <- as.matrix(Csize) #centroid size as a one-column matrix
C <- cbind(B, x) #prepend size to the shape variables
pc.res <- prcomp(C) #ordinary PCA on size + shape together
return(pc.res)}
##PCA informed with size
PC.size.up <- size.pca(procAllup$coords,procAllup$Csize)
cva.size.UPg <- CVA(dataarray = PC.size.up$x[, 1:SigPC], groups = genusID, weighting = T,
plot = F, rounds = 100, cv = TRUE, p.adjust.method = "none", robust = "classical")
cva.size.UPg
## cross-validated classification results in frequencies
##
## Chaetodipus Perognathus
## Chaetodipus 78 5
## Perognathus 16 64
##
##
## cross-validated classification result in %
##
## Chaetodipus Perognathus
## Chaetodipus 93.9759 6.0241
## Perognathus 20.0000 80.0000
##
##
## overall classification accuracy: 87.11656 %
##
## Kappa statistic: 0.7416
Another method is to run a random forest analysis instead of a CVA.
library(randomForest)
#export the PC scores to a CSV first, e.g. write.table(data, file = "data.csv", sep = ","), and add a GenusID column before reading it back in
forest.up <- read.csv("pca.lands.allUp.scores.csv", header = T, stringsAsFactors = F)
forest.up$GenusID <- factor(forest.up$GenusID) #randomForest needs a factor response for classification
forest.up.results <- randomForest(GenusID ~ Comp1 + Comp2 + Comp3 + Comp4 + Comp5 + Comp6 + Comp7 + Comp8 + Comp9, data = forest.up, importance = T, ntree = 100, proximity = T)
forest.up.results
##
## Call:
## randomForest(formula = GenusID ~ Comp1 + Comp2 + Comp3 + Comp4 + Comp5 + Comp6 + Comp7 + Comp8 + Comp9, data = forest.up, importance = T, ntree = 100, proximity = T)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 23.93%
## Confusion matrix:
## Chaetodipus Perognathus class.error
## Chaetodipus 65 18 0.2168675
## Perognathus 21 59 0.2625000
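Because the model was fit with importance = T, you can also ask which PC axes drive the classification (output not shown):

varImpPlot(forest.up.results) #mean decrease in accuracy and in Gini per predictor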
Then similarly with size as a covariate.
forest.size.up <- read.csv("PC.size.up$x.csv", header = T, stringsAsFactors = F)
forest.size.up$GenusID <- factor(forest.size.up$GenusID) #again, the response must be a factor
forest.size.up.results <- randomForest(GenusID ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9, data = forest.size.up, importance = T, ntree = 100, proximity = T)
forest.size.up.results
##
## Call:
## randomForest(formula = GenusID ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9, data = forest.size.up, importance = T, ntree = 100, proximity = T)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 13.5%
## Confusion matrix:
## Chaetodipus Perognathus class.error
## Chaetodipus 74 9 0.1084337
## Perognathus 13 67 0.1625000
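To compare the random forest directly with the CVA's overall classification accuracy, the final out-of-bag error can be converted to an accuracy (a sketch using the stored error-rate matrix):

oob <- forest.size.up.results$err.rate[forest.size.up.results$ntree, "OOB"] #final OOB error
(1 - oob) * 100 #overall accuracy in %, comparable to the CVA output above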