This is a walkthrough of a geometric morphometric (GM) analysis using data from Wyatt et al. (2021). The workflow:
1. Collect specimen images
2. Landmark the images in tpsDig
3. Import the TPS file into R
4. Normalize the landmarks
5. Visualize variation
6. Determine classification accuracy
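Most of the steps below run in R. If you are starting fresh, the four CRAN packages used in this walkthrough can be installed up front (a one-time setup):

install.packages(c("geomorph", "ggplot2", "Morpho", "randomForest")) #all are used later in this walkthrough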
Depending on the scale of your specimens, a standard DSLR camera may or may not be appropriate. For small specimens, a DinoLite microscope camera works well.
There are multiple ways to adjust lighting:
- Use snake lights on the sides to eliminate shadows on large and complex specimens.
- Change the DinoLite light filter to illuminate the specimen without glare.
Using a medium-toned background can help add contrast between the specimen and the background; brown on brown is much harder to photograph than brown on blue.
At a minimum, include the museum specimen number in the file name. I like to include additional identifying information: for this study I named all images as "capitalized first letter of genus - species name - museum number," which looks like "Cbaileyi123456". This is helpful because the file name is preserved as raw data and can aid troubleshooting in R.
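Because the file name travels with the data, you can recover its pieces later in R. A minimal sketch (the example IDs and regular expressions here are illustrative, not from the study files):

ids <- c("Cbaileyi123456", "Pflavus98765") #hypothetical example IDs
genus_initial <- substr(ids, 1, 1) #"C", "P"
species <- gsub("^.|[0-9]+$", "", ids) #strip the genus initial and trailing digits
museum_no <- gsub("[^0-9]", "", ids) #keep only the museum number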
Make sure to organize your images so that each folder of images corresponds to one TPS file.
Open tpsUtil64, go to "Select operation," and choose "Build tps file from images."
Select the folder of images as your input and choose a name for your new output file.
Now the "Setup" tab should be available. Go to Setup -> Include all -> Include path -> Create.
Tip: you can continue to edit that TPS file to randomly order your specimens, which reduces observer bias while landmarking. Select "Randomly order specimens" on the TPS file you just made. After you complete landmarking, select "Restore original order" to return the landmarked specimens to alphabetical order.
Open tpsDig and go to File -> Input source, then select the TPS file you just created.
The bottom of the window will show the image name, number of specimens in this file, number of landmarks and scale.
To place landmarks, click the blue target icon at the top of the screen; you can then click anywhere on the image to place a landmark. Tip: go to Options and enable "Label landmarks" and "Warn if landmark # is not the same."
To place semi-landmarks, click the yellow pencil icon and click to start a curve. Once you have placed enough points to represent the curvature, right-click on the curve and go to Resample curve -> Number of points (whatever you've decided) -> By length.
To set the scale, open the image tools menu (the icon with a wrench, hammer, and screwdriver), select Measure -> Set scale, click two points on your scale bar, enter the reference length (the length you just measured), and select OK. If you did this correctly, an S=(some number) will appear at the bottom of the image to represent the scale.
library(geomorph)
allLandsU <- readland.tps("PCupperLM updated.txt", specID = "ID", warnmsg = F) #upper molars
print(allLandsU[,,1:2]) #print first two specimens
## , , 111249
##
## [,1] [,2]
## [1,] 5.010528 6.749379
## [2,] 4.450140 6.254919
## [3,] 4.664406 5.818146
## [4,] 5.167107 5.661567
## [5,] 5.925279 5.785182
## [6,] 5.389614 6.395016
## [7,] 4.219392 6.864753
## [8,] 3.988644 5.884074
## [9,] 3.988644 6.650487
## [10,] 3.749655 6.922440
## [11,] 3.626040 5.925279
## [12,] 3.205749 6.905958
## [13,] 3.106857 5.941761
## [14,] 2.999724 6.666969
## [15,] 2.834904 6.889476
## [16,] 2.736012 6.007689
## [17,] 2.208588 6.790584
## [18,] 2.389890 6.123063
## [19,] 1.994322 6.683451
## [20,] 2.076732 6.131304
##
## , , 137333
##
## [,1] [,2]
## [1,] 5.464087 6.836198
## [2,] 5.033780 6.438367
## [3,] 5.131208 6.016179
## [4,] 5.772609 5.691419
## [5,] 6.462724 5.983703
## [6,] 6.137964 6.519557
## [7,] 4.863281 6.966102
## [8,] 4.603473 5.999941
## [9,] 4.619711 6.803722
## [10,] 4.303070 7.039173
## [11,] 4.246237 6.097369
## [12,] 3.815930 7.104125
## [13,] 3.710383 6.202916
## [14,] 3.637312 6.876793
## [15,] 3.401861 7.071649
## [16,] 3.409980 6.243511
## [17,] 2.760460 6.966102
## [18,] 3.150172 6.511438
## [19,] 2.581842 6.779365
## [20,] 2.930959 6.292225
In the resulting array, each specimen is indexed by its ID (taken from the TPS file), with one row per landmark and the x,y coordinates as columns.
Note: TPS files are prone to becoming corrupted, so I converted mine into plain text (.txt) files. The TPS formatting is retained, and R reads them the same way.
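It can help to peek at the raw text before parsing to confirm the TPS structure survived the conversion (a quick check; showing 8 lines is arbitrary):

cat(head(readLines("PCupperLM updated.txt"), 8), sep = "\n") #expect LM=, coordinate pairs, IMAGE=, ID=, SCALE= lines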
I recommend also making a separate CSV file that links each TPS specimen ID to its genus and species identities for grouping later.
all.specimens <- read.csv("SpecID.csv")
genusID <- all.specimens$Genus
speciesID <- all.specimens$Species
head(all.specimens)
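Because the grouping vectors are matched to specimens by position, it is worth confirming that the CSV rows line up with the landmark array (a sketch, assuming your CSV has an ID column; trimws() guards against stray spaces in the TPS IDs):

all(trimws(dimnames(allLandsU)[[3]]) == trimws(all.specimens$ID)) #should return TRUE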
We use a Generalized Procrustes Analysis (GPA) to normalize the landmarks. In the plot below, the black dots mark the mean (consensus) position of each landmark and the grey dots show the aligned landmarks of every specimen.
#Procrustes Analysis on Perognathus/Chaetodipus Upper Tooth Row
procAllup <- gpagen(allLandsU, print.progress = F)
plotAllSpecimens(procAllup$coords, label= TRUE)
Note: even if you specified a scale in your TPS file, this step removes scale from each specimen. However, a centroid size is listed as Csize in the output. This is not the original specimen size per se, but rather the square root of the sum of the squared distances of each landmark to the configuration's centroid.
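You can confirm that definition by computing centroid size by hand for one specimen (a sketch; the result should match procAllup$Csize[1] up to rounding):

lm1 <- allLandsU[, , 1] #raw landmarks for the first specimen
ctr <- colMeans(lm1) #the configuration's centroid (mean x, mean y)
sqrt(sum(sweep(lm1, 2, ctr)^2)) #root of the summed squared distances to the centroid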
str(procAllup, list.len = 2) #see the first 2 attributes of the data
## List of 15
## $ coords : num [1:20, 1:2, 1:163] 0.231 0.142 0.192 0.288 0.423 ...
## ..- attr(*, "dimnames")=List of 3
## .. ..$ : chr [1:20] "1" "2" "3" "4" ...
## .. ..$ : chr [1:2] "X" "Y"
## .. .. [list output truncated]
## $ Csize : Named num [1:163] 5.44 5.3 5.39 5.01 5.51 ...
## ..- attr(*, "names")= chr [1:163] "111249 " "137333" "141932 " "147286 " ...
## [list output truncated]
## - attr(*, "class")= chr "gpagen"
#Principal Components Analysis
pca.lands.allUp <- gm.prcomp(procAllup$coords)
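The proportion of variance explained by each axis, which supplies the percentages for the axis labels below, comes from the summary of the PCA object:

summary(pca.lands.allUp) #proportion of variance per component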
It is common to visualize the PCA by plotting PC1 on the x-axis and PC2 on the y-axis.
library(ggplot2)
#dataframe for ggplot
pca.lands.allUp.scores <- as.data.frame(pca.lands.allUp$x)
pca.lands.allUp.scores$genus <- genusID #attach the grouping variable from the CSV
#plot PC1 and PC2 using ggplot, coloring points by genus
ggplot(data = pca.lands.allUp.scores, aes(x = Comp1, y = Comp2, fill = genus)) +
geom_point(shape = 21, colour = "black", size = 3, stroke = 0.5, show.legend = F) +
scale_fill_manual(values = c(Chaetodipus = "blue", Perognathus = "red")) +
xlab("PC1 (17%)") +
ylab("PC2 (15%)") +
theme_classic() +
theme(axis.text=element_text(size=14), axis.title=element_text(size=14, face="bold"),
plot.title=element_text(hjust=0.5, size=18, face="bold"))+
ggtitle("PCA Upper Tooth Row")
As the plot stands, you cannot describe the morphological variation along each axis. We can plot the theoretical shapes at the minimum and maximum of each axis.
par(mfrow=c(2,2))
#print PC min and max for illustrator plotting
ref <- mshape(procAllup$coords)
GP <- gridPar(pt.bg = "black", pt.size = 1, tar.pt.size = 1, grid.col = "gray60")
plotRefToTarget(ref,pca.lands.allUp$shapes$shapes.comp1$min, gridPars=GP)
plotRefToTarget(ref,pca.lands.allUp$shapes$shapes.comp1$max, gridPars=GP)
plotRefToTarget(ref,pca.lands.allUp$shapes$shapes.comp2$min, gridPars=GP)
plotRefToTarget(ref,pca.lands.allUp$shapes$shapes.comp2$max, gridPars=GP)
Then, using either Illustrator or PowerPoint, you can arrange those theoretical shapes along the axes.
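To get editable vector graphics rather than screenshots, you can write the same four panels to a PDF first (a sketch; the file name and page size are placeholders):

pdf("PC_extreme_shapes.pdf", width = 8, height = 8) #hypothetical file name
par(mfrow = c(2, 2))
plotRefToTarget(ref, pca.lands.allUp$shapes$shapes.comp1$min, gridPars = GP)
plotRefToTarget(ref, pca.lands.allUp$shapes$shapes.comp1$max, gridPars = GP)
plotRefToTarget(ref, pca.lands.allUp$shapes$shapes.comp2$min, gridPars = GP)
plotRefToTarget(ref, pca.lands.allUp$shapes$shapes.comp2$max, gridPars = GP)
dev.off()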
If the goal of your study involves assessing whether shape can predict a biological attribute, then a discriminant function analysis (DFA) is the next step. The Canonical Variates Analysis (CVA) is a type of DFA used to determine the classification accuracy of shape.
A PCA produces many axes, but most capture noise and carry little classification signal. One quick criterion (a rough stand-in for a formal broken-stick model) is to retain only the axes that explain more variance than expected if every axis contributed equally; with 20 two-dimensional landmarks, GPA leaves 36 shape dimensions, so the cutoff here is 1/36.
pov <- (pca.lands.allUp$sdev^2)/sum(pca.lands.allUp$sdev^2) #proportion of variance per PC
plot(pov)
abline(h=1/36) #equal-variance cutoff: 36 shape dimensions remain after GPA
SigPC <- 9 #number of PCs above the cutoff; edit for each dataset
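Rather than reading the count off the plot, the same cutoff can be applied programmatically (a sketch using the equal-variance criterion above):

SigPC <- sum(pov > 1/length(pov)) #PCs explaining more than an equal share of variance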
Now we can input the significant principal component axes into the CVA to classify either genus or species. The output gives the classification accuracy.
library(Morpho)
cva.PCupG <- CVA(dataarray = pca.lands.allUp$x[, 1:SigPC], groups = genusID, weighting = T,
plot = F, rounds = 100, cv = TRUE, p.adjust.method = "none", robust = "classical")
cva.PCupG
## cross-validated classification results in frequencies
##
## Chaetodipus Perognathus
## Chaetodipus 70 13
## Perognathus 15 65
##
##
## cross-validated classification result in %
##
## Chaetodipus Perognathus
## Chaetodipus 84.337 15.663
## Perognathus 18.750 81.250
##
##
## overall classification accuracy: 82.82209 %
##
## Kappa statistic: 0.65617
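With only two groups, the CVA produces a single canonical axis, so you can visualize the separation directly (a sketch, assuming the scores are stored in $CVscores as in current versions of Morpho):

cvs <- cva.PCupG$CVscores[, 1] #scores on the single canonical axis
plot(density(cvs[genusID == "Chaetodipus"]), col = "blue", main = "CV1 scores by genus")
lines(density(cvs[genusID == "Perognathus"]), col = "red")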
Centroid size is not included in the PCA or CVA. If you are interested in using size as a covariate for classification, you can create a new dataframe with both shape and size.
#Function from Stephanie Smith (Smith and Wilson 2017)
size.pca <- function(A, Csize) { #A: Procrustes coordinate array; Csize: centroid sizes from gpagen
x <- two.d.array(A) #flatten the p x k x n landmark array to an n x (p*k) matrix
B <- as.matrix(Csize) #centroid size as a one-column matrix
C <- cbind(B, x) #prepend size to the shape variables
pc.res <- prcomp(C) #ordinary PCA on size + shape together
return(pc.res)}
##PCA informed with size
PC.size.up <- size.pca(procAllup$coords,procAllup$Csize)
cva.size.UPg <- CVA(dataarray = PC.size.up$x[, 1:SigPC], groups = genusID, weighting = T,
plot = F, rounds = 100, cv = TRUE, p.adjust.method = "none", robust = "classical")
cva.size.UPg
## cross-validated classification results in frequencies
##
## Chaetodipus Perognathus
## Chaetodipus 78 5
## Perognathus 16 64
##
##
## cross-validated classification result in %
##
## Chaetodipus Perognathus
## Chaetodipus 93.9759 6.0241
## Perognathus 20.0000 80.0000
##
##
## overall classification accuracy: 87.11656 %
##
## Kappa statistic: 0.7416
Another method is to run a random forest analysis instead of a CVA.
library(randomForest)
#export the PC scores to a CSV first, e.g. write.table(data, file = "data.csv", sep = ","), and add a GenusID column before reading it back in
forest.up <- read.csv("pca.lands.allUp.scores.csv", header = T, stringsAsFactors = F)
forest.up$GenusID <- factor(forest.up$GenusID) #randomForest needs a factor response for classification
forest.up.results <- randomForest(GenusID ~ Comp1 + Comp2 + Comp3 + Comp4 + Comp5 + Comp6 + Comp7 + Comp8 + Comp9, data = forest.up, importance = T, ntree = 100, proximity = T)
forest.up.results
##
## Call:
## randomForest(formula = GenusID ~ Comp1 + Comp2 + Comp3 + Comp4 + Comp5 + Comp6 + Comp7 + Comp8 + Comp9, data = forest.up, importance = T, ntree = 100, proximity = T)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 23.93%
## Confusion matrix:
## Chaetodipus Perognathus class.error
## Chaetodipus 65 18 0.2168675
## Perognathus 21 59 0.2625000
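Because the model was fit with importance = T, you can also ask which PC axes drive the classification (output not shown):

varImpPlot(forest.up.results) #mean decrease in accuracy and in Gini per predictor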
Then similarly with size as a covariate.
forest.size.up <- read.csv("PC.size.up$x.csv", header = T, stringsAsFactors = F)
forest.size.up$GenusID <- factor(forest.size.up$GenusID) #again, the response must be a factor
forest.size.up.results <- randomForest(GenusID ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9, data = forest.size.up, importance = T, ntree = 100, proximity = T)
forest.size.up.results
##
## Call:
## randomForest(formula = GenusID ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9, data = forest.size.up, importance = T, ntree = 100, proximity = T)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 13.5%
## Confusion matrix:
## Chaetodipus Perognathus class.error
## Chaetodipus 74 9 0.1084337
## Perognathus 13 67 0.1625000
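To compare the random forest directly with the CVA's overall classification accuracy, the final out-of-bag error can be converted to an accuracy (a sketch using the stored error-rate matrix):

oob <- forest.size.up.results$err.rate[forest.size.up.results$ntree, "OOB"] #final OOB error
(1 - oob) * 100 #overall accuracy in %, comparable to the CVA output above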