This script contains code for calculating similarity coefficients and visualising them in a UPGMA tree.

First, the divSimCo function from diveRsity is used to calculate similarity coefficients for all pairs of individuals, according to Kosman & Leonard (2005).

library("diveRsity")
glb <- divSimCo("../data/monomorphism_main.gen")$global

These distances can now be clustered hierarchically using any appropriate algorithm. Here, the UPGMA method is used.

library("ape")
phyloTree <- as.phylo(hclust(as.dist(glb), method = "average"))

These results are now ready to visualise in a tree. However, it is useful to be able to colour-code tip labels so patterns of genetic similarity can be compared to original sample assignment, inferred group assignment, or some other trait, such as a phenotype etc.

Here, the original sample names will be used to colour code tip labels, and later, inferred population assignment.

Original sample tip label

Individual names in ‘monomorphism_main.gen’ (infile) contains a three digit prefix, which refers to the original sample locations. These should be extracted for each individual and their full name in plyloTree$tip.label replaced. Then a unique colour for each unique sample location can be assigned to each tip label.

# strip names refering to the level to be visualised (e.g. site name)
phyloTree$tip.label <- substr(phyloTree$tip.label, 1, 3)
# create a palette of colours for each unique factor
colLevels <- factor(unique(phyloTree$tip.label))
cols <- rainbow(length(colLevels))
# generate a vector of colours for each tip
tipCols <- sapply(phyloTree$tip.label, function(x){
  cols[which(levels(colLevels) == x)]
})

The vector tipCols can be passed to the tip.color argument in plot.phylo. This will result in each individual being coloured in a site specific way.

plot.phylo(phyloTree, label.offset = 0.01, edge.width = 3,
           tip.color = tipCols, type = "f", cex = 1)

Inferred group colouring

If sample site is not a particularly meaningful visualisation factor, then variables like ‘inferred’ population might be used.

The first thing to do is generate individual assignments to inferred groups. These can be derived from a number of different software such as STRUCTURE or BAPS. Here, a multivariate approach implemented in adegenet is used.

Finding k

First, run find.clusters to see how model goodness of fit changes with k.

library("adegenet")
# read the genepop file
gp <- read.genepop("../data/monomorphism_main.gen")
# Visualise BIC for k = 1:20 initially. (All PCAs retained)
grp <- find.clusters(gp, n.pca = 1000)

BIC

This initial analysis strongly indicates \(k=2\). This can be tested quantitatively using Ward’s method to identify the most likely \(k\).

grp <- find.clusters(gp, n.pca = 1000, choose.n.clust = FALSE, 
                     criterion = "diffNgroup")

The objective value for \(k\) can be found as follows:

length(unique(grp$grp))
## [1] 2

Now that we have the individuals assignments for inferred populations groups, these can be used to define a new tip colour vector.

# re-calculate the tree from the distance matrix
phyloTree <- as.phylo(hclust(as.dist(glb), method = "average"))
# get inferred group assignments and replace tip.labels
for(i in 1:length(grp$grp)){
  idx <- which(phyloTree$tip.label == names(grp$grp[i]))
  phyloTree$tip.label[idx] <- paste("pop_", as.numeric(grp$grp[i]), sep = "")
}
# Get the unique colour levels
colLevels <- factor(unique(phyloTree$tip.label))
# create a colour palette for the colour levels
cols <- rainbow(length(colLevels))
tipCols <- sapply(phyloTree$tip.label, function(x){
  cols[which(levels(colLevels) == x)]
})

This vector can now be passed to plot.phylo to colour code tip labels.

plot.phylo(phyloTree, label.offset = 0.01, edge.width = 3,
           tip.color = tipCols, type = "f", cex = 1.5)

References

Kosman E, Leonard K (2005) Similarity coefficients for molecular markers in studies of genetic relationships between individuals for haploid, diploid, and polyploid species. Molecular ecology, 14, 415-424.