Lesson 02 - Practical Guide to Morphological Phylogenetic Analysis in R

1. Introduction

In this practical guide, we will explore the fundamentals of morphological phylogenetic analysis using R. You will gain hands-on experience by learning how to utilize data from TNT in R. It is a powerful tool for reading sequences, creating distance matrices, and inferring trees. This guide will walk you through the process of calculating phylogenetic trees based on the parsimony criterion and constructing consensus trees using various methods. Additionally, you will discover how to incorporate time calibration into your phylogenetic analysis, enhancing the scope and depth of your evolutionary research.

2. Guided Activities

2.1 Loading TNT Sequences

To initiate our morphological phylogenetic analysis, the first step involves importing a file containing morphological sequences.

Locate Data Source.

For our example, we will be using the morphological data described in the paper that introduces Burkesuchus malagrandesis. It can be accessed through the following link. The related TNT sequences can be downloaded from Supplemental Material 2.

Packages

Before proceeding with importing the sequence file from TNT, ensure that TNT is installed on your system. It is also essential to load the R library TreeTools to facilitate interfacing with TNT.

3, Read Sequence File

With TNT software and the TreeTools library correctly configured, execute the necessary commands to import the morphological sequence file from TNT into R. This crucial initial step lays the groundwork for subsequent analyses, encompassing tasks such as constructing a distance matrix, computing parsimony criteria, constructing consensus trees, and integrating time calibration.

The provided code enables us to read the associated sequences.

# Load the "TreeTools" library for handling tree structures and interfacing with TNT
library("TreeTools")

# Load the "phangorn" library for phylogenetic analysis and tree manipulation
library("phangorn")

# Read the morphological sequence data of Burkesuchus from a TNT file
burkesuchus_sequences = ReadTntCharacters("secuences_burkesuchus.txt", encoding = "UTF8")

# Convert the matrix obtained from the TNT file to a PhyDat object, suitable for phylogenetic analysis in R
burkesuchus_sequences = MatrixToPhyDat(burkesuchus_sequences)

# Ensure that the object is of the class "phyDat", which is a requirement for many functions in "phangorn"
burkesuchus_sequences = as.phyDat(burkesuchus_sequences)

# Display the "phyDat" object containing the morphological sequences of Burkesuchus
burkesuchus_sequences

## 111 sequences with 448 character and 448 different site patterns.
## The states are - 0 1 2 3 4 5

2.2 Pairwise-Distance matrix

Once the morphological sequences of Burkesuchus are successfully loaded and converted into the appropriate format, the next step in our phylogenetic analysis is to create a distance matrix. We will use the Hamming distance as a metric to calculate the pairwise distances between the sequences. The Hamming distance measures the number of positions at which the corresponding symbols in two strings of equal length are different.

Calculate Hamming Distance

Utilize the appropriate function in R to calculate the Hamming distance between each pair of sequences in our burkesuchus_sequences object. This will give us a matrix of distances which will be crucial for constructing our phylogenetic tree.

# Calculate the distance matrix using Hamming distance (p distance)
distance_matrix = round(dist.hamming(burkesuchus_sequences,ratio = T),3)

Visualize the Distance Matrix

After calculating the distance matrix, it is helpful to visualize it. We will use the plotly library for this purpose, as it allows for interactive and aesthetically pleasing visualizations. Make sure to have the plotly library installed and loaded.

# Install and load the plotly library for visualization
library("plotly") # Uncomment to install: install.packages("plotly")
library("reshape2")

# Melt the distance matrix into a long format suitable for plotly visualization
dist_melted = melt(as.matrix(distance_matrix))

# Create the heatmap with plot_ly
heatmap_plot = plot_ly(
  x = dist_melted$Var1,  # x-axis labels
  y = dist_melted$Var2,  # y-axis labels
  z = dist_melted$value, # Distance values
  type = "heatmap",      # Specify the type of plot as heatmap
  colorscale = list(
    c(0, "white"),        # Set the minimum value of the scale to white
    c(1, "#00A499")       # Set the maximum value of the scale to #00A499
  )
)

# Add a title to the plot and adjust the font size of axis ticks
heatmap_plot <- layout(
  heatmap_plot,
  title = list(text = "Sequence distance - B. mallingrandensis", font = list(size = 12)),
  xaxis = list(tickfont = list(size = 6)), # Reduced font size for x-axis ticks
  yaxis = list(tickfont = list(size = 6))  # Reduced font size for y-axis ticks
)

# Adjust the font size of the axis titles
heatmap_plot$x$axis$title = list(font = list(size = 12))
heatmap_plot$y$axis$title = list(font = list(size = 12))

# Display the plot
heatmap_plot

2.3 Base NJ tree

Once we have the distance matrix at our disposal, we are ready to proceed with the construction of the Neighbour Joining (NJ) tree. The NJ method is a powerful tool for inferring phylogenetic relationships and can be applied in the following manner:

# Load the necessary library for creating and visualizing phylogenetic trees
library("ggtree")

# Build the Neighbour Joining tree using the calculated distance matrix
nj_tree = nj(distance_matrix)

# Convert multifurcating tree to bifurcating tree
nj_tree = multi2di(nj_tree)

# Adjust negative branch lengths to zero
nj_tree$edge.length[which(nj_tree$edge.length < 0)] = 0

# Re-root the tree at its midpoint for better visualization
nj_tree = midpoint(nj_tree)

# Initialize a color vector for species, default color is black
species_colors = rep("black", length(nj_tree$tip.label))

# Assign a specific color (#c7254e) to the 111th tip label
species_colors[111] = "#c7254e"

# Create a ggtree object for visualization and apply the ladderize function for ordering
g = ggtree(nj_tree, layout = "dendrogram", ladderize = TRUE)

# Add tip labels with specified colors
g = g + geom_tiplab(size = 4, color = species_colors, align = F)

# Adjust plot margins and set the legend position at the top
g = g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")

# Add colored points to nodes and set legend position at the top
g = g + geom_nodepoint(size = 4, color = "#c7254e") + theme(legend.position = 'top')

# Add title to the plot
g = g + labs(title = "NJ tree")

# Display the final plot
plot(g)

2.4 Parsimony tree

Begin by establishing the concept of parsimony criteria. Parsimony, as a fundamental method in phylogenetic analysis, aims to construct phylogenetic trees by minimizing the overall tree length. The tree length is defined as the sum of branch lengths and serves as a representation of evolutionary changes. To search for a parsimony tree, the following approach can be employed. It’s important to note that, in this particular case, we won’t delve into exploring optimal configurations for the related parameters. Our focus is primarily on introducing the fundamental concept of parsimony.

# Optimizing parsimony tree

# Set the seed for reproducibility. This ensures that any operation that relies on randomness will give the same result each time.
set.seed(1)

# Using the optim.parsimony function, optimize the topology and branch lengths of the starting NJ (Neighbor Joining) tree 
# with respect to the given morphological sequences of Burkesuchus to create a parsimony tree.
# - nj_tree: the starting Neighbor Joining tree.
# - burkesuchus_sequences: the morphological sequences of Burkesuchus.
parsimony_tree = optim.parsimony(nj_tree, burkesuchus_sequences,trace = 0)

# Convert multifurcating tree to a strictly bifurcating tree using multi2di function. 
# This function collapses nodes with a degree of 2 and adjusts the branch lengths accordingly.
parsimony_tree = multi2di(parsimony_tree)
parsimony_tree = acctran(parsimony_tree,burkesuchus_sequences)
parsimony_tree = midpoint(parsimony_tree)


# Calculate and print the parsimony score of the optimized tree with respect to the given sequences.
# This score represents the minimum number of character-state changes needed to explain the observed variation among the sequences.
parsimony(parsimony_tree, burkesuchus_sequences)

## [1] 1678

# Create a ggtree object for visualization and apply the ladderize function for ordering
g = ggtree(parsimony_tree, layout = "dendrogram", ladderize = TRUE)

# Add tip labels with specified colors
g = g + geom_tiplab(size = 4, color = species_colors, align = F)

# Adjust plot margins and set the legend position at the top
g = g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")

# Add colored points to nodes and set legend position at the top
g = g + geom_nodepoint(size = 4, color = "#c7254e") + theme(legend.position = 'top')

# Add title to the plot
g = g + labs(title = "Parsimony tree")

# Display the final plot
plot(g)

2.5 Consensus tree

In this section, our focus shifts to the creation of consensus trees based on the previously generated tree data. These consensus trees serve as concise summaries, effectively highlighting shared branches and groupings that emerge from the multitude of individual trees. They provide a streamlined representation of the intricate web of underlying phylogenetic relationships. Furthermore, this approach offers the flexibility to incorporate time-related information when the tree structure lacks polytomies (nodes with multiple descendants).

To illustrate the process, we will utilize a dataset comprising 100 parsimonious trees for Burkesuchus. Our methodology aligns with the guidelines outlined by the authors, which are accessible through this link.

library("phangorn")
# Reading trees - ensure that the file contains a header with a path associated with sequences (1669).
tree_paper_data = ReadTntTree("arboles_burkesuchus_100_1008.tree")  # Parsimony score 1669 (In R 1676)

1. Mutimodal methods

Multimodal methods integrate individual matrices that arise from diverse biological evidences by utilizing various strategies such as calculating the mean (Mmean), the median (Mmed), the product (Mprod), the maximum (Mmax), or the minimum (Mmin) values for each corresponding position within the matrices. Additionally, an integration of these matrices can be achieved using weighted measures. These methods enable the analysis of datasets containing varied evolutionary information and molecular clocks, thereby providing a comprehensive insight into evolutionary relationships.

Below is an example of how to implement multimodal methods based on the mean.

# Initialize cophenetic matrix from the first tree in the list of trees
copheneticMatrix = cophenetic(tree_paper_data[[1]])

# Iterate over the rest of the trees in the list, adding their cophenetic matrices to the initialized matrix
for (treeIndex in 2:100) {
  copheneticMatrix = copheneticMatrix + cophenetic(tree_paper_data[[treeIndex]])
}

# Calculate the average cophenetic matrix by dividing by the total number of trees
copheneticMatrix = copheneticMatrix / 100

# Build a Neighbour Joining tree from the averaged cophenetic matrix
multimodalTree = NJ(copheneticMatrix)

print(is.binary.tree(multimodalTree))

## [1] TRUE

print(parsimony(multimodalTree,burkesuchus_sequences))

## [1] 3563

2. Consensus rule

Many of the consensus tree methods proposed in the field primarily rely on splits and clusters, often neglecting the consideration of branch lengths. Various rules exist for amalgamating topological information. For instance, the Strict Consensus Tree (CSstr) incorporates only the splits that are common across all the trees, while the Majority Rule Tree (CSmaj) includes the common splits observed in more than half of the input trees. Conversely, the Loose or Semi-Strict Consensus Tree employs exact splits that are compatible with every input tree. Adam’s Consensus Tree, on the other hand, takes into account the common splits from rooted input trees and addresses conflicting tips by placing them in a distinct isolated node. These methods are versatile, allowing for generalization through weighting of input trees, and often lead to the generation of multifurcated trees, known as polytomies.

# Initialize the consensus tree using the Strict Rule
# The parameter p=1 indicates that only splits common to all input trees are included.
strict_tree = consensus(tree_paper_data)
strict_tree = acctran(strict_tree, burkesuchus_sequences)
strict_tree = midpoint(strict_tree)
print(is.binary.tree(strict_tree))

## [1] TRUE

print(parsimony(strict_tree,burkesuchus_sequences))

## [1] 1743

# Initialize the consensus tree using the Majority Rule
# The parameter p=0.5 signifies that splits present in more than half of the input trees are included.
majority_tree = consensus(tree_paper_data, p=0.5)
majority_tree = acctran(majority_tree, burkesuchus_sequences)
majority_tree = midpoint(majority_tree)
print(is.binary.tree(majority_tree))

## [1] TRUE

print(parsimony(majority_tree,burkesuchus_sequences))

## [1] 1676

3. MRP

In this section, we delve into the intricacies of Matrix Representation with Parsimony (MRP) analysis applied to our input trees. The primary objective is to encode the intricate network of input trees into a singular, cohesive matrix using a binary representation scheme. This matrix encapsulates the wealth of evolutionary data and variations present in the original trees. Subsequently, we leverage the parsimony criterion to navigate this matrix and extract a consensus tree that succinctly captures the shared evolutionary signals among the diverse input trees. In essence, MRP analysis is a powerful method for distilling complex phylogenetic information into a coherent and informative representation.

# Load the "phytools" R library, which provides phylogenetic analysis tools.
library("phytools")

# Suppress messages from the mrp.supertree and acctran functions
mrp_tree = mrp.supertree(tree_paper_data)
mrp_tree = acctran(mrp_tree, burkesuchus_sequences)
mrp_tree = midpoint(mrp_tree)
print(is.binary.tree(mrp_tree))
print(parsimony(mrp_tree,burkesuchus_sequences))

4. Average tree

In this section, we delve into the concept of tree averaging—a method that aims to synthesize a consensus tree from a set of input trees. This process involves several key steps, beginning with the transformation of each input tree into a pairwise distance matrix, commonly referred to as a cophenetic matrix. These matrices serve as a foundation for generating a single composite matrix that encapsulates the collective information present in the original trees.

There are two notable approaches to achieving this consensus: the CSave method and the median consensus tree method. These approaches share a common strategy with multi-modal methods, facilitating the identification of shared patterns within the input trees. An alternative technique employs greedy algorithms to pinpoint a tree with equidistant relationships to the input trees. This method leverages geometric properties to guide the computational process, offering an alternative perspective on consensus tree construction.

# Compute the average tree by suppressing any console messages
average_tree = averageTree(tree_paper_data, method = "path.difference",quiet=T)

average_tree = midpoint(average_tree)
print(is.binary.tree(average_tree))
print(parsimony(average_tree,burkesuchus_sequences))

5. Medoid method

The Medoid method is a pivotal technique employed in tree-based analyses. It revolves around the identification of a tree that minimizes the average difference when compared to all other trees within a given set. In essence, it strives to pinpoint the central or most representative tree from a pool of diverse trees by systematically reducing the overall dissimilarity or distance relative to the other trees.

This method holds significant importance in phylogenetic analyses, contributing to the determination of a consensus or representative tree that adeptly encapsulates the intricate web of evolutionary relationships among species or data points.

# Initialize an empty matrix for RF distances with dimensions 100x100
RF_distance_matrix = matrix(0, 100, 100)

# Calculate Robinson-Foulds (RF) distances between all pairs of trees in 'arboles_bm'
for (a in 1:100) {
  for (b in 1:100) {
    # Calculate the RF distance between tree 'a' and tree 'b' and store it in the matrix
    RF_distance_matrix[a, b] = RF.dist(tree_paper_data[[a]], tree_paper_data[[b]], normalize = TRUE)
  }
}

# Find the index of the tree that minimizes the average RF distance
idx_med = which.min(colMeans(RF_distance_matrix))

# Medoid: Select the tree that corresponds to the medoid index
medoid_tree = tree_paper_data[[idx_med]]
medoid_tree = midpoint(medoid_tree)
print(is.binary.tree(medoid_tree))

## [1] TRUE

print(parsimony(medoid_tree,burkesuchus_sequences))

## [1] 1676

library("gridExtra")    # For organizing plots in a grid

#---------------------------
# multimodalTree
#---------------------------
# Create a ggtree object for visualization and apply the ladderize function for ordering
g1 = ggtree(multimodalTree, layout = "dendrogram", ladderize = TRUE)

# Add tip labels with specified colors
g1 = g1 + geom_tiplab(size = 2, color = species_colors, align = T)

# Adjust plot margins and set the legend position at the top
g1 = g1 + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")

# Add colored points to nodes and set legend position at the top
g1 = g1 + geom_nodepoint(size = 2, color = "#c7254e") + theme(legend.position = 'top')

# Add title to the plot
g1 = g1 + labs(title = "Multimodal tree")



#---------------------------
# strict_tree
#---------------------------
# Create a ggtree object for visualization and apply the ladderize function for ordering
g2 = ggtree(strict_tree, layout = "dendrogram", ladderize = TRUE)

# Add tip labels with specified colors
g2 = g2 + geom_tiplab(size = 2, color = species_colors, align = T)

# Adjust plot margins and set the legend position at the top
g2 = g2 + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")

# Add colored points to nodes and set legend position at the top
g2 = g2 + geom_nodepoint(size = 2, color = "#c7254e") + theme(legend.position = 'top')

# Add title to the plot
g2 = g2 + labs(title = "Strict consensus rule tree")


#---------------------------
# majority_tree
#---------------------------
# Create a ggtree object for visualization and apply the ladderize function for ordering
g3 = ggtree(majority_tree, layout = "dendrogram", ladderize = TRUE)

# Add tip labels with specified colors
g3 = g3 + geom_tiplab(size = 2, color = species_colors, align = T)

# Adjust plot margins and set the legend position at the top
g3 = g3 + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")

# Add colored points to nodes and set legend position at the top
g3 = g3 + geom_nodepoint(size = 2, color = "#c7254e") + theme(legend.position = 'top')

# Add title to the plot
g3 = g3 + labs(title = "Majority consensus rule tree")


#---------------------------
# mrp_tree
#---------------------------
# Create a ggtree object for visualization and apply the ladderize function for ordering
g4 = ggtree(mrp_tree, layout = "dendrogram", ladderize = TRUE)

# Add tip labels with specified colors
g4 = g4 + geom_tiplab(size = 2, color = species_colors, align = T)

# Adjust plot margins and set the legend position at the top
g4 = g4 + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")

# Add colored points to nodes and set legend position at the top
g4 = g4 + geom_nodepoint(size = 2, color = "#c7254e") + theme(legend.position = 'top')

# Add title to the plot
g4 = g4 + labs(title = "MRP tree")


#---------------------------
# average_tree
#---------------------------
# Create a ggtree object for visualization and apply the ladderize function for ordering
g5 = ggtree(average_tree, layout = "dendrogram", ladderize = TRUE)

# Add tip labels with specified colors
g5 = g5 + geom_tiplab(size = 2, color = species_colors, align = T)

# Adjust plot margins and set the legend position at the top
g5 = g5 + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")

# Add colored points to nodes and set legend position at the top
g5 = g5 + geom_nodepoint(size = 2, color = "#c7254e") + theme(legend.position = 'top')

# Add title to the plot
g5 = g5 + labs(title = "Average consensus tree")


#---------------------------
# medoid_tree
#---------------------------
# Create a ggtree object for visualization and apply the ladderize function for ordering
g6= ggtree(medoid_tree, layout = "dendrogram", ladderize = TRUE)

# Add tip labels with specified colors
g6 = g6 + geom_tiplab(size = 2, color = species_colors, align = T)

# Adjust plot margins and set the legend position at the top
g6 = g6 + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")

# Add colored points to nodes and set legend position at the top
g6 = g6 + geom_nodepoint(size = 2, color = "#c7254e") + theme(legend.position = 'top')

# Add title to the plot
g6 = g6 + labs(title = "Medoid tree")



# Organize the two plots in a two-column grid
grid.arrange(g1, g2, g3, g4, g5, g6, nrow=3, ncol = 2)

2.6 Time Scaling

# Load the "strap" library
library("strap")

# Read the CSV file containing the annotations. The separator is ";", and the file has a header.
annotations = read.csv("anotaciones_burkesuchus_age.csv", sep=";", header=T)

# Create a time variable 'ages' by extracting the 7th and 13th columns from the annotations.
# Set the row names of 'ages' to be the first column of the annotations.
# Rename the columns of 'ages' to "FAD" and "LAD".
ages = annotations[,c(7,13)]
rownames(ages) = annotations[,1]
names(ages) = c("FAD", "LAD")

# I noticed that the variable 'medoid_tree' is not defined in the provided code. 
# Assuming it's defined elsewhere in your code, add time to the medoid tree using the 'DatePhylo' function.
# The method used is "equal" and the relative length of the branches is 0.1.
medoid_tree_with_time = DatePhylo(medoid_tree, ages, method="equal", rlen=0.1)

# Plot the phylogenetic tree with geological timescale.
# Set the sizes of ages, tips, and time slices text with 'cex.age', 'cex.tip', and 'cex.ts'.
# Set the tick scale to "Age" and define the units of geological timescale.
# Set the rotation of Epoch and units text, and the direction of the tree.
# Remove quarternary period with 'quat.rm = T' and set the version of the International Chronostratigraphic Chart to 2013.
geoscalePhylo(tree=medoid_tree_with_time, ages=ages, boxes="Age", cex.age = 0.89, cex.tip=1, cex.ts = 0.9,
              tick.scale="Age", units = c("Eon","Era","Period","Epoch","Age"), erotate = 90, urotate = 0,
              direction = "rightwards", quat.rm = T, vers = "ICS2013")

>Challenge 08 > >Your task is to construct phylogenetic trees, by appliying the kmers-based strategy.