Lesson 01 - Introduction to Tree Inference from Molecular Sequences Using R

logo

1. Introduction

Welcome to the first lesson on inferring phylogenetic trees from molecular sequences using R. This lesson will form the foundation for understanding the evolutionary relationships between different species or genes based on their molecular sequences. Molecular phylogenetics, at its core, uses sequences - DNA, RNA, or protein - to infer evolutionary relationships. These inferred relationships can provide insights into the history of life, the function and structure of genes, as well as the processes shaping the evolution of genomes. In today’s rapidly advancing genomics age, understanding these relationships is crucial for both basic and applied sciences.

  1. Multiple Sequence Alignment (MSA) in R. Before we can infer trees, we must align our sequences to identify regions of similarity and difference. We will dive deep into the msa package in R to understand its utility in this process.

  2. The Conceptual Basics of Phylogenetics. Understanding the terminology like trees, branches, and nodes, as well as the principle behind constructing phylogenetic trees.

  3. Tree Inference Methods. We will introduce popular methods such as Maximum Likelihood, Neighbor-Joining, and Parsimony.

  4. Visualization of Phylogenetic Trees. A tree is of limited use if we cannot visualize and interpret it. We’ll learn tools and practices to visualize and interpret the inferred trees.

2. Guided Activities

2.1 Multiple Sequence Alignment

Multiple Sequence Alignment, commonly known as MSA, is the foundational step in molecular phylogenetics. By aligning sequences from different organisms or genes, we set the stage for deciphering evolutionary relationships. This process arranges DNA, RNA, or protein sequences to identify and highlight their similarities and differences, serving as the precursor for subsequent analyses like tree inference. Proper alignment is crucial, as it can significantly influence the results of the phylogenetic analyses that follow.

R offers a plethora of packages tailored for molecular sequence analysis. Among them, the msa package stands out as a powerful tool for multiple sequence alignment. Using msa, researchers can tap into various renowned alignment algorithms without juggling multiple software tools. This package not only facilitates the alignment process but also integrates seamlessly with other R-based bioinformatics tools, ensuring a smooth workflow from sequence alignment to phylogenetic analysis.

Here’s a brief glimpse of how to use the msa package:

1. Installation and Setup

If you haven’t already, you need to install and then load the msa package.

# Install the BiocManager package from CRAN
library("BiocManager") #install.packages("BiocManager")

# Use BiocManager to install the 'msa' package from Bioconductor
require("msa") #BiocManager::install("msa")

# Load the 'msa' package into the R session
library("msa")

The msa package is one the main packages available by Bioconductor.

2. Performing an Alignment

With your sequences ready (either manually input or read from a file), you can quickly align them.

# Load the 'Biostrings' package. 
# 'Biostrings' provides efficient containers to handle and analyze biological sequences.
library("Biostrings")

# Define a vector containing four DNA sequences
sequences = c("ACTGGCTG", "ACTGCTG", "AGTGACT", "TGTGACTGA")

# Convert the vector of sequences into a DNAStringSet object using the Biostrings package
biostrings_sequences = DNAStringSet(sequences)

# Perform multiple sequence alignment on the DNAStringSet object using the 'msa' package
msa_result_sample = msa(biostrings_sequences,method = "ClustalW")
## use default substitution matrix
# Print the original sequences
print(biostrings_sequences)  # Note: You had a typo here; it was 'biostrings_sequence'
## DNAStringSet object of length 4:
##     width seq
## [1]     8 ACTGGCTG
## [2]     7 ACTGCTG
## [3]     7 AGTGACT
## [4]     9 TGTGACTGA
# Print the alignment result
print(msa_result_sample)
## CLUSTAL 2.1  
## 
## Call:
##    msa(biostrings_sequences, method = "ClustalW")
## 
## MsaDNAMultipleAlignment with 4 rows and 9 columns
##     aln 
## [1] AGTGACT--
## [2] TGTGACTGA
## [3] ACTGGCTG-
## [4] -ACTGCTG-
## Con AGTG?CTG-

Here an example with amine acids, including a penalization matrix,

# Define a vector containing amino acid sequences of varying lengths
amino_sequences =  c("ARNDCLTQ", "ARN", "ARMDCQK", "RRRTDGPSW")
amino_sequences = AAStringSet(amino_sequences) 
print(amino_sequences)
## AAStringSet object of length 4:
##     width seq
## [1]     8 ARNDCLTQ
## [2]     3 ARN
## [3]     7 ARMDCQK
## [4]     9 RRRTDGPSW
# Realiza la alineación múltiple de las secuencias de aminoácidos.
# El argumento 'type' se establece en "protein" para indicar que estamos trabajando con secuencias de proteínas.
amino_sequences = msa(amino_sequences, type = "protein")
## use default substitution matrix
print(amino_sequences)
## CLUSTAL 2.1  
## 
## Call:
##    msa(amino_sequences, type = "protein")
## 
## MsaAAMultipleAlignment with 4 rows and 9 columns
##     aln 
## [1] -ARNDCLTQ
## [2] -ARN-----
## [3] -ARMDCQK-
## [4] RRRTDGPSW
## Con -ARNDC??-
# Carga la matriz de sustitución BLOSUM62, que es una matriz comúnmente utilizada para comparar secuencias de proteínas.
data(BLOSUM62)

# Calcula la puntuación de conservación para las secuencias de aminoácidos alineadas utilizando la matriz BLOSUM62.
amino_sequences = msaConservationScore(amino_sequences, BLOSUM62)

print(amino_sequences)
##   -   A   R   N   D   C   ?   ?   - 
## -10  35  80  24  31   7 -19  -9 -16

Challenge 01

The ALKBH5 gene is part of a family of genes that play crucial roles in DNA repair and demethylation processes. Its activity can influence various biological processes including gene regulation, cellular differentiation, and development. In humans and many other organisms, mutations or dysregulation in this gene can lead to various health concerns, highlighting its significance in molecular biology.

Now, let’s turn our attention to some aquatic species. The fish species Salmo trutta (brown trout), Salmo salar (Atlantic salmon), Oncorhynchus kisutch (coho salmon), Oncorhynchus tshawytscha (Chinook salmon), and Oncorhynchus mykiss (rainbow trout) are not only commercially valuable but also vital subjects in ecological and evolutionary studies. Analyzing the genetic similarities and differences among these species can offer insights into their evolutionary paths, potential adaptive mechanisms, and overall genetic health.

Proceed with a multiple sequence alignment of the ALKBH5 gene utilizingene using this fasta file. To import the fasta file, employ the read.fasta() function from the seqinr library)

2.2 Pairwise Sequence Distances

After aligning sequences, the next step in phylogenetic inference is often to compute the distances between sequences. This provides a quantitative measure of how much two sequences have diverged over evolutionary time. Such pairwise distances can offer insights into the evolutionary relationships among sequences and are often used to construct phylogenetic trees, especially in distance-based methods like UPGMA or Neighbor Joining.

For our set of sequences, we’ll explore several distance metrics: Hamming distance, \(p\)-distance, Jukes Cantor, and Kimura 2-parameter (K80). Each of these metrics provides a unique perspective on sequence divergence, and understanding their differences is vital for phylogenetic analysis.

Let us study these distances with four sequences example data.

1. Hamming Distance

Hamming distance is the simplest measure, counting the number of positions at which two sequences differ. Gaps are not compared.

# Load the phangorn library: Provides tools for phylogenetic reconstruction and analysis
library("phangorn")

# Load the ape library: Essential for reading, writing, and manipulating phylogenetic trees
library("ape")

# Load the knitr library: Used for generating HTML tables
library("knitr")

# Define a vector containing four DNA sequences
# Convert the multiple sequence alignment result to a phyDat object for downstream analyses in phangorn
phyDat_msa_sample = as.phyDat(msa_result_sample)

# Assign custom names to the sequences
names(phyDat_msa_sample) = c("Seq 1", "Seq 2", "Seq 3", "Seq 4")

# Compute the Hamming distance matrix for the aligned sequences
D_hamming = dist.hamming(phyDat_msa_sample, ratio = FALSE)

# Round the values in the distance matrix to two decimal places
D_hamming = round(as.matrix(D_hamming), 2)

# Print the Hamming distance matrix as an HTML table with a 20% width
kable(D_hamming, format = "html", table.attr = "style='width:20%;'")
Seq 1 Seq 2 Seq 3 Seq 4
Seq 1 0 1 2 4
Seq 2 1 0 3 4
Seq 3 2 3 0 3
Seq 4 4 4 3 0

2. P-distance

P-distance is the proportion of nucleotide sites at which two sequences being compared are different.

# Compute the distance matrix using the 'dist.hamming' function (assuming it's from a specific package)
D_p = dist.hamming(phyDat_msa_sample,ratio = T)

# Convert the distance matrix to a matrix and round the values
D_p = round(as.matrix(D_p),2)


# Print the p distance matrix as an HTML table with a 20% width
kable(D_p, format = "html", table.attr = "style='width:20%;'")
Seq 1 Seq 2 Seq 3 Seq 4
Seq 1 0.00 0.11 0.22 0.44
Seq 2 0.11 0.00 0.33 0.44
Seq 3 0.22 0.33 0.00 0.33
Seq 4 0.44 0.44 0.33 0.00

3. Edition distance

The Levenshtein Edit Distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another, providing a metric of similarity.

# Load the "stringdist" library, which is used to calculate distances between character strings.
library("stringdist")

# Convert the sequences into character strings (if they are not already).
seq_chars = as.character(sequences)

# Calculate the Levenshtein distance matrix (edit distance) between the character sequences.
D_Edit = stringdistmatrix(seq_chars, seq_chars, method = "lv")

# Assign custom names to the sequences
colnames(D_Edit) = c("Seq 1", "Seq 2", "Seq 3", "Seq 4")
rownames(D_Edit) = c("Seq 1", "Seq 2", "Seq 3", "Seq 4")

# Round the values in the distance matrix to two decimal places.
D_Edit = round(D_Edit, 2)

# Generate an HTML-formatted table (using the kable function) to display the distance matrix.
# Apply a CSS style attribute to adjust the table width to 20% of the available width.
kable(D_Edit, format = "html", table.attr = "style='width:20%;'")
Seq 1 Seq 2 Seq 3 Seq 4
Seq 1 0 1 3 4
Seq 2 1 0 3 4
Seq 3 3 3 0 3
Seq 4 4 4 3 0

4. Jukes-Cantor (JC69)

Jukes-Cantor 1969 considers the possibility of multiple changes at a given position, providing corrections for back mutations.

# Compute the p distance matrix for the aligned sequences
D_JK69 = dist.dna(as.DNAbin(phyDat_msa_sample), model = "JC69")

# Convert the distance matrix to a matrix and round the values
D_JK69 = round(as.matrix(D_JK69),2)

# Print the JC69 as an HTML table with a 20% width
kable(D_JK69, format = "html", table.attr = "style='width:20%;'")
Seq 1 Seq 2 Seq 3 Seq 4
Seq 1 0.00 0.00 0.44 1.65
Seq 2 0.00 0.00 0.44 1.65
Seq 3 0.44 0.44 0.00 0.82
Seq 4 1.65 1.65 0.82 0.00

5. Kimura 2-parameter (K80)

The Kimura 2-parameter model considers both transition and transversion substitutions, offering a more nuanced view of sequence divergence.

# Calculate the distance matrix with the K80 model and store it in 'D_K80'
D_K80 = dist.dna(as.DNAbin(phyDat_msa_sample), model = "K80")

# Convert the distance matrix to a matrix and round the values to two decimal places
D_K80 = round(as.matrix(D_K80), 2)

# Replace any NA (Not Available) values in the matrix with 0
D_K80[which(is.na(D_K80))] = 0

# Print the K80 as an HTML table with a 20% width
kable(D_K80, format = "html", table.attr = "style='width:20%;'")
Seq 1 Seq 2 Seq 3 Seq 4
Seq 1 0.00 0.00 0.45 0.00
Seq 2 0.00 0.00 0.45 0.00
Seq 3 0.45 0.45 0.00 0.82
Seq 4 0.00 0.00 0.82 0.00

Challenge 02

  • Compute the distances for the ALKBH5 gene. In addition to the methods you are familiar with, include two new types of distance metrics.
  • Reflect on the following: Why might using the Levenshtein distance not be ideal in this context?

2.3 Preamble to Phylogenetic Trees

Understanding the evolutionary relationships between species is pivotal in many biological studies. Phylogenetic trees provide a graphical representation of these relationships, showcasing the divergence of species from common ancestors. In this section, we’ll introduce fundamental operations related to phylogenetic trees in R.

1. Random trees (rtree)

For generating random phylogenetic trees, you can use the ape package:

# Set a seed for reproducibility of random processes
set.seed(1)

# Generate a random tree with 5 tips using the 'rtree' function
random_tree = rtree(5)

# Plot the generated random tree
plot(random_tree)

This will generate and plot a random tree with 5 taxa.

2. Accessing Species in the Tree

To access the tip labels (species or taxa) of the tree:

# Get the tip labels from the 'random_tree' object and store them in 'labels'
labels = random_tree$tip.label

# Print the tip labels
print(labels)
## [1] "t2" "t1" "t3" "t4" "t5"
# Assign new tip labels to 'random_tree'
random_tree$tip.label = c("Species 1", "Species 2", "Species 3", "Species 4", "Species 5")

# Plot the tree with the updated tip labels
plot(random_tree)

3. Adding or Removing a Branch

# Remove a tip labeled "Species 3" from the 'random_tree' object
random_tree1 = drop.tip(random_tree, "Species 3")


# Add a new tip labeled "Species 3" connected to "Species 1" with an edge length of 1
random_tree2 = add.tips(random_tree1, "Species 3", "Species 1", edge.length = 1)

# Plot the trees side by side
# Set the plot layout to display two plots in a single row
par(mfrow = c(1, 2))
plot(random_tree1)
plot(random_tree2)

4. Changing Edge Lengths

You can modify the edge lengths (branches) by adjusting the edge.length attribute:

# Define all edge lengths in 'random_tree' equal to 1
random_tree2$edge.length = random_tree2$edge.length*0 + 1

# Plot the tree with the modified edge lengths
plot(random_tree2)

5. Rooting the Tree

To root (unroot) the tree at a specific tip or node:

# Root the 'random_tree' at a specified node labeled as "Species 1" (Unroot() can be applied to remove the root)
rooted_tree1 = root(random_tree, "Species 1")

# Root the 'random_tree' at its midpoint
rooted_tree2 = midpoint(random_tree)

# Plot the trees
par(mfrow = c(1, 2))
plot(rooted_tree1)
plot(rooted_tree2)

6. Bifurcating the Tree

If your tree is not bifurcating (i.e., it’s multifurcating), you can convert it to a bifurcating tree:

# Create a multifurcating tree
# Remove tip labeled "Species 3" from 'random_tree' and then add it back as a multifurcation with 7 descendants, each with an edge length of 1
random_tree1 = drop.tip(random_tree, "Species 3")
random_tree1 = add.tips(random_tree1, "Species 3", 7, edge.length = 1)

# Solve polytomy by converting the multifurcating tree to a bifurcating tree
random_tree2 = multi2di(random_tree1)

# Plot the trees side by side
# Set the plot layout to display two plots in a single row
par(mfrow = c(1, 2))

# Plot the trees
plot(random_tree1)
plot(random_tree2)

7. Visualizing the Tree

To visualize the tree, we will use the ggtree library:

# Load the necessary libraries
library("ggtree")       # For phylogenetic tree visualization
library("ggplot2")      # For plotting
library("gridExtra")    # For organizing plots in a grid

# Create the first dendrogram-style plot
g1 = ggtree(rooted_tree2, layout = "dendrogram", ladderize = TRUE, color = "#00A499")

# Customize the appearance of the first plot and tip labels
g1 = g1 + geom_tiplab(size = 4, color = "black", hjust = 1, angle = 90)
g1 = g1 + theme(plot.margin = margin(t = 0, r = 0, b = 50, l = 0, unit = "pt"), legend.position = "top")
g1 = g1 + geom_nodepoint(size = 3, color = "#c7254e") + theme(legend.position = 'top')
g1 = g1 + labs(title = "Midpoint rooted tree (dendrogram)")

# Create the second circular-style plot
g2 = ggtree(rooted_tree2, layout = "circular", ladderize = TRUE, color = "#00A499")

# Customize the appearance of the second plot and tip labels
g2 = g2 + geom_tiplab(size = 4, color = "black", hjust = 1, angle = 90)
g2 = g2 + theme(plot.margin = margin(t = 0, r = 0, b = 50, l = 0, unit = "pt"), legend.position = "top")
g2 = g2 + geom_nodepoint(size = 3, color = "#c7254e") + theme(legend.position = 'top')
g2 = g2 + labs(title = "Midpoint rooted tree (circular)")

# Organize the two plots in a two-column grid
grid.arrange(g1, g2, ncol = 2)

8. Reading and Writing trees

The Newick and Nexus formats are standard text formats used to represent phylogenetic trees and associated data.

  • Newick Format. This is a simple format for representing trees in plain text. A typical Newick string might look like: (A,B,(C,D)E)F;, where, A to D are leaves (species/taxa), E is an internal node, and F is the root of the tree.

  • Nexus Format. Nexus is a more complex format used not only for trees but also for other kinds of evolutionary data. It has multiple blocks (e.g., TAXA, DATA, TREES) that allow the representation of various datasets and annotations.

# Saving to Newick
write.tree(rooted_tree2, file="output_newick_file.nwk")

# Saving to Nexus (using 'ape' package)
write.nexus(rooted_tree2, file="output_nexus_file.nex")
# For Newick
tree_newick = read.tree("output_newick_file.nwk")

# For Nexus
tree_nexus = read.nexus("output_nexus_file.nex")

Challenge 03

  • Define Phylo and multiPhylo data structures.
  • Assess tree properties using the is.rooted() and is.binary() functions.

2.4 Ultrametric trees

Ultrametric trees are a special kind of phylogenetic trees where the distances from the root to every tip are equal. These trees are often used to represent hierarchical clustering of sequences and species, showcasing the evolutionary time from the last common ancestor. Distance-based methods for constructing ultrametric trees are crucial for exploring evolutionary relationships, and several algorithms exist for this purpose. Below, we will explore various methods for constructing such trees, including NNC, FN, WPGMA, UPGMC, WPGGMC, UPGMA, and Ward.

You can apply these methods as demonstrated below, using the example dataset comprising four sequences.

# Distance Calculation
# Calculate the Hamming distance matrix for the given aligned sequences
# This serves as a measure of pairwise sequence dissimilarity for tree construction
D_hamming = dist.hamming(phyDat_msa_sample)

# Nearest Neighbor Clustering (NNC)
# Construct a phylogenetic tree using the Nearest Neighbor Clustering method
# This method groups sequences based on the nearest (smallest) pairwise distance
tree_NNC = upgma(D_hamming, "single")
tree_NNC = midpoint(tree_NNC)

# Furthest Neighbor (FN)
# Construct a phylogenetic tree using the Furthest Neighbor method
# This method groups sequences based on the furthest (largest) pairwise distance
tree_FN = upgma(D_hamming, "complete")
tree_FN = midpoint(tree_FN)

# Weighted Pair Group Method with Arithmetic Mean (WPGMA)
# Construct a phylogenetic tree using the WPGMA method
# This method considers all pairwise distances for clustering and calculates average distances
tree_WPGMA = wpgma(D_hamming)
tree_WPGMA = midpoint(tree_WPGMA)

# Unweighted Pair-Group Centroid Method (UPGMC)
# Construct a phylogenetic tree using the UPGMC method
# This method clusters sequences based on the centroid distance without considering the number of sequences in each cluster
tree_UPGMC = upgma(D_hamming, "centroid")
tree_UPGMC = midpoint(tree_UPGMC)

# Weighted Pair-Group Centroid Method (WPGGMC)
# Construct a phylogenetic tree using the WPGGMC method
# This method clusters sequences based on the centroid distance and considers the number of sequences in each cluster
tree_WPGGMC = wpgma(D_hamming, "centroid")
tree_WPGGMC = midpoint(tree_WPGGMC)

# Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
# Construct a phylogenetic tree using the UPGMA method
# This method is similar to WPGMA but does not consider the number of sequences in each cluster for centroid calculation
tree_UPGMA = upgma(D_hamming)
tree_UPGMA = midpoint(tree_UPGMA)

# Ward's Minimum Variance Method
# Construct a phylogenetic tree using Ward's method
# This method minimizes the total within-cluster variance
tree_WARD = upgma(D_hamming, "ward")
tree_WARD = midpoint(tree_WARD)

# Function to plot trees
plot_tree = function(tree_plot, title_plot, max_x) {
  g = ggtree(tree_plot, color = "#00A499", size = 1)
  
  # Customize the appearance of the plot and tip labels
  g = g + geom_tiplab(size = 4, color = "black", align = TRUE) +
    geom_nodepoint(size = 3, color = "#c7254e") +
    labs(title = title_plot, size = 6) +
    xlim(0, max_x) +
    theme(
      # Remove axis lines and text
      axis.line = element_blank(),
      axis.text = element_blank(),
      # Remove all grids
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      # Adjust margins and legend position
      plot.margin = margin(t = 0, r = -5, b = 0, l = 0, unit = "pt"),
      legend.position = 'top'
    )
  
  # Add labels for branch lengths with a smaller font size
  g = g + geom_text(aes(x = branch, label = round(branch, 2)), size = 3, color = "black", vjust = -0.5, hjust = 0)
  
  return(g)
}

# Generate plots for each tree
g_NNC    = plot_tree(tree_NNC, "Nearest Neighbor Clustering (NNC)", 0.3)
g_FN     = plot_tree(tree_FN, "Furthest Neighbor (FN)", 0.3)
g_WPGMA  = plot_tree(tree_WPGMA, "Weighted Pair Group Method with Arithmetic Mean (WPGMA)", 0.3)
g_UPGMC  = plot_tree(tree_UPGMC, "Unweighted Pair-Group Centroid Method (UPGMC)", 0.3)
g_WPGGMC = plot_tree(tree_WPGGMC, "Weighted Pair-Group Centroid Method (WPGGMC)", 0.3)
g_UPGMA  = plot_tree(tree_UPGMA, "Unweighted Pair Group Method with Arithmetic Mean (UPGMA)", 0.3)
g_ward   = plot_tree(tree_WARD, "Ward's Minimum Variance Method", 0.3)

# Arrange the plots in a grid
grid.arrange(g_NNC, g_FN, g_WPGMA, g_UPGMC, g_WPGGMC, g_UPGMA, g_ward, nrow = 4, ncol = 2)

Challenge 04

The primates_14 dataset is a classic set of molecular data representing 14 primate species. This dataset includes DNA sequences that allow us to study evolutionary relationships amongst these species. By comparing the genetic information, we can infer how closely related different primate species are and develop hypotheses about their evolutionary history. It can be downloaded from here.

Your task is to construct phylogenetic trees using both the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and WPGMA (Weighted Pair Group Method with Arithmetic Mean) methods on the primates_14 dataset. Use read.dna(“primates_14.phylip”) to read the sequence file in phylip format.

2.5 Additive trees

Additive trees, also known as metric trees, are a type of phylogenetic tree where the branch lengths accurately represent the genetic distance or evolutionary change between species or sequences. Two common methods for constructing additive trees are Neighbor Joining (NJ) and Bio-Neighbor Joining (BioNJ).

You can apply both methods as demonstrated below, using the example dataset comprising four sequences.

1. Neighbor Joining (NJ)

Neighbor Joining is a distance-based method for constructing phylogenetic trees. The algorithm starts with a star tree (a tree in which all nodes are directly connected to a central node) and iteratively joins the pair of nodes that minimizes the total branch length at each stage. It continues this process until only two nodes remain, connecting them to complete the tree.

# Distance
# Calculate the Hamming distance matrix for the given phylogenetic data
D_hamming = dist.hamming(phyDat_msa_sample)

# Compute the Neighbor Joining tree
nj_tree = nj(D_hamming)
nj_tree$edge.length[which(nj_tree$edge.length<0)]=0
nj_tree = midpoint(multi2di(nj_tree))

2. Bio-Neighbor Joining (BioNJ)

BioNJ is an improvement over the Neighbor Joining method. It aims to minimize the variance of the branch lengths estimation, thus providing more accurate branch lengths and topologies, especially when evolutionary rates are variable among lineages.

# Compute the Neighbor Joining tree
bionj_tree = bionj(D_hamming)
bionj_tree$edge.length[which(bionj_tree$edge.length<0)]=0
bionj_tree = midpoint(multi2di(bionj_tree))

# Create plots for the UPGMA and WPGMA trees
# Specify titles for each plot
g_NJ    = plot_tree(nj_tree,    "Neighbor Joining (NJ)",0.3)
g_BioNJ = plot_tree(bionj_tree, "Bio-Neighbor Joining (BioNJ)",0.3)

# Organize the two plots in a one-row, two-column grid
grid.arrange(g_NJ, g_BioNJ, nrow = 1, ncol = 2)

Challenge 04

Your task is to construct phylogenetic trees using both the NJ (Neighbor Joining) and BioNJ (Bio-Neighbor Joining) methods on the primates_14 dataset.

2.6 Min. Evolution and L. Squares

After exploring Neighbor Joining and Bio-Neighbor Joining methods, let’s delve into two additional approaches for constructing phylogenetic trees: Minimum Evolution (ME) and Least Squares (LS).

1. Minimum Evolution (ME)

Minimum Evolution is a distance-based method that aims to find the tree topology which minimizes the total branch length. This method can utilize Least Squares techniques, such as OLS and WLS, to minimize the sum of the squared differences between the observed distances in the distance matrix and the patristic distances (tree-derived distances).

# Compute the Minimum Evolution tree
me_tree = fastme.bal(D_hamming)
me_tree = multi2di(me_tree)

2. Least Squares (LS)

Least Squares methods in phylogenetic reconstruction aim to minimize the sum of the squared differences between observed distances (from the distance matrix) and the distances predicted by the tree (patristic distances). These methods are categorized into Ordinary Least Squares (OLS) and Weighted Least Squares (WLS), depending on whether equal or different weights are assigned to the distances.

# Compute the Minimum Evolution tree
ols_tree = fastme.ols(D_hamming)
ols_tree = multi2di(ols_tree)

# Organize the two plots in a one-row, two-column grid
g_me_tree  = plot_tree(me_tree,  "Balanced ME",0.55)
g_ols_tree = plot_tree(ols_tree, "Ordinary Least Squares",0.55)
grid.arrange(g_me_tree, g_ols_tree, nrow = 1, ncol = 2)

Both ME scores can be calculated as follows:

# The function takes a phylogenetic tree as input and returns the ME score
ME_score = function(tree_input) {
  # The ME score is the sum of the lengths of all the edges in the tree
  # Extract the edge lengths from the input tree and calculate the ME score
  res_score = round(sum(tree_input$edge.length),2)
  
  # Return the calculated ME score
  return(res_score)
}

# Calculate and print the ME scores for the given trees (me_tree and ols_tree)
print(paste("Minimum Evolution Score for me_tree:", ME_score(me_tree)))
## [1] "Minimum Evolution Score for me_tree: 0.58"
print(paste("Minimum Evolution Score for ols_tree:", ME_score(ols_tree)))
## [1] "Minimum Evolution Score for ols_tree: 0.58"

Also, the LS scores can be calculated as follows:

# Define a function to calculate the Minimum Evolution (ME) score using Least Squares (LS)
# The function takes a tree and a distance matrix as input and returns the LS score
LS_score = function(tree_input, distance_input) {
  # Calculate the cophenetic matrix from the input tree
  # The cophenetic matrix represents the patristic (tree-derived) distances between each pair of taxa
  D = cophenetic(tree_input)
  
  # Extract the input distance matrix
  d = distance_input
  
  # Calculate the LS score by summing the squared differences between the observed distances (d)
  # and the cophenetic distances (D)
  res_score = round(sum((D - d)^2),2)
  
  # Return the calculated LS score
  return(res_score)
}

# Calculate and print the ME scores for the given trees (me_tree and ols_tree)
# using the provided Hamming distance matrix (D_hamming)
print(paste("Minimum Evolution Score for me_tree:", LS_score(me_tree, D_hamming)))
## [1] "Minimum Evolution Score for me_tree: 0.91"
print(paste("Minimum Evolution Score for ols_tree:", LS_score(ols_tree, D_hamming)))
## [1] "Minimum Evolution Score for ols_tree: 0.91"

Challenge 05

Your task is to construct phylogenetic trees using both the BAL and OLS methods on the primates_14 dataset, calculating their ME and LS scores.

2.7 Maximum parsimony

Maximum Parsimony is a character-based method used in building phylogenetic trees. This method aims to identify the tree topology that requires the fewest evolutionary changes to explain the observed data—essentially, the simplest, most “parsimonious” tree. The optimization process may take a significant amount of time for large datasets.

The processing of the four sequences example dataset can be summarised as follows:

suppressWarnings({
  suppressMessages({
    options(verbose = FALSE)
    
    # Distance Calculation
    # Compute the Hamming distance matrix for the given aligned sequences
    # This will serve as a measure of pairwise sequence dissimilarity
    D_hamming = dist.hamming(phyDat_msa_sample)

    # Base Tree Construction
    # Compute the Neighbor Joining (NJ) tree as a base tree
    # NJ is a distance-based method for constructing phylogenetic trees
    nj_tree = nj(D_hamming)

    # Correct Negative Branch Lengths
    # Negative branch lengths are not biologically meaningful,
    # hence set any negative branch lengths to 0
    nj_tree$edge.length[which(nj_tree$edge.length < 0)] = 0

    # Rooting the Tree
    # Root the NJ tree at its midpoint to ensure ultrametricity,
    # and convert any multifurcating nodes to bifurcating nodes
    nj_tree = midpoint(multi2di(nj_tree))

    # Parsimony Optimization
    # Optimize the parsimony of the NJ tree
    # The optim.parsimony function refines the tree topology to minimize the parsimony score
    par_tree = optim.parsimony(nj_tree, phyDat_msa_sample)

    
    #Edge estimation
    par_tree=acctran(par_tree, phyDat_msa_sample)
    par_tree=midpoint(par_tree)
    
    # Calculating and Printing Parsimony
    # Calculate and print the parsimony score of the optimized tree
    # Lower parsimony scores indicate more parsimonious (simpler) trees
    print(paste("Parsimony Score for the optimized tree:", parsimony(par_tree, phyDat_msa_sample)))
  })
})
## [1] "Parsimony Score for the optimized tree: 6"
g_par_tree = plot_tree(par_tree, "Parsimony tree",2.8)
grid.arrange(g_par_tree, nrow = 1, ncol = 1)

Challenge 06

Your task is to construct phylogenetic trees, optimizing for parsimony, using the “primates_14 dataset” and calculate the corresponding scores.

2.8 Maximum likelihood

Maximum Likelihood (ML) is a statistical approach used in phylogenetic reconstruction to find the tree topology that maximizes the probability of observing the given sequence data under a specific model of evolution. It evaluates different tree topologies and branch lengths to identify the most likely phylogenetic tree that explains the observed data.

Steps in Maximum Likelihood Method:

  1. Model Selection. Choose an appropriate substitution model that best fits the sequence data (e.g., Jukes-Cantor, Kimura 2-parameter, GTR).

  2. Tree Search. Explore different tree topologies and branch lengths to find the tree that maximizes the likelihood of the observed data.

  3. Branch Length Optimization. For each tree topology, optimize the branch lengths to maximize the likelihood value.

  4. Tree Evaluation. Assess the reliability of the inferred tree using methods such as bootstrap resampling or Bayesian posterior probabilities.

Implementation in R: In R, the phangorn package can be used to perform Maximum Likelihood estimation for phylogenetic reconstruction. The processing of the four sequences example dataset can be summarised as follows:

# Suppress warnings and messages
suppressWarnings({
  suppressMessages({
    options(verbose = FALSE)
    
    # Distance Calculation
    # Compute the Hamming distance matrix for the given aligned sequences
    # This will serve as a measure of pairwise sequence dissimilarity
    D_hamming = dist.hamming(phyDat_msa_sample)

    # Base Tree Construction
    # Compute the Neighbor Joining (NJ) tree as a base tree
    # NJ is a distance-based method for constructing phylogenetic trees
    nj_tree = nj(D_hamming)

    # Correct Negative Branch Lengths
    # Negative branch lengths are not biologically meaningful,
    # hence set any negative branch lengths to 0
    nj_tree$edge.length[which(nj_tree$edge.length < 0)] = 0
    
    # Model Testing
    # Conduct a model test to determine the best-fitting evolutionary model based on AIC
    mt = modelTest(phyDat_msa_sample, nj_tree, model = c("JC", "K80", "HKY", "GTR"))
    var_mt = attr(mt, "env")
    evolutionary_model = eval(get(mt$Model[which.min(mt$AIC)], var_mt), var_mt)
    
    # Initialize PML Tree
    # Create an initial PML (Phylogenetic Maximum Likelihood) tree using the NJ tree and selected evolutionary model
    initial_pml_tree = pml(tree = nj_tree, data = phyDat_msa_sample, model = evolutionary_model$model,
                  bf = evolutionary_model$bf, Q = evolutionary_model$Q, inv = evolutionary_model$inv,
                  k = evolutionary_model$k, shape = evolutionary_model$shape)
    
    # Optimize PML Tree
    # Optimize the PML tree through various parameters and tree rearrangement methods
    pml_tree = optim.pml(initial_pml_tree, model = evolutionary_model$model, optInv = T, optGamma = F,
                         rearrangement = "stochastic", optBf = F, optQ = F,
                         optEdge = F, optNni = TRUE, control = pml.control(trace = 0))
  })
})
## Model        df  logLik   AIC      BIC
##           JC 5 -31.34555 72.69109 73.67722 
##         JC+I 6 -31.34555 74.69109 75.87444 
##      JC+G(4) 6 -31.34565 74.69129 75.87464 
##    JC+G(4)+I 7 -31.34565 76.69129 78.07187 
##          K80 6 -31.34475 74.6895 75.87285 
##        K80+I 7 -31.34475 76.6895 78.07007 
##     K80+G(4) 7 -31.34486 76.68971 78.07029 
##   K80+G(4)+I 8 -31.34486 78.68971 80.26751 
##          HKY 9 -31.18059 80.36119 82.13621 
##        HKY+I 10 -31.17647 82.35293 84.32518 
##     HKY+G(4) 10 -31.18063 82.36127 84.33351 
##   HKY+G(4)+I 11 -31.17658 84.35315 86.52262 
##          GTR 13 -30.55379 87.10757 89.67149 
##        GTR+I 14 -29.75875 87.51751 90.27865 
##     GTR+G(4) 14 -30.15272 88.30545 91.06659 
##   GTR+G(4)+I 15 -29.75946 89.51893 92.4773
# Visualize the Optimized PML Tree
# Create a visualization of the optimized PML tree with a specified title
pml_tree$tree = multi2di(pml_tree$tree)
g_pml_tree = plot_tree(pml_tree$tree, "Likelihood tree", 2.8)
grid.arrange(g_pml_tree, nrow = 1, ncol = 1)

# Print the selected evolutionary model with a descriptive message
cat("The parameters for this evolutionary model are:\n")
## The parameters for this evolutionary model are:
print(pml_tree)
## model: JC 
## loglikelihood: -31.34555 
## unconstrained loglikelihood: -19.77502 
## 
## Rate matrix:
##   a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
## 
## Base frequencies:  
##    a    c    g    t 
## 0.25 0.25 0.25 0.25

Challenge 07

Your task is to construct phylogenetic trees, optimizing for likelihood, using the “primates_14 dataset” and calculate the corresponding scores.

2.9 Alignment-Free method

Alignment-free techniques for phylogenetic analysis have gained popularity due to their computational efficiency and scalability. K-mer based methods are a subset of these techniques that utilize substrings of fixed length k from the genomic sequences.

A k-mer is a substring of length k derived from a longer sequence. In phylogenetic analysis, k-mer frequencies within genomic sequences are compared to infer evolutionary relationships. This method is particularly useful for large datasets where traditional alignment methods are computationally expensive.

Procedure:

  1. K-mer Counting. For each sequence in the dataset, extract all possible k-mers and count their occurrences.

  2. Distance Calculation. Compute the pairwise distance between sequences based on their k-mer frequency profiles. Jaccard distance or cosine similarity can be used for this purpose.

3.Tree Construction. Use the calculated distance matrix to build a phylogenetic tree using methods like Neighbor-Joining or UPGMA.

The processing of the four sequences example dataset can be summarised as follows:

library("kmer")

kmer_distance = round(kdistance(phyDat_msa_sample, k = 5),2)

# Print Kmers distance matrix as an HTML table with a 20% width
kable(as.matrix(kmer_distance), format = "html", table.attr = "style='width:20%;'")
Seq 1 Seq 2 Seq 3 Seq 4
Seq 1 0.00 0.25 1.00 1.00
Seq 2 0.25 0.00 1.00 1.00
Seq 3 1.00 1.00 0.00 0.39
Seq 4 1.00 1.00 0.39 0.00
# Compute the Neighbor Joining tree
nj_km_tree = nj(kmer_distance)
nj_km_tree$edge.length[which(nj_km_tree$edge.length<0)]=0
nj_km_tree = midpoint(multi2di(nj_km_tree))


# Visualize the Optimized PML Tree
# Create a visualization of the optimized PML tree with a specified title
g_nj_km_tree = plot_tree(nj_km_tree, "NJ kmer tree", 2.8)
grid.arrange(g_nj_km_tree, nrow = 1, ncol = 1)

Challenge 08

Your task is to construct phylogenetic trees, by appliying the kmers-based strategy.

2.10 Bootstrapping

Bootstrapping is a resampling technique used to estimate the reliability of the branches (clades) in a phylogenetic tree. It involves generating multiple pseudo-replicate datasets by randomly sampling characters with replacement and then inferring a tree for each dataset.

It can be used as follows:

# Calculate the Hamming distance for the given phylogenetic data
distance = dist.hamming(phyDat_msa_sample)

# Construct a Neighbor Joining tree based on the calculated distance
nj_tree = NJ(distance)

# Resolve multifurcations in the tree
nj_tree = multi2di(nj_tree)

# Place the root of the tree at its midpoint
nj_tree = midpoint(nj_tree)

# Define a function that computes the NJ tree based on Hamming distance
f = function(x) nj(dist.hamming(x))

# Perform bootstrap analysis on the tree
bootstrapped_trees = boot.phylo(nj_tree, as.DNAbin(phyDat_msa_sample), f, trees = T)
## 
Running bootstraps:       100 / 100
## Calculating bootstrap values... done.
# Add bootstrap confidence values to the nodes of the original tree
bootstrapped_conf_tree = addConfidences(nj_tree, bootstrapped_trees$trees)

# Define a color palette for visualizing bootstrap support
color_palette = colorRampPalette(c("red3", "mediumseagreen"))(100)
color_positions = round(bootstrapped_conf_tree$node.label * 97, 0)
colors = color_palette[color_positions]

# Create a plot of the bootstrapped tree using ggtree
g = ggtree(bootstrapped_conf_tree, layout = "dendrogram", ladderize = T)
g = g + geom_tiplab(size = 4, color = "black")  # Add labels to terminal nodes
g = g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")
g = g + geom_nodepoint(size = 4, color = colors) + theme(legend.position='top')
g = g + labs(title = "NJ tree Bootstrapping (100 trees)")

# Create a data frame with values from 1 to 100 for color mapping
color_data = data.frame(x = 1:100, y = 1)
color_data$colors = color_palette

# Create a color bar plot using ggplot2
g1 = ggplot(color_data, aes(x = x, y = y, fill = colors)) +
  geom_bar(stat = "identity", width = 1, position = "identity") + scale_fill_manual(values = rev(color_data$colors)) +
  labs(title = NULL,
       x = "% of node support",
       y = NULL) +
  theme_minimal() + theme(axis.text.y = element_blank(),  # Remove y-axis labels
                          axis.title.y = element_blank(),
                          panel.grid = element_blank())  # Remove y-axis title
g1 = g1 + guides(fill = "none")

# Arrange the tree plot and the color bar plot
grid.arrange(g, g1, nrow = 2, ncol = 1, heights = c(0.8, 0.2))

Challenge 09

Apply bootrapping to the NJ tree inferred using the primates_14 dataset.

2.11 Phylogenetic forest

In the diverse and intricate field of phylogenetic analysis, a Phylogenetic Forest represents an advanced approach to discern the evolutionary relationships among various species or genes. This methodology is not confined to a single tree, but rather it encompasses a collection of phylogenetic trees, each offering a unique perspective on the evolutionary history of the organisms under study. Each tree within this ‘forest’ is derived from distinct datasets or analytical methodologies, hence providing a comprehensive and multifaceted view of phylogeny.

The concept of a Phylogenetic Forest arises from the acknowledgment that a solitary tree might not capture the full complexity and variability inherent in evolutionary processes. The presence of conflicting signals, due to events like horizontal gene transfer, recombination, or incomplete lineage sorting, may lead to uncertainties and discrepancies in phylogenetic reconstruction. By analyzing a ‘forest’ of trees, researchers are better equipped to address these challenges and extract robust and reliable insights regarding evolutionary relationships.

In this case, we will create a phylogenetic forest using KMeans.

# Load necessary libraries
library("clValid")          # For validation of clustering results
library("factoextra")       # For visualizing results

# Define a function for applying the KMEANS method
kmeans_method <- function(tree_distance_matrix) {
  
  # Analysis will be performed for a range of 2 to 3 clusters
  possible_k <- seq(2, 3)
  
  # Set seed for reproducibility
  set.seed(2)
  
  # Initialize a variable to store clustering quality indices
  clustering_quality_results <- NULL
  
  # Loop over possible k values to calculate Dunn, Connectivity and Silhouette indices
  for (i in possible_k) {
    # Perform clustering using KMEANS for different numbers of clusters
    clusters <- kmeans(tree_distance_matrix, i)
    
    # Calculate Silhouette score for the current number of clusters
    silhouette_score <- mean(silhouette(clusters$cluster, tree_distance_matrix)[,3])
    
    # Store results in a temporary variable
    tmp <- c("Kmeans", i, round(silhouette_score, 3))
    
    # Append results to the global variable clustering_quality_results
    clustering_quality_results <- rbind(clustering_quality_results, tmp)
  }
  
  # Set column names for the results data frame
  colnames(clustering_quality_results) <- c("Method", "k", "Silhouette")
  rownames(clustering_quality_results) <- NULL
  clustering_quality_results <- data.frame(clustering_quality_results)
  clustering_quality_results$Silhouette <- as.numeric(clustering_quality_results$Silhouette)
  
  # Return the clustering_quality_results data frame
  return(clustering_quality_results)
}

# Calculate Robinson-Foulds distance
trees <- c(nj_tree, bootstrapped_trees$trees)
tree_distance_bm <- RF.dist(trees, normalize = T, check.labels = F)
tree_distance_bm <- round(tree_distance_bm, 2)
tree_distance_bm <- as.matrix(tree_distance_bm)

# Apply KMEANS method and determine optimal number of clusters
quality_kmeans <- kmeans_method(tree_distance_bm)
optimal_k <- which.max(quality_kmeans$Silhouette)

# Perform kmeans clustering with optimal number of clusters
set.seed(2)
kmeans_results <- kmeans(tree_distance_bm, optimal_k)

# Visualization
coordinates <- fviz_cluster(kmeans_results, data = tree_distance_bm)$data
g <- fviz_cluster(kmeans_results, data = tree_distance_bm, ellipse.type = "norm",
                  labelsize = 10, repel = TRUE, show.clust.cent = FALSE, ggtheme = theme_classic(),
                  pointsize = 3, shape = "circle", palette = c("#00A499", "#695c59"),)
g <- g + labs(title = "Phylogenetic Forest - 4 Sequences")
g <- g + theme(axis.text = element_text(size = 12), legend.position = "bottom")

# Plot the result
plot(g)

Challenge 09

Create a phylogenetic tree for the primate_14 data set.