EGB - MC03: Introduction to Phylogenetic Analysis Using R

1. Introduction

Welcome to the first lesson on inferring phylogenetic trees from molecular sequences using R. This lesson provides the foundation for understanding evolutionary relationships among species or genes based on their molecular data. At its core, molecular phylogenetics employs DNA, RNA, or protein sequences to reconstruct evolutionary history. These relationships offer insights into the origin of life, gene function and structure, and the mechanisms that shape genome evolution. In the current genomics era, mastering these concepts is essential for both basic and applied research in biology.

The key components of this lesson include:

Multiple Sequence Alignment (MSA) in R. Before tree inference, sequences must be aligned to identify conserved and variable regions. We will examine how to use the msa package in R to perform and interpret these alignments.
Conceptual Foundations of Phylogenetics. We will define essential terms such as trees, branches, nodes, and clades, and explain the principles behind evolutionary tree construction.
Tree Inference Methods. Common inference strategies will be introduced, including Maximum Likelihood, Neighbour-Joining, and Parsimony.
Visualisation of Phylogenetic Trees. We will explore tools and techniques to effectively visualise and interpret phylogenetic trees using R.

2. Guided Activities

2.1 Multiple Sequence Alignment

Multiple sequence alignment (MSA) is a foundational step in molecular phylogenetics. By aligning homologous DNA, RNA, or protein sequences from different organisms or genes, we establish a framework for inferring evolutionary relationships. This process reveals conserved and variable regions, serving as a critical precursor to tree construction. The quality of the alignment directly influences the accuracy of subsequent phylogenetic analyses.

R provides several packages for molecular sequence analysis. Among them, the msa package stands out for its robustness and ease of integration. It offers access to widely used alignment algorithms, including ClustalW, ClustalOmega, and MUSCLE, all within the R environment. The package is part of the Bioconductor ecosystem, ensuring compatibility with downstream bioinformatics workflows.

Below is a basic example of how to install and load the msa package:

1. Installation and Setup

If you haven’t already, you need to install and then load the msa package.

# Load BiocManager (install if not already installed)
# install.packages("BiocManager")
library(BiocManager)

# Install the 'msa' package from Bioconductor
# BiocManager::install("msa")
library(msa)

Once loaded, the msa package provides functions for aligning sequences represented as DNAStringSet, RNAStringSet, or AAStringSet objects. These data structures are part of the Biostrings package and are specifically designed for efficient manipulation of biological sequences within R.

2. Performing an Alignment

With your sequences prepared—either entered manually or read from a FASTA file—you can perform a multiple sequence alignment using the msa

# Load the Biostrings package, which provides efficient data structures for biological sequences
library(Biostrings)

# Define a vector of DNA sequences
sequences <- c("ACTGGCTG", "ACTGCTG", "AGTGACT", "TGTGACTGA")

# Convert the character vector into a DNAStringSet object
dna_set <- DNAStringSet(sequences)

# Perform multiple sequence alignment using the ClustalW algorithm
msa_result_dna <- msa(dna_set, method = "ClustalW",verbose = F)

## use default substitution matrix

# Print the original sequences
print(dna_set)

## DNAStringSet object of length 4:
##     width seq
## [1]     8 ACTGGCTG
## [2]     7 ACTGCTG
## [3]     7 AGTGACT
## [4]     9 TGTGACTGA

# Print the alignment result
print(msa_result_dna)

## CLUSTAL 2.1  
## 
## Call:
##    msa(dna_set, method = "ClustalW", verbose = F)
## 
## MsaDNAMultipleAlignment with 4 rows and 9 columns
##     aln 
## [1] AGTGACT--
## [2] TGTGACTGA
## [3] ACTGGCTG-
## [4] -ACTGCTG-
## Con AGTG?CTG-

You can also perform multiple sequence alignment using amino acid sequences. In this example, we include a conservation score calculation using the BLOSUM62 substitution matrix:

# Define a set of amino acid sequences
aa_sequences <- c("ARNDCLTQ", "ARN", "ARMDCQK", "RRRTDGPSW")
aa_set <- AAStringSet(aa_sequences)

# Display the input amino acid sequences
print(aa_set)

## AAStringSet object of length 4:
##     width seq
## [1]     8 ARNDCLTQ
## [2]     3 ARN
## [3]     7 ARMDCQK
## [4]     9 RRRTDGPSW

# Perform multiple sequence alignment for protein sequences
msa_result_aa <- msa(aa_set, type = "protein")

## use default substitution matrix

print(msa_result_aa)

## CLUSTAL 2.1  
## 
## Call:
##    msa(aa_set, type = "protein")
## 
## MsaAAMultipleAlignment with 4 rows and 9 columns
##     aln 
## [1] -ARNDCLTQ
## [2] -ARN-----
## [3] -ARMDCQK-
## [4] RRRTDGPSW
## Con -ARNDC??-

# Load the BLOSUM62 substitution matrix
data(BLOSUM62)

# Calculate conservation scores for the aligned protein sequences
conservation_scores <- msaConservationScore(msa_result_aa, BLOSUM62)

# Print the conservation scores
print(conservation_scores)

##   -   A   R   N   D   C   ?   ?   - 
## -10  35  80  24  31   7 -19  -9 -16

This approach allows you to work directly with DNA or protein sequences, and evaluate conservation based on biologically meaningful scoring schemes such as BLOSUM62.

Challenge 01

The ALKBH5 gene belongs to a family of genes involved in DNA repair and demethylation. Its enzymatic activity plays a key role in gene regulation, cellular differentiation, and developmental processes. In humans and many other organisms, mutations or dysregulation of ALKBH5 have been associated with various health disorders, underscoring its biological relevance.

Now, let us consider a selection of ecologically and commercially important fish species: Salmo trutta (brown trout), Salmo salar (Atlantic salmon), Oncorhynchus kisutch (coho salmon), Oncorhynchus tshawytscha (Chinook salmon), and Oncorhynchus mykiss (rainbow trout). Comparative analysis of the ALKBH5 gene across these species can shed light on their evolutionary divergence, potential adaptive traits, and genomic conservation.

Your task is to perform a multiple sequence alignment of the ALKBH5 gene using the sequences provided in this FASTA file. To import the file, use the readDNAStringSet() function from the Biostrings package, which reads FASTA files directly into a compatible format for alignment with the msa package.

Make sure to examine the sequences for inconsistencies or missing data, and choose an appropriate alignment method (e.g. ClustalW, MUSCLE) based on the characteristics of the dataset and your biological hypothesis.

2.2 Pairwise Sequence Distances

After performing sequence alignment, the next essential step in phylogenetic inference is to compute pairwise distances. These distances provide a quantitative measure of divergence between sequences and are crucial for tree construction using distance-based methods such as UPGMA or Neighbour-Joining.

For our dataset, we will examine four common distance metrics: Hamming distance, \(p\)-distance, Jukes–Cantor (JC69), and Kimura 2-parameter (K80). Each of these offers a different perspective on how nucleotide substitutions are modelled. Understanding the assumptions behind each metric is vital when selecting the most appropriate method for your data.

We will illustrate each metric using a simple set of four aligned DNA sequences.

1. Hamming Distance

The Hamming distance is the most basic measure of divergence. It simply counts the number of positions at which two sequences differ. This metric ignores gaps and is only applicable to sequences of equal length.

# Load required libraries
library(phangorn)    # Tools for phylogenetic reconstruction
library(ape)         # Phylogenetic tree manipulation
library(knitr)       # For nicely formatted HTML tables

# Convert the alignment result to a phyDat object
phyDat_msa_sample <- as.phyDat(msa_result_dna)

# Rename the sequences for clarity
names(phyDat_msa_sample) <- c("Seq 1", "Seq 2", "Seq 3", "Seq 4")

# Compute the raw Hamming distance matrix
D_hamming <- dist.hamming(phyDat_msa_sample, ratio = FALSE)

# Round the values for display
D_hamming <- round(as.matrix(D_hamming), 2)

# Display the matrix
kable(D_hamming, format = "html", table.attr = "style='width:20%;'")

	Seq 1	Seq 2	Seq 3	Seq 4
Seq 1	0	1	2	4
Seq 2	1	0	3	4
Seq 3	2	3	0	3
Seq 4	4	4	3	0

2. P-distance

The p-distance is the proportion of differing nucleotide sites between two sequences. Unlike Hamming distance, this normalises the count by sequence length. It remains a raw, uncorrected measure of divergence.

# Compute the p-distance matrix (normalised version of Hamming)
D_p <- dist.hamming(phyDat_msa_sample, ratio = TRUE)

# Round and display the matrix
D_p <- round(as.matrix(D_p), 2)
kable(D_p, format = "html", table.attr = "style='width:20%;'")

	Seq 1	Seq 2	Seq 3	Seq 4
Seq 1	0.00	0.11	0.22	0.44
Seq 2	0.11	0.00	0.33	0.44
Seq 3	0.22	0.33	0.00	0.33
Seq 4	0.44	0.44	0.33	0.00

3. Edit Distance

The Levenshtein edit distance quantifies the minimum number of single-character operations (insertions, deletions, or substitutions) required to transform one sequence into another. While this metric is widely used in text and string comparison, it treats sequences as flat character vectors, without considering biological constraints or evolutionary models.

library(stringdist)

# Convert raw sequences into character strings (if not already)
seq_chars <- as.character(sequences)

# Compute the Levenshtein distance matrix
D_Edit <- stringdistmatrix(seq_chars, seq_chars, method = "lv")

# Assign readable names to the rows and columns
colnames(D_Edit) <- rownames(D_Edit) <- c("Seq 1", "Seq 2", "Seq 3", "Seq 4")

# Round and display the matrix
D_Edit <- round(D_Edit, 2)
kable(D_Edit, format = "html", table.attr = "style='width:20%;'")

	Seq 1	Seq 2	Seq 3	Seq 4
Seq 1	0	1	3	4
Seq 2	1	0	3	4
Seq 3	3	3	0	3
Seq 4	4	4	3	0

4. Jukes–Cantor (JC69)

The Jukes–Cantor model (1969) corrects the observed p-distance by accounting for the probability of multiple substitutions at the same site, including reversions. It assumes equal base frequencies and equal substitution rates between all nucleotides.

D_JC69 <- dist.dna(as.DNAbin(phyDat_msa_sample), model = "JC69")
D_JC69 <- round(as.matrix(D_JC69), 2)
kable(D_JC69, format = "html", table.attr = "style='width:20%;'")

	Seq 1	Seq 2	Seq 3	Seq 4
Seq 1	0.00	0.00	0.44	1.65
Seq 2	0.00	0.00	0.44	1.65
Seq 3	0.44	0.44	0.00	0.82
Seq 4	1.65	1.65	0.82	0.00

5. Kimura 2-Parameter (K80)

The Kimura two-parameter model (1980) improves upon JC69 by distinguishing between transitions and transversions, acknowledging that transitions occur more frequently in biological sequences.

D_K80 <- dist.dna(as.DNAbin(phyDat_msa_sample), model = "K80")
D_K80 <- round(as.matrix(D_K80), 2)
D_K80[is.na(D_K80)] <- 0
kable(D_K80, format = "html", table.attr = "style='width:20%;'")

	Seq 1	Seq 2	Seq 3	Seq 4
Seq 1	0.00	0.00	0.45	0.00
Seq 2	0.00	0.00	0.45	0.00
Seq 3	0.45	0.45	0.00	0.82
Seq 4	0.00	0.00	0.82	0.00

Challenge 02 Compute pairwise distances for the aligned ALKBH5 gene sequences. Include at least two additional distance metrics beyond those discussed above. Reflect on the following: Why might the Levenshtein distance be unsuitable for biological sequence comparison in a phylogenetic context? Consider how biological relevance and model assumptions affect the interpretability of distance values.

2.3 Preamble to Phylogenetic Trees

Understanding evolutionary relationships among species is central to molecular and evolutionary biology. Phylogenetic trees provide a graphical representation of these relationships, depicting the divergence of species from shared ancestors. In this section, we introduce fundamental tree operations using R.

1. Generating Random Trees

You can create random phylogenetic trees using the ape package.

set.seed(1)
random_tree <- rtree(5)
plot(random_tree)

2. Accessing and Modifying Tip Labels

To view or rename tip labels (species) in a tree:

labels <- random_tree$tip.label
print(labels)

## [1] "t2" "t1" "t3" "t4" "t5"

random_tree$tip.label <- c("Species 1", "Species 2", "Species 3", "Species 4", "Species 5")
plot(random_tree)

3. Adding or Removing Branches

You can remove or manually insert tips using drop.tip() and add.tips():

random_tree1 <- drop.tip(random_tree, "Species 3")
random_tree2 <- add.tips(random_tree1, "Species 3", "Species 1", edge.length = 1)
par(mfrow = c(1, 2))
plot(random_tree1)
plot(random_tree2)

4. Modifying Edge Lengths

Edge lengths can be edited directly through the edge.length attribute:

random_tree2$edge.length <- rep(1, length(random_tree2$edge.length))
plot(random_tree2)

5. Rooting Trees

Trees can be rooted manually using a tip or automatically at the midpoint:

rooted_tree1 <- root(random_tree, "Species 1")
rooted_tree2 <- midpoint(random_tree)
par(mfrow = c(1, 2))
plot(rooted_tree1)
plot(rooted_tree2)

6. Resolving Polytomies (Bifurcation)

If a tree contains multifurcations, it can be converted to a strictly bifurcating form:

random_tree1 <- drop.tip(random_tree, "Species 3")
random_tree1 <- add.tips(random_tree1, "Species 3", 7, edge.length = 1)
random_tree2 <- multi2di(random_tree1)
par(mfrow = c(1, 2))
plot(random_tree1)
plot(random_tree2)

7. Tree Visualisation with ggtree

The ggtree package enables elegant and flexible tree visualisation:

library(ggtree)
library(ggplot2)
library(gridExtra)

g1 <- ggtree(rooted_tree2, layout = "dendrogram", ladderize = TRUE, colour = "#00A499") +
       geom_tiplab(size = 4, colour = "black", hjust = 1, angle = 90) +
       geom_nodepoint(size = 3, colour = "#c7254e") +
       theme(plot.margin = margin(0, 0, 50, 0, unit = "pt"), legend.position = "top") +
       labs(title = "Midpoint-rooted tree (dendrogram)")

g2 <- ggtree(rooted_tree2, layout = "circular", ladderize = TRUE, colour = "#00A499") +
       geom_tiplab(size = 4, colour = "black", hjust = 1, angle = 90) +
       geom_nodepoint(size = 3, colour = "#c7254e") +
       theme(plot.margin = margin(0, 0, 50, 0, unit = "pt"), legend.position = "top") +
       labs(title = "Midpoint-rooted tree (circular)")

grid.arrange(g1, g2, ncol = 2)

8. Reading and Writing Trees (Newick & Nexus)

You can export and import trees using the Newick or Nexus formats:

# Write to file
write.tree(rooted_tree2, file = "output_tree.nwk")
write.nexus(rooted_tree2, file = "output_tree.nex")

# Read from file
tree_newick <- read.tree("output_tree.nwk")
tree_nexus  <- read.nexus("output_tree.nex")

Challenge 03 - Define the phylo and multiPhylo object types in R. - Use the functions is.rooted() and is.binary() to assess tree structure.

2.4 Ultrametric Trees

Ultrametric trees are a specific type of phylogenetic tree where the distance from the root to each tip is identical. These trees are commonly used to represent hierarchical clustering and evolutionary timelines from the last common ancestor. In this section, we will construct ultrametric trees using a variety of clustering algorithms.

The methods demonstrated below include: - Nearest Neighbour Clustering (NNC) - Furthest Neighbour (FN) - Weighted Pair Group Method with Arithmetic Mean (WPGMA) - Unweighted Pair Group Centroid Method (UPGMC) - Weighted Pair Group Centroid Method (WPGGMC) - Unweighted Pair Group Method with Arithmetic Mean (UPGMA) - Ward’s Minimum Variance Method

These trees are constructed from a Hamming distance matrix computed from a multiple sequence alignment.

# Calculate the Hamming distance matrix from the aligned sequences
D_hamming <- dist.hamming(phyDat_msa_sample)

# Nearest Neighbour Clustering (NNC)
tree_NNC <- upgma(D_hamming, method = "single")
tree_NNC <- midpoint(tree_NNC)

# Furthest Neighbour (FN)
tree_FN <- upgma(D_hamming, method = "complete")
tree_FN <- midpoint(tree_FN)

# WPGMA (Weighted Pair Group Method with Arithmetic Mean)
tree_WPGMA <- wpgma(D_hamming)
tree_WPGMA <- midpoint(tree_WPGMA)

# UPGMC (Unweighted Pair Group Centroid Method)
tree_UPGMC <- upgma(D_hamming, method = "centroid")
tree_UPGMC <- midpoint(tree_UPGMC)

# WPGGMC (Weighted Pair Group Centroid Method)
tree_WPGGMC <- wpgma(D_hamming, method = "centroid")
tree_WPGGMC <- midpoint(tree_WPGGMC)

# UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
tree_UPGMA <- upgma(D_hamming)
tree_UPGMA <- midpoint(tree_UPGMA)

# Ward's Minimum Variance Method
tree_WARD <- upgma(D_hamming, method = "ward.D2")
tree_WARD <- midpoint(tree_WARD)

# Define plotting function for ultrametric trees
plot_tree <- function(tree_plot, title_plot, max_x) {
  g <- ggtree(tree_plot, colour = "#00A499", size = 1) +
    geom_tiplab(size = 4, colour = "black", align = TRUE) +
    geom_nodepoint(size = 3, colour = "#c7254e") +
    labs(title = title_plot, size = 6) +
    xlim(0, max_x) +
    theme(
      axis.line = element_blank(),
      axis.text = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.margin = margin(t = 0, r = -5, b = 0, l = 0, unit = "pt"),
      legend.position = "top"
    ) +
    geom_text(aes(x = branch, label = round(branch, 2)), size = 3, colour = "black", vjust = -0.5, hjust = 0)
  return(g)
}

# Generate visualisations
g_NNC    <- plot_tree(tree_NNC,    "Nearest Neighbour Clustering (NNC)", 0.3)
g_FN     <- plot_tree(tree_FN,     "Furthest Neighbour (FN)", 0.3)
g_WPGMA  <- plot_tree(tree_WPGMA,  "WPGMA", 0.3)
g_UPGMC  <- plot_tree(tree_UPGMC,  "UPGMC", 0.3)
g_WPGGMC <- plot_tree(tree_WPGGMC, "WPGGMC", 0.3)
g_UPGMA  <- plot_tree(tree_UPGMA,  "UPGMA", 0.3)
g_WARD   <- plot_tree(tree_WARD,   "Ward's Minimum Variance", 0.3)

# Arrange all trees in a grid layout
grid.arrange(g_NNC, g_FN, g_WPGMA, g_UPGMC, g_WPGGMC, g_UPGMA, g_WARD, nrow = 4, ncol = 2)

Challenge 04 The primates_14 dataset contains DNA sequences for 14 primate species, enabling the study of their evolutionary relationships. You can download the dataset here.

Your task: 1. Use read.dna("primates_14.phylip") to import the dataset (PHYLIP format). 2. Construct phylogenetic trees using both UPGMA and WPGMA. 3. Compare the tree topologies. How do the methods affect the resulting tree structures?

2.5 Additive trees

Additive trees, also known as metric trees, are a type of phylogenetic tree where the branch lengths accurately represent the genetic distance or evolutionary change between species or sequences. Two common methods for constructing additive trees are Neighbor Joining (NJ) and Bio-Neighbor Joining (BioNJ).

You can apply both methods as demonstrated below, using the example dataset comprising four sequences.

1. Neighbor Joining (NJ)

Neighbor Joining is a distance-based method for constructing phylogenetic trees. The algorithm starts with a star tree (a tree in which all nodes are directly connected to a central node) and iteratively joins the pair of nodes that minimizes the total branch length at each stage. It continues this process until only two nodes remain, connecting them to complete the tree.

# Distance
# Calculate the Hamming distance matrix for the given phylogenetic data
D_hamming = dist.hamming(phyDat_msa_sample)

# Compute the Neighbor Joining tree
nj_tree = nj(D_hamming)
nj_tree$edge.length[which(nj_tree$edge.length<0)]=0
nj_tree = midpoint(multi2di(nj_tree))

2. Bio-Neighbor Joining (BioNJ)

BioNJ is an improvement over the Neighbor Joining method. It aims to minimize the variance of the branch lengths estimation, thus providing more accurate branch lengths and topologies, especially when evolutionary rates are variable among lineages.

# Compute the Neighbor Joining tree
bionj_tree = bionj(D_hamming)
bionj_tree$edge.length[which(bionj_tree$edge.length<0)]=0
bionj_tree = midpoint(multi2di(bionj_tree))

# Create plots for the UPGMA and WPGMA trees
# Specify titles for each plot
g_NJ    = plot_tree(nj_tree,    "Neighbor Joining (NJ)",0.3)
g_BioNJ = plot_tree(bionj_tree, "Bio-Neighbor Joining (BioNJ)",0.3)

# Organize the two plots in a one-row, two-column grid
grid.arrange(g_NJ, g_BioNJ, nrow = 1, ncol = 2)

Challenge 05

Your task is to construct phylogenetic trees using both the NJ (Neighbor Joining) and BioNJ (Bio-Neighbor Joining) methods on the primates_14 dataset.

2.6 Min Evolution and L Squares

After exploring Neighbor Joining and Bio-Neighbor Joining methods, let’s delve into two additional approaches for constructing phylogenetic trees: Minimum Evolution (ME) and Least Squares (LS).

1. Minimum Evolution (ME)

Minimum Evolution is a distance-based method that aims to find the tree topology which minimizes the total branch length. This method can utilize Least Squares techniques, such as OLS and WLS, to minimize the sum of the squared differences between the observed distances in the distance matrix and the patristic distances (tree-derived distances).

# ----------------- Minimum Evolution Tree (Balanced - BAL) -----------------

# Construct the Minimum Evolution tree using the balanced algorithm (FastME)
me_tree <- fastme.bal(D_hamming)

# Resolve polytomies (multifurcations) and ensure full bifurcation
me_tree <- multi2di(me_tree)

2. Least Squares (LS)

Least Squares methods in phylogenetic reconstruction aim to minimize the sum of the squared differences between observed distances (from the distance matrix) and the distances predicted by the tree (patristic distances). These methods are categorized into Ordinary Least Squares (OLS) and Weighted Least Squares (WLS), depending on whether equal or different weights are assigned to the distances.

# ----------------- Ordinary Least Squares Tree (OLS) -----------------

# Construct a tree using the OLS (Ordinary Least Squares) algorithm via FastME
ols_tree <- fastme.ols(D_hamming)

# Resolve multifurcations to ensure a fully bifurcating tree
ols_tree <- multi2di(ols_tree)

# ----------------- Plotting Both Trees -----------------

# Generate ggtree plots for both methods
g_me_tree  <- plot_tree(me_tree,  "Balanced Minimum Evolution (BAL)", 0.55)
g_ols_tree <- plot_tree(ols_tree, "Ordinary Least Squares (OLS)",     0.55)

# Display the two trees side-by-side
grid.arrange(g_me_tree, g_ols_tree, nrow = 1, ncol = 2)

Both ME scores can be calculated as follows:

# ----------------- Minimum Evolution (ME) Score Calculation -----------------

# Define a function to compute the Minimum Evolution (ME) score of a tree
# The ME score is the total branch length (sum of all edge lengths)
ME_score <- function(tree_input) {
  total_length <- sum(tree_input$edge.length)
  return(round(total_length, 2))
}

# Compute and display ME scores for both BAL and OLS trees
cat("Minimum Evolution Score (BAL):", ME_score(me_tree), "\n")

## Minimum Evolution Score (BAL): 0.58

cat("Minimum Evolution Score (OLS):", ME_score(ols_tree), "\n")

## Minimum Evolution Score (OLS): 0.58

Also, the LS scores can be calculated as follows:

# ------------------ Least Squares (LS) Score Calculation -------------------

# Define a function to compute the LS score between a tree and a distance matrix
# The LS score quantifies the squared differences between observed (input) distances
# and the patristic distances derived from the tree topology
LS_score <- function(tree_input, distance_input) {
  D_tree <- cophenetic(tree_input)       # Patristic distances from tree
  D_data <- distance_input               # Observed distances (e.g. Hamming)
  score <- sum((D_tree - D_data)^2)      # Squared differences
  return(round(score, 2))                # Rounded result
}

# Compute and display LS scores for both BAL and OLS trees using D_hamming
cat("Least Squares Score (BAL):", LS_score(me_tree, D_hamming), "\n")

## Least Squares Score (BAL): 0.91

cat("Least Squares Score (OLS):", LS_score(ols_tree, D_hamming), "\n")

## Least Squares Score (OLS): 0.91

Challenge 06

Your task is to construct phylogenetic trees using both the BAL and OLS methods on the primates_14 dataset, calculating their ME and LS scores.

2.7 Maximum parsimony

Maximum Parsimony is a character-based method used in building phylogenetic trees. This method aims to identify the tree topology that requires the fewest evolutionary changes to explain the observed data—essentially, the simplest, most “parsimonious” tree. The optimisation process may take a significant amount of time for large datasets.

The processing of the four sequences example dataset can be summarised as follows:

suppressWarnings({
  suppressMessages({
    options(verbose = FALSE)
    
    # Distance Calculation
    # Compute the Hamming distance matrix for the given aligned sequences
    # This will serve as a measure of pairwise sequence dissimilarity
    D_hamming = dist.hamming(phyDat_msa_sample)

    # Base Tree Construction
    # Compute the Neighbor Joining (NJ) tree as a base tree
    # NJ is a distance-based method for constructing phylogenetic trees
    nj_tree = nj(D_hamming)

    # Correct Negative Branch Lengths
    # Negative branch lengths are not biologically meaningful,
    # hence set any negative branch lengths to 0
    nj_tree$edge.length[which(nj_tree$edge.length < 0)] = 0

    # Rooting the Tree
    # Root the NJ tree at its midpoint to ensure ultrametricity,
    # and convert any multifurcating nodes to bifurcating nodes
    nj_tree = midpoint(multi2di(nj_tree))

    # Parsimony optimisation
    # optimise the parsimony of the NJ tree
    # The optim.parsimony function refines the tree topology to minimize the parsimony score
    par_tree = optim.parsimony(nj_tree, phyDat_msa_sample)

    
    #Edge estimation
    par_tree=acctran(par_tree, phyDat_msa_sample)
    par_tree=midpoint(par_tree)
    
    # Calculating and Printing Parsimony
    # Calculate and print the parsimony score of the optimised tree
    # Lower parsimony scores indicate more parsimonious (simpler) trees
    print(paste("Parsimony Score for the optimised tree:", parsimony(par_tree, phyDat_msa_sample)))
  })
})

## [1] "Parsimony Score for the optimised tree: 6"

g_par_tree = plot_tree(par_tree, "Parsimony tree",2.8)
grid.arrange(g_par_tree, nrow = 1, ncol = 1)

Challenge 07

Your task is to construct phylogenetic trees, optimising for parsimony, using the “primates_14 dataset” and calculate the corresponding scores.

2.8 Maximum likelihood

Maximum Likelihood (ML) is a statistical approach used in phylogenetic reconstruction to find the tree topology that maximizes the probability of observing the given sequence data under a specific model of evolution. It evaluates different tree topologies and branch lengths to identify the most likely phylogenetic tree that explains the observed data.

Steps in Maximum Likelihood Method:

Model Selection. Choose an appropriate substitution model that best fits the sequence data (e.g., Jukes-Cantor, Kimura 2-parameter, GTR).
Tree Search. Explore different tree topologies and branch lengths to find the tree that maximizes the likelihood of the observed data.
Branch Length optimisation. For each tree topology, optimise the branch lengths to maximize the likelihood value.
Tree Evaluation. Assess the reliability of the inferred tree using methods such as bootstrap resampling or Bayesian posterior probabilities.

Implementation in R: In R, the phangorn package can be used to perform Maximum Likelihood estimation for phylogenetic reconstruction. The processing of the four sequences example dataset can be summarised as follows:

# Suppress warnings and messages
suppressWarnings({
  suppressMessages({
    options(verbose = FALSE)
    
    # Distance Calculation
    # Compute the Hamming distance matrix for the given aligned sequences
    # This will serve as a measure of pairwise sequence dissimilarity
    D_hamming = dist.hamming(phyDat_msa_sample)

    # Base Tree Construction
    # Compute the Neighbor Joining (NJ) tree as a base tree
    # NJ is a distance-based method for constructing phylogenetic trees
    nj_tree = nj(D_hamming)

    # Correct Negative Branch Lengths
    # Negative branch lengths are not biologically meaningful,
    # hence set any negative branch lengths to 0
    nj_tree$edge.length[which(nj_tree$edge.length < 0)] = 0
    
    # Model Testing
    # Conduct a model test to determine the best-fitting evolutionary model based on AIC
    mt = modelTest(phyDat_msa_sample, nj_tree, model = c("JC", "K80", "HKY", "GTR"))
    var_mt = attr(mt, "env")
    evolutionary_model = eval(get(mt$Model[which.min(mt$AIC)], var_mt), var_mt)
    
    # Initialize PML Tree
    # Create an initial PML (Phylogenetic Maximum Likelihood) tree using the NJ tree and selected evolutionary model
    initial_pml_tree = pml(tree = nj_tree, data = phyDat_msa_sample, model = evolutionary_model$model,
                  bf = evolutionary_model$bf, Q = evolutionary_model$Q, inv = evolutionary_model$inv,
                  k = evolutionary_model$k, shape = evolutionary_model$shape)
    
    # optimise PML Tree
    # optimise the PML tree through various parameters and tree rearrangement methods
    pml_tree = optim.pml(initial_pml_tree, model = evolutionary_model$model, optInv = T, optGamma = F,
                         rearrangement = "stochastic", optBf = F, optQ = F,
                         optEdge = F, optNni = TRUE, control = pml.control(trace = 0))
  })
})

## Model        df  logLik   AIC      BIC
##           JC 5 -31.34555 72.69109 73.67722 
##         JC+I 6 -31.34555 74.69109 75.87444 
##      JC+G(4) 6 -31.34656 74.69312 75.87647 
##    JC+G(4)+I 7 -31.34656 76.69312 78.07369 
##          K80 6 -31.34475 74.6895 75.87285 
##        K80+I 7 -31.34475 76.6895 78.07007 
##     K80+G(4) 7 -31.34583 76.69166 78.07223 
##   K80+G(4)+I 8 -31.34583 78.69166 80.26946 
##          HKY 8 -31.30343 78.60686 80.18465 
##        HKY+I 9 -31.29952 80.59903 82.37406 
##     HKY+G(4) 9 -31.30383 80.60767 82.38269 
##   HKY+G(4)+I 10 -31.30059 82.60118 84.57342 
##          GTR 12 -30.5576 85.1152 87.4819 
##        GTR+I 13 -29.81092 85.62184 88.18576 
##     GTR+G(4) 13 -30.18203 86.36405 88.92797 
##   GTR+G(4)+I 14 -29.81845 87.63689 90.39804

# Visualize the optimised PML Tree
# Create a visualization of the optimised PML tree with a specified title
pml_tree$tree = multi2di(pml_tree$tree)
g_pml_tree = plot_tree(pml_tree$tree, "Likelihood tree", 2.8)
grid.arrange(g_pml_tree, nrow = 1, ncol = 1)

# Print the selected evolutionary model with a descriptive message
cat("The parameters for this evolutionary model are:\n")

## The parameters for this evolutionary model are:

print(pml_tree)

## model: JC 
## loglikelihood: -31.34555 
## unconstrained loglikelihood: -19.77502 
## 
## Rate matrix:
##   a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
## 
## Base frequencies:  
##    a    c    g    t 
## 0.25 0.25 0.25 0.25

Challenge 08

Your task is to construct phylogenetic trees, optimising for likelihood, using the “primates_14 dataset” and calculate the corresponding scores.

2.9 Alignment-Free method

Alignment-free techniques for phylogenetic analysis have gained popularity due to their computational efficiency and scalability. K-mer based methods are a subset of these techniques that utilize substrings of fixed length k from the genomic sequences.

A k-mer is a substring of length k derived from a longer sequence. In phylogenetic analysis, k-mer frequencies within genomic sequences are compared to infer evolutionary relationships. This method is particularly useful for large datasets where traditional alignment methods are computationally expensive.

Procedure:

K-mer Counting. For each sequence in the dataset, extract all possible k-mers and count their occurrences.
Distance Calculation. Compute the pairwise distance between sequences based on their k-mer frequency profiles. Jaccard distance or cosine similarity can be used for this purpose.

3.Tree Construction. Use the calculated distance matrix to build a phylogenetic tree using methods like Neighbor-Joining or UPGMA.

The processing of the four sequences example dataset can be summarised as follows:

library("kmer")

kmer_distance = round(kdistance(phyDat_msa_sample, k = 5),2)

# Print Kmers distance matrix as an HTML table with a 20% width
kable(as.matrix(kmer_distance), format = "html", table.attr = "style='width:20%;'")

	Seq 1	Seq 2	Seq 3	Seq 4
Seq 1	0.00	0.25	1.00	1.00
Seq 2	0.25	0.00	1.00	1.00
Seq 3	1.00	1.00	0.00	0.39
Seq 4	1.00	1.00	0.39	0.00

# Compute the Neighbor Joining tree
nj_km_tree = nj(kmer_distance)
nj_km_tree$edge.length[which(nj_km_tree$edge.length<0)]=0
nj_km_tree = midpoint(multi2di(nj_km_tree))


# Visualize the optimised PML Tree
# Create a visualization of the optimised PML tree with a specified title
g_nj_km_tree = plot_tree(nj_km_tree, "NJ kmer tree", 2.8)
grid.arrange(g_nj_km_tree, nrow = 1, ncol = 1)

Challenge 09
Build a phylogenetic tree for ALKBH5 sequences using a k-mer-based strategy, without global alignment.

Hint: Use oligonucleotideFrequency() to compute the k-mer profiles for each sequence, and then apply dist() on the resulting matrix to estimate the pairwise distances.

2.10 Bootstrapping

Bootstrapping is a resampling technique used to estimate the reliability of the branches (clades) in a phylogenetic tree. It involves generating multiple pseudo-replicate datasets by randomly sampling characters with replacement and then inferring a tree for each dataset.

It can be used as follows:

# ----------------------------- Bootstrapped NJ Tree with Node Support -----------------------------

# Calculate the Hamming distance matrix for the given phylogenetic alignment
distance <- dist.hamming(phyDat_msa_sample)

# Build a Neighbor Joining (NJ) tree based on the distance matrix
nj_tree <- nj(distance)

# Resolve multifurcations (convert multifurcating nodes into bifurcating)
nj_tree <- multi2di(nj_tree)

# Midpoint root the tree
nj_tree <- midpoint(nj_tree)

# Define a bootstrap function that builds an NJ tree from a resampled dataset
bootstrap_fun <- function(x) nj(dist.hamming(x))

# Perform bootstrap analysis using 100 trees
bootstrapped_trees <- boot.phylo(nj_tree, as.DNAbin(phyDat_msa_sample), bootstrap_fun, trees = TRUE)

## Running bootstraps:       100 / 100
## Calculating bootstrap values... done.

# Add bootstrap support values to the internal nodes of the original NJ tree
bootstrapped_conf_tree <- addConfidences(nj_tree, bootstrapped_trees$trees)

# Replace NA support value at the root with 1 (if present)
na_index <- which(is.na(bootstrapped_conf_tree$node.label))
bootstrapped_conf_tree$node.label[na_index] <- 1

# ----------------------------- Tree Visualization -----------------------------

# Define a color palette for bootstrap values (gradient from red to green)
color_palette <- colorRampPalette(c("red3", "mediumseagreen"))(100)

# Map node support values (scaled 0–100) to the color palette
color_indices <- round(bootstrapped_conf_tree$node.label * 100)
node_colors <- color_palette[color_indices]

# Define tip label colors (highlight a specific tip, e.g., tip 111)
tip_colors <- rep("black", length(bootstrapped_conf_tree$tip.label))

# Plot the tree with bootstrap node support
g <- ggtree(bootstrapped_conf_tree, layout = "dendrogram", ladderize = TRUE)
g <- g + geom_tiplab(size = 4, color = tip_colors)  # Add tip labels with specified colors
g <- g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")
g <- g + geom_nodepoint(size = 4, color = node_colors)  # Add colored nodes by support
g <- g + labs(title = "Representative consensus tree (medoid) with node support (100 trees)")
g <- g + theme(plot.title = element_text(size = 20))

# Create a data frame with values from 1 to 100 for color mapping
color_data = data.frame(x = 1:100, y = 1)
color_data$colors = color_palette

# Create a color bar plot using ggplot2
g1 = ggplot(color_data, aes(x = x, y = y, fill = colors)) +
  geom_bar(stat = "identity", width = 1, position = "identity") + scale_fill_manual(values = rev(color_data$colors)) +
  labs(title = NULL,
       x = "% of node support",
       y = NULL) +
  theme_minimal() + theme(axis.text.y = element_blank(),  # Remove y-axis labels
                          axis.title.y = element_blank(),
                          panel.grid = element_blank())  # Remove y-axis title
g1 = g1 + guides(fill = "none")

# Arrange the tree plot and the color bar plot
grid.arrange(g, g1, nrow = 2, ncol = 1, heights = c(0.8, 0.2))

Challenge 10

Apply bootstrapping to the NJ tree inferred using the primates_14 dataset.

2.11 Phylogenetic forest

In the diverse and intricate field of phylogenetic analysis, a Phylogenetic Forest represents an advanced approach to discern the evolutionary relationships among various species or genes. This methodology is not confined to a single tree, but rather it encompasses a collection of phylogenetic trees, each offering a unique perspective on the evolutionary history of the organisms under study. Each tree within this ‘forest’ is derived from distinct datasets or analytical methodologies, hence providing a comprehensive and multifaceted view of phylogeny.

The concept of a Phylogenetic Forest arises from the acknowledgment that a solitary tree might not capture the full complexity and variability inherent in evolutionary processes. The presence of conflicting signals, due to events like horizontal gene transfer, recombination, or incomplete lineage sorting, may lead to uncertainties and discrepancies in phylogenetic reconstruction. By analyzing a ‘forest’ of trees, researchers are better equipped to address these challenges and extract robust and reliable insights regarding evolutionary relationships.

In this case, we will create a phylogenetic forest using KMeans.

# ----------------------------- Load Libraries -----------------------------
library(clValid)      
library(factoextra)
library(ggplot2)

# ----------------------------- KMeans Method Function -----------------------------
kmeans_method <- function(tree_distance_matrix) {
  possible_k <- seq(2, 3)
  set.seed(2)
  clustering_quality_results <- NULL
  for (i in possible_k) {
    clusters <- kmeans(tree_distance_matrix, i)
    silhouette_score <- mean(silhouette(clusters$cluster, tree_distance_matrix)[,3])
    tmp <- c("Kmeans", i, round(silhouette_score, 3))
    clustering_quality_results <- rbind(clustering_quality_results, tmp)
  }
  colnames(clustering_quality_results) <- c("Method", "k", "Silhouette")
  clustering_quality_results <- data.frame(clustering_quality_results)
  clustering_quality_results$Silhouette <- as.numeric(clustering_quality_results$Silhouette)
  return(clustering_quality_results)
}

# ----------------------------- RF Distance -----------------------------
trees <- c(nj_tree, bootstrapped_trees$trees)
names(trees) <- c("original", paste0("bootstrap_", seq_along(bootstrapped_trees$trees)))

tree_distance_bm <- RF.dist(trees, normalize = TRUE, check.labels = FALSE)
tree_distance_bm <- round(as.matrix(tree_distance_bm), 2)

# ----------------------------- Clustering -----------------------------
quality_kmeans <- kmeans_method(tree_distance_bm)
optimal_k <- as.numeric(quality_kmeans$k[which.max(quality_kmeans$Silhouette)])
set.seed(2)
kmeans_results <- kmeans(tree_distance_bm, optimal_k)

# ----------------------------- Visualización Mejorada -----------------------------
# Extraer coordenadas y asignar nombres
coordinates <- fviz_cluster(kmeans_results, data = tree_distance_bm)$data
coordinates$tree_name <- rownames(tree_distance_bm)
coordinates$label <- ifelse(coordinates$tree_name == "original", "Original NJ", "Bootstrap")

# Visualización con etiqueta al NJ original
g <- fviz_cluster(kmeans_results, data = tree_distance_bm, ellipse.type = "norm",
                  labelsize = 10, repel = TRUE, show.clust.cent = FALSE,
                  ggtheme = theme_classic(), pointsize = 3, shape = "circle",
                  palette = c("#00A499", "#695c59", "lightgreen")) +
     labs(title = "Phylogenetic Forest - 4 Sequences") +
     theme(axis.text = element_text(size = 12), legend.position = "bottom") +
     geom_text(data = coordinates[coordinates$label == "Original NJ", ],
               aes(x = x, y = y, label = label), color = "black", size = 4, fontface = "bold")

# Plot final
plot(g)

Challenge 11

Create a phylogenetic forest (landscape) for the primate_14 data set using bootstrapping and an original NJ tree.