Final Project: CSB1020 Intro to R

Background
Tutorial
Conclusions
Appendices
- Appendix 1: Session Information
- Appendix 2: References

Background

Photosynthesis is a difficult and delicate operation that exerts huge selective pressure on large tracts of the plant genome to remain conserved. The plastid genome, for example, codes for key portions of the photosynthetic apparatus and so is usually very similar in terms of gene order and composition across the angiosperm phylogeny. When the spectre of photosynthesis is removed, though, as in the case of heterotrophic plants, these genomes have been known to undergo rapid evolution. Since heterotrophy has evolved independently several times in angiosperms (and always from ancestors that were photosynthetic), it is informative to identify repeated patterns in the accelerated plastome change exhibitd by different lineages of heterotrophic plants. Such analyses help us understand how the genome responds to drastic life history changes, and also allows us to identify genes and gene families that were previously thought to be purely photosynthetic in function but may actually have a dual role that makes them indispensible.

In some cases, sequence change in parasites compared to green relatives is drastic and easy to identify. Genes may be severely truncated, pseudogenized, or altogether missing. In other cases, the changes, if they exist, are far more subtle and usually take the form of higher-than-normal sequence diversity. However, it is often difficult to tell how much (if any) of the sequence diversity can be attributed to a transition to parasitism instead of simple phylogenetic distance. In these cases, conventional dN/dS (\(\omega\)) analyses of heterozygosity need to be considered, but may be misleading in isolation since pairwise comparisons with an outgroup don’t take into account the phylogenetic relationships in the ingroup of interest.

To tackle this problem, I’m also using a piece of software called FUBAR (a fast, unconstrained Bayesian approximation for inferring selection) (Murrell et al., 2013) which has the potential to provide unified heterozygosity signals across a given phylogeny (i.e., one for the entire phylogeny instead of individual signals for each OTU). The raw output from FUBAR takes the form of very rough codon-by-codon selection data (see attached files for examples) that then need to be wrangled and analysed to arrive at the aforementioned unified signals.

This tutorial takes two forms of input files: (1) FASTA files of pairwise gene alignments for \(\omega\) estimation, and (2) CSV files outputted by FUBAR. It walks through how both measures of hetrozygosity may be calculated from these data and then plots and analyses these results to be able to draw biologically meaningful conclusions. The reason this tutorial recommends the use of R for these analyses is that programmatic methods are more robust when dealing with large and unforgiving datasets like the ones used here.

Tutorial

Workspace Setup

Several R packages are required for this tutorial.

packages <- c("tidyverse", "dplyr", "stringr", "rebus", "ape", "kableExtra")

You can install these packages, if needed, by uncommenting the first line of the code below. Load the packages required for this tutorial.

# install.packages(packages, dependencies = TRUE)

lapply(packages, library, character.only = TRUE)

The working directory needs to be set to an appropriate location of the user’s choice where all the input files required for this tutorial can be accessed.

setwd("C:/Users/owner/Dropbox/Artemis_Artemis-L_Fileshare/CSB1020_Intro_to_R/Final_Project")

Part 1: Calculating \(\omega\) from gene alignment files

Reading gene alignment files

We’ll use the ‘read.FASTA’ function from the “ape” package to read the alignment files.

accD.fa <- read.FASTA("./ceratophorae_accD_alignment.fasta", type = "DNA")
atpB.fa <- read.FASTA("./ceratophorae_atpB_alignment.fasta", type = "DNA")
atpH.fa <- read.FASTA("./ceratophorae_atpH_alignment.fasta", type = "DNA")
clpP.fa <- read.FASTA("./ceratophorae_clpP_alignment.fasta", type = "DNA")
rpl20.fa <- read.FASTA("./ceratophorae_rpl20_alignment.fasta", type = "DNA")
ycf1.fa <- read.FASTA("./ceratophorae_ycf1_alignment.fasta", type = "DNA")
ycf2.fa <- read.FASTA("./ceratophorae_ycf2_alignment.fasta", type = "DNA")

# Note that each gene alignment has been imported as a list by the read.FASTA function.
str(accD.fa)

## List of 9
##  $ Cuscuta_obtusiflora_EU189133 Cuscuta obtusiflora chloroplast, complete genome: raw [1:1881] 04 04 04 04 ...
##  $ Cuscuta_boldinghii_1649                                                      : raw [1:1881] 04 04 04 04 ...
##  $ Cuscuta_bonafortunae_1221                                                    : raw [1:1881] 04 04 04 04 ...
##  $ Cuscuta_carnosa_1008                                                         : raw [1:1881] 04 04 04 04 ...
##  $ Cuscuta_chapalana_1222                                                       : raw [1:1881] 04 04 04 04 ...
##  $ Cuscuta_costaricensis_811                                                    : raw [1:1881] 04 04 04 04 ...
##  $ Cuscuta_erosa_1172                                                           : raw [1:1881] 88 18 48 88 ...
##  $ Cuscuta_mexicana_1184                                                        : raw [1:1881] 04 04 04 04 ...
##  $ Cuscuta_strobilacea_1227                                                     : raw [1:1881] 04 04 04 04 ...
##  - attr(*, "class")= chr "DNAbin"

The sequence identifiers for these files are inconsistent and long. If the \(\omega\) analyses were run using these identifiers, the results would be cumbersome to parse. Thus, the following block of code changes the sequence identifiers to be more managable.

# changing identifiers for accD

identifiers = NULL

for (i in 1:length(accD.fa)) {
  ident <- names(accD.fa[i])
  ident <- str_remove(ident, pattern = "Cuscuta_")
  ident <- str_remove(ident, pattern = ("_" %R% one_or_more(or(DGT, WRD, SPC, ",")) %R% END))
  identifiers <- c(identifiers, ident)
}

names(accD.fa) <- identifiers

# Since the sequences in this case were in the same order, we can just use the same variable for all the genes. If your files are not in the same order, you would need to run the for loop for each one.
names(atpB.fa) <- identifiers
names(atpH.fa) <- identifiers
names(clpP.fa) <- identifiers
names(rpl20.fa) <- identifiers 
names(ycf1.fa) <- identifiers
names(ycf2.fa) <- identifiers

# Now all our sequence identifiers are consistent and short.
print(names(ycf1.fa))

## [1] "obtusiflora"   "boldinghii"    "bonafortunae"  "carnosa"      
## [5] "chapalana"     "costaricensis" "erosa"         "mexicana"     
## [9] "strobilacea"

Calculating \(\omega\) values

Now we can run the \(\omega\) analyses. This technique calculates heterozygosity pairwise for each ingroup species against an outgroup. For the datasets used in this tutorial, the outgroup is Cuscuta obtusiflora, which is always the first species in our alignment lists.

accD.omega = NULL #This is the variable in which we'll capture the pairwise omega values for accD

for (i in 2:length(accD.fa)) {
  alignment <- c(accD.fa[1], accD.fa[i]) #This creates a pairwise alignment with our outgroup and each ingroup
  print(dnds(alignment, code = 1, codonstart = 1, quiet = TRUE))
  print("", quote = FALSE) #This is just to make the output look more appealing
  accD.omega <- c(accD.omega, dnds(alignment, quiet = TRUE))
}

##            obtusiflora
## boldinghii   0.5796014
## [1] 
##              obtusiflora
## bonafortunae   0.5675801
## [1] 
##         obtusiflora
## carnosa   0.5742759
## [1] 
##           obtusiflora
## chapalana   0.5434265
## [1] 
##               obtusiflora
## costaricensis   0.5663989
## [1] 
##       obtusiflora
## erosa   0.6914863
## [1] 
##          obtusiflora
## mexicana   0.6720022
## [1] 
##             obtusiflora
## strobilacea   0.4986071
## [1]

The following block of code does the same for all of the other genes, but without printing out the results.

atpB.omega = NULL
atpH.omega = NULL
clpP.omega = NULL
rpl20.omega = NULL
ycf1.omega = NULL
ycf2.omega = NULL

for (i in 2:length(atpB.fa)) {
  alignment <- c(atpB.fa[1], atpB.fa[i])
  atpB.omega <- c(atpB.omega, dnds(alignment, quiet = TRUE))
}

for (i in 2:length(atpH.fa)) {
  alignment <- c(atpH.fa[1], atpH.fa[i])
  atpH.omega <- c(atpH.omega, dnds(alignment, quiet = TRUE))
}

for (i in 2:length(clpP.fa)) {
  alignment <- c(clpP.fa[1], clpP.fa[i])
  clpP.omega <- c(clpP.omega, dnds(alignment, quiet = TRUE))
}

for (i in 2:length(rpl20.fa)) {
  alignment <- c(rpl20.fa[1], rpl20.fa[i])
  rpl20.omega <- c(rpl20.omega, dnds(alignment, quiet = TRUE))
}

for (i in 2:length(ycf1.fa)) {
  alignment <- c(ycf1.fa[1], ycf1.fa[i])
  ycf1.omega <- c(ycf1.omega, dnds(alignment, quiet = TRUE))
}

for (i in 2:length(ycf2.fa)) {
  alignment <- c(ycf2.fa[1], ycf2.fa[i])
  ycf2.omega <- c(ycf2.omega, dnds(alignment, quiet = TRUE))
}

Plotting \(\omega\) values

Now we can put all the pairwise \(\omega\) values we’ve generated into a data frame.

omega.df <- data.frame("Species" = identifiers[2:9], "accD" = accD.omega, "atpB" = atpB.omega, "atpH" = atpH.omega, "clpP" = clpP.omega, "rpl20" = rpl20.omega, "ycf1" = ycf1.omega, "ycf2" = ycf2.omega)

kable(head(omega.df), format = "pandoc", align = "c", digits = 4)

Species	accD	atpB	atpH	clpP	rpl20	ycf1	ycf2
boldinghii	0.5796	0.1887	0.1996	0.2237	0.4844	0.5574	0.6431
bonafortunae	0.5676	0.2046	0.2033	0.0641	0.3282	0.5681	0.7513
carnosa	0.5743	0.2313	0.1395	0.1035	0.3476	0.5705	0.6317
chapalana	0.5434	0.1854	0.1825	0.0534	0.3831	0.5953	0.6947
costaricensis	0.5664	0.1249	0.2055	0.0898	0.3785	0.5318	0.6524
erosa	0.6915	0.2190	0.1221	0.1842	0.4746	0.5585	0.6462

# Converting the wide format data frame generated above to a long format data frame that is more useful for plotting.

omega.df <- omega.df %>% gather(key = "Gene", value = "dNdS", convert = TRUE, factor_key = FALSE, -Species)

kable(head(omega.df), format = "pandoc", align = "c", digits = 4)

Species	Gene	dNdS
boldinghii	accD	0.5796
bonafortunae	accD	0.5676
carnosa	accD	0.5743
chapalana	accD	0.5434
costaricensis	accD	0.5664
erosa	accD	0.6915

Using this data frame, we can then create a facetted bar chart for our data.

ggplot(omega.df, aes(x = Species, y = dNdS)) +
  geom_col(width = 0.8) +
  facet_grid(.~Gene) +
  theme_linedraw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4, size = 7.5)) +
  theme(panel.grid.major.x = element_blank())

This plot shows that the \(\omega\) values for all of our genes in all of our species are less than 1. All these genes, then, are under purifying selection. However, genes atpB, atpH and clpP are under stronger purifying selection than the others. Could this mean that the selection on the other genes is weakening due to an absence of photosynthesis? Or that these three genes have some sort of other function in the plant cell that makes them indispensible?

Part 2: Analysing FUBAR output files

Reading FUBAR output files

Unlike the \(\omega\) analyses above, FUBAR supplies one output for the phylogeny as a whole for each gene (instead of pairwise by species). The outputs are datasets that contain codon-by-codon selection information. The phylogeny provided to FUBAR is shown below for your information.

The first step in analyzing these outputs is to read the CSV files they’re in.

accD.fb <- read.csv("./accD_align.nex.fubar.csv", header = TRUE, stringsAsFactors = FALSE)[1:6] #Reading only the pertinent columns
colnames(accD.fb) <- c("codon", "alpha", "beta", "diff", "prob_beta_more", "prob_alpha_more") #Changing column names

atpB.fb <- read.csv("./atpB_align.nex.fubar.csv", header = TRUE, stringsAsFactors = FALSE)[1:6]
colnames(atpB.fb) <- c("codon", "alpha", "beta", "diff", "prob_beta_more", "prob_alpha_more")

atpH.fb <- read.csv("./atpH_align.nex.fubar.csv", header = TRUE, stringsAsFactors = FALSE)[1:6]
colnames(atpH.fb) <- c("codon", "alpha", "beta", "diff", "prob_beta_more", "prob_alpha_more")

clpP.fb <- read.csv("./clpP_align.nex.fubar.csv", header = TRUE, stringsAsFactors = FALSE)[1:6]
colnames(clpP.fb) <- c("codon", "alpha", "beta", "diff", "prob_beta_more", "prob_alpha_more")

rpl20.fb <- read.csv("./rpl20_align.nex.fubar.csv", header = TRUE, stringsAsFactors = FALSE)[1:6]
colnames(rpl20.fb) <- c("codon", "alpha", "beta", "diff", "prob_beta_more", "prob_alpha_more")

ycf1.fb <- read.csv("./ycf1_align.nex.fubar.csv", header = TRUE, stringsAsFactors = FALSE)[1:6]
colnames(ycf1.fb) <- c("codon", "alpha", "beta", "diff", "prob_beta_more", "prob_alpha_more")

ycf2.fb <- read.csv("./ycf2_align.nex.fubar.csv", header = TRUE, stringsAsFactors = FALSE)[1:6]
colnames(ycf2.fb) <- c("codon", "alpha", "beta", "diff", "prob_beta_more", "prob_alpha_more")

Quantifying selection based on FUBAR results

To assess the kind of selection each gene is under, we have to quantify for how many codons \(\alpha\)>\(\beta\) (indicating purifying selection) vs. \(\beta\)>\(\alpha\) (indicating positive selection). Based on literature recommendation, we will use a significance threshold of 10%.

accD_pur = 0 #This variable will count the number of codons under purifying selection
accD_pos = 0 #This variable will count the number of codons under positive selection

for (i in 1:nrow(accD.fb)) {
  if (accD.fb$prob_beta_more[i]>=0.9)
    accD_pos <- accD_pos + 1
  if (accD.fb$prob_alpha_more[i]>=0.9)
    accD_pur <- accD_pur + 1
}

print(accD_pur)

## [1] 12

print(accD_pos)

## [1] 4

So for accD in this phylogeny, there are 12 codons significantly under purifying selection and 4 codons significantly under positive selection.

Now to conduct the same analysis for the remaining genes.

atpB_pur = 0; atpB_pos = 0

for (i in 1:nrow(atpB.fb)) {
  if (atpB.fb$prob_beta_more[i]>=0.9)
    atpB_pos <- atpB_pos + 1
  if (atpB.fb$prob_alpha_more[i]>=0.9)
    atpB_pur <- atpB_pur + 1
}

atpH_pur = 0; atpH_pos = 0

for (i in 1:nrow(atpH.fb)) {
  if (atpH.fb$prob_beta_more[i]>=0.9)
    atpH_pos <- atpH_pos + 1
  if (atpH.fb$prob_alpha_more[i]>=0.9)
    atpH_pur <- atpH_pur + 1
}

clpP_pur = 0; clpP_pos = 0

for (i in 1:nrow(clpP.fb)) {
  if (clpP.fb$prob_beta_more[i]>=0.9)
    clpP_pos <- clpP_pos + 1
  if (clpP.fb$prob_alpha_more[i]>=0.9)
    clpP_pur <- clpP_pur + 1
}

rpl20_pur = 0; rpl20_pos = 0

for (i in 1:nrow(rpl20.fb)) {
  if (rpl20.fb$prob_beta_more[i]>=0.9)
    rpl20_pos <- rpl20_pos + 1
  if (rpl20.fb$prob_alpha_more[i]>=0.9)
    rpl20_pur <- rpl20_pur + 1
}

ycf1_pur = 0; ycf1_pos = 0

for (i in 1:nrow(ycf1.fb)) {
  if (ycf1.fb$prob_beta_more[i]>=0.9)
    ycf1_pos <- ycf1_pos + 1
  if (ycf1.fb$prob_alpha_more[i]>=0.9)
    ycf1_pur <- ycf1_pur + 1
}

ycf2_pur = 0; ycf2_pos = 0

for (i in 1:nrow(ycf2.fb)) {
  if (ycf2.fb$prob_beta_more[i]>=0.9)
    ycf2_pos <- ycf2_pos + 1
  if (ycf2.fb$prob_alpha_more[i]>=0.9)
    ycf2_pur <- ycf2_pur + 1
}

Putting the FUBAR results into a table.

numcodons = c(nrow(accD.fb), nrow(atpB.fb), nrow(atpH.fb), nrow(clpP.fb), nrow(rpl20.fb), nrow(ycf1.fb), nrow(ycf2.fb))
numpur = c(accD_pur, atpB_pur, atpH_pur, clpP_pur, rpl20_pur, ycf1_pur, ycf2_pur)
numpos = c(accD_pos, atpB_pos, atpH_pos, clpP_pos, rpl20_pos, ycf1_pos, ycf2_pos)

fubar_results <- data.frame("Genes" = c("accD", "atpB", "atpH", "clpP", "rpl20", "ycf1", "ycf2"), "Codons" = numcodons, "Codons_under_Purifying_Selection" = numpur, "Codons_under_Positive_Selection" = numpos)

kable(fubar_results, format = "pandoc", align = "c")

Genes	Codons	Codons_under_Purifying_Selection	Codons_under_Positive_Selection
accD	627	12	4
atpB	494	124	0
atpH	81	9	0
clpP	198	6	0
rpl20	128	5	4
ycf1	1815	122	1
ycf2	1952	31	16

From these results we can see that all genes seem to be under purifying selection except for rpl20 which seems to be under neutral selection.

Statistical test of selection across the genome based on FUBAR results

To see if there’s a statistical difference in the degree of purifying selection across the genes sampled here, we’re going to run a one-way ANOVA. This test is chosen for us by our data because we have one continuous and one discrete variable that we’re comparing. The first step is to create an appropriate data frame.

# Creating the 'gene' column in each .fb data frame to enable the creation of a unified data frame
accD.fb <- accD.fb %>% mutate(gene = "accD")
atpB.fb <- atpB.fb %>% mutate(gene = "atpB")
atpH.fb <- atpH.fb %>% mutate(gene = "atpH")
clpP.fb <- clpP.fb %>% mutate(gene = "clpP")
rpl20.fb <- rpl20.fb %>% mutate(gene = "rpl20")
ycf1.fb <- ycf1.fb %>% mutate(gene = "ycf1")
ycf2.fb <- ycf2.fb %>% mutate(gene = "ycf2")

# Creating the unified dataset
purtable <- data.frame("Gene" = c(accD.fb$gene, atpB.fb$gene, atpH.fb$gene, clpP.fb$gene, rpl20.fb$gene, ycf1.fb$gene, ycf2.fb$gene), "Prob_Pur" = c(accD.fb$prob_beta_more, atpB.fb$prob_beta_more, atpH.fb$prob_beta_more, clpP.fb$prob_beta_more, rpl20.fb$prob_beta_more, ycf1.fb$prob_beta_more, ycf2.fb$prob_beta_more), stringsAsFactors = FALSE)

kable(head(purtable), format = "pandoc", align = "c")

Gene	Prob_Pur
accD	0.3866596
accD	0.3866596
accD	0.3866596
accD	0.3866596
accD	0.3866596
accD	0.3866596

Now we need to test our data to see if it meets the assumptions of ANOVA, which are the following:

Sample Independence
Normality
Homoscedasticity

Our first assumption (independence) is a difficult one to evaluate because even though these genes are separate and non-overlapping, they do come from the same assembly and high-throughput sequencing run. For the purposes of this tutorial, let us move forward assuming these data are independent.

We can perform more concrete tests for normality, though.

#First we create a histogram of purtable$Prob_Pur to visualize the data.
ggplot(purtable, aes(x = Prob_Pur)) +
  geom_histogram(binwidth = 0.01) +
  theme_bw()

# Checking the number of data points that we have.
nrow(purtable)

## [1] 5295

# Now we conduct a Kolmogorov-Smirnov test of normality. We cannot run the more traditional Shapiro-Wilk test because we have more than 5000 data points. Our null hypothesis is that these data are normally distributed.
ks.test(x = purtable$Prob_Pur, y = pnorm, alternative = "two.sided", exact = NULL)

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  purtable$Prob_Pur
## D = 0.50024, p-value < 2.2e-16
## alternative hypothesis: two-sided

The p-value from the KS test is far below our threshold of 0.05. This means that we reject the null hypothesis that our data is normally distributed.

To visualize this further, let’s see what the residuals look like in a Quantile-Quantile plot.

qqnorm(purtable$Prob_Pur)

Since our one-way ANOVA assumptions are violated, we’ll have to use a non-parametric test instead. The Kruskal-Wallis rank sum test does not require normality or homoscedasticity of data and so can be used here with no issues. Our null hypothesis is that there are no significant differences between our treatment groups.

kruskal.test(Prob_Pur~Gene, data = purtable)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Prob_Pur by Gene
## Kruskal-Wallis chi-squared = 1256, df = 6, p-value < 2.2e-16

The p-value is far below our significance threshold of 0.05. This means that we reject the null hypothesis that there are no sigificant differences between our treatment groups. In other words, the codon-by-codon probability of purifying selection is not consistent across the genes we have sampled.

However, the result above does not tell us which genes are undergoing differences in purifying selection relative to which others. In order to find that information, we need to use a post-hoc test. In this tutorial, we’ll use a Wilcoxon rank sum test to generate pairwise comparisons.

pairwise.wilcox.test(purtable$Prob_Pur, purtable$Gene, p.adjust.method = "BH")

## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  purtable$Prob_Pur and purtable$Gene 
## 
##       accD    atpB    atpH    clpP    rpl20   ycf1   
## atpB  < 2e-16 -       -       -       -       -      
## atpH  1.6e-08 3.0e-05 -       -       -       -      
## clpP  3.0e-07 < 2e-16 0.023   -       -       -      
## rpl20 0.210   < 2e-16 4.4e-08 9.7e-09 -       -      
## ycf1  3.3e-13 < 2e-16 0.020   0.996   7.0e-05 -      
## ycf2  < 2e-16 < 2e-16 < 2e-16 < 2e-16 4.8e-14 < 2e-16
## 
## P value adjustment method: BH

This test outputs a matrix that allows us to evaluate which genes show significantly different purifying selection relative to another gene. For example, rpl20 and accD don’t show a significant difference relative to each other (p=0.21), but atpH and clpP do (p=0.023).

Conclusions

After calculating and plotting \(\omega\) values for our genes of interest in Part 1, and then analyzing the results from FUBAR for the same genes in Part 2, we can confidently conclude that these genes are not under the effect of positive selection in response to these plants’ life history switch from photosynthetic to parasitic. Even though these genes are present in the plastid genome, and even though some of them are thought to be purely photosynthetic in function, it seems that there is selection acting to maintain these sequences in the genome. This may speak to these plants being cryptically photosynthetic (i.e., making use of limited photosynthesis at some point in their life cycle), or these results may be indicating that the genes sampled here have some sort of secondary functionality that makes them useful even in the absence of autotrophy.

Like these genes to their plants, I hope that this tutorial is somewhat useful to its users. If you have any questions or comments, I can be reached at arjan.banerjee@mail.utoronto.ca or at the profile ‘abanban93’ on GitHub.

Appendices

Appendix 1: Session Information

The following is a summary of how the R session was set up for this tutorial.

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.1.0 ape_5.3          rebus_0.1-3      forcats_0.4.0   
##  [5] stringr_1.4.0    dplyr_0.8.3      purrr_0.3.3      readr_1.3.1     
##  [9] tidyr_1.0.0      tibble_2.1.3     ggplot2_3.2.1    tidyverse_1.3.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2            lubridate_1.7.4       lattice_0.20-38      
##  [4] assertthat_0.2.1      zeallot_0.1.0         digest_0.6.21        
##  [7] plyr_1.8.4            R6_2.4.0              cellranger_1.1.0     
## [10] backports_1.1.5       reprex_0.3.0          evaluate_0.14        
## [13] httr_1.4.1            highr_0.8             pillar_1.4.2         
## [16] rlang_0.4.2           lazyeval_0.2.2        readxl_1.3.1         
## [19] rstudioapi_0.10       rebus.base_0.0-3      rmarkdown_1.16       
## [22] labeling_0.3          webshot_0.5.2         munsell_0.5.0        
## [25] broom_0.5.2           compiler_3.6.1        modelr_0.1.5         
## [28] xfun_0.10             pkgconfig_2.0.3       htmltools_0.4.0      
## [31] tidyselect_0.2.5      rebus.numbers_0.0-1   viridisLite_0.3.0    
## [34] crayon_1.3.4          dbplyr_1.4.2          withr_2.1.2          
## [37] grid_3.6.1            nlme_3.1-140          jsonlite_1.6         
## [40] gtable_0.3.0          lifecycle_0.1.0       DBI_1.0.0            
## [43] magrittr_1.5          scales_1.1.0          cli_1.1.0            
## [46] stringi_1.4.3         farver_2.0.1          reshape2_1.4.3       
## [49] fs_1.3.1              xml2_1.2.2            ellipsis_0.3.0       
## [52] generics_0.0.2        vctrs_0.2.0           tools_3.6.1          
## [55] glue_1.3.1            rebus.unicode_0.0-2   hms_0.5.2            
## [58] rebus.datetimes_0.0-1 parallel_3.6.1        yaml_2.2.0           
## [61] colorspace_1.4-1      rvest_0.3.5           knitr_1.25           
## [64] haven_2.2.0

Appendix 2: References

The data for this Tutorial has been taken from Ref 1 below. FUBAR was created by the authors of Ref 2 below and more information about it can be found in that paper. \(\omega\) was calculated in this tutorial using the package APE (Analysis of Phylogenetics and Evolution with R) and the authors for this package are cited in Ref 3 - Ref 6.

Ref 1: Banerjee, A. and Stefanovic, S. (2019) Caught in action: fine-scale plastome evolution in the parasitic plants of Cuscuta section Ceratophorae (Convolvulaceae). Plant Mol Biol, 100, 621-634.

Ref 2: Murrell, B., Moola, S., Mabona, A., Weighill, T., Sheward, D., Kosakovsky Pond, S.L. and Scheffler, K. (2013) FUBAR: a fast, unconstrained bayesian approximation for inferring selection. Mol Biol Evol, 30, 1196-1205.

Ref 3: Paradis, E. (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). New York: Springer.

Ref 4: Paradis, E., Claude, J. and Strimmer, K. (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289â“290.

Ref 5: Popescu, A.-A., Huber, K. T. and Paradis, E. (2012) ape 3.0: new tools for distance based phylogenetics and evolutionary analysis in R. Bioinformatics, 28, 1536â“1537.

Ref 6: Paradis, E. and Schliep, K. (2019) ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics, 35, 526â“528.