Tutorial - Bayesian Inference in BEAST2
Introduction
BEAST2 is a software to infer phylogenetic trees by Bayesian inference (BI). To learn more about BEAST see https://www.beast2.org/tutorials/. I prepared this tutorial with the goal of obtaining BI trees and used them as inputs for species delimitation analyses. We need to perform single-locus species delimitation analysis using mitochondrial and nuclear genes to evaluate whether cryptic species are present in our global amphibian dataset.
Getting started
The first step is to download BEAST2 from its website: https://www.beast2.org/. There are options to download the program for Windows, Mac, and Linux. In this tutorial, I first present the download and posterior analysis on a computer running Mac OS X. Another important part of this tutorial is to learn how to run BEAST remotely at CARC.
Note: It is very important to download the version 2.7.7 of BEAST to be compatible with version installed in CARC.
Once BEAST is downloaded, please move the new BEAST folder to the Applications folder.
Open the Beauti program
Beauti is a program within BEAST that imports our alignment in NEXUS format along with the sequences to be analyzed. We can open Beauti by double-clicking in the icon located in the folder or in the the Launchpad. In Beauti, additional information about the substitution model and other evolutionary parameters related to the search tree, as well as the Markov chain Monte Carlo (MCMC) algorithm, is provided. Beauti is the needed program to create the .xml file that serves as input for BEAST.
Load packages in Beauti
It is possible to install other BEAST2 programs using Beauti. We have to go to File, then to Manage Packages, and choose the packages we want to install. It is very important that Beauti be restarted after downloading the packages for proper package functionality.
Import alignment
In Beauti, go to File, click Import Alignment, then select the NEXUS alignment file you want to analyze. We can get this alignment file from AliView by saving our FASTA file in NEXUS format (as shown in the Figure). For this example, I am using a file with 43 CytB sequences of Acris crepitans.
Including the nucleotide substitution model
The next step is to include the best nucleotide substitution model obtained for our data, i.e. our alignment. For the same Acris crepitans data set, we calculated the model with the program ModelFinder implemented in IQ-TREE and the best model was HKY+F+G4. This model allows for not equal base frequencies, the F in the model (empirical base frequencies), with Gamma rate heterogeneity with 4 categories (G4). This information can be found in the .iqtree output file obtained from the IQ-TREE run. In this example, I opened it with the BBEdit text editor.
We need to include this information in the Site Model tab in Beauti. We choose the HKY model; we can change zero (0) to four (4) in the Gamma Category Count; this will display a new option called Shape, with a value of zero and the ‘estimate’ box activated with a checkmark. Since ModelFinder in IQ-TREE gives us the Shape value (Gamma shape alpha: 0.2763), we include it and uncheck the ‘estimate’ box. Finally, we change to Empirical in the Frequencies option. All other options can be left at their default values.
Molecular Clock
In the Clock Model tab, several options for the molecular clock appear. For the Acris crepitans data set, we choose the Strict Clock option because our alignment belongs to the same species (or population), and we can use a strict clock, which assumes that each sequence has the same substitution rate. We left the other values by default. We will encounter many scenarios in which our alignment will be composed of sequences from different species (or even genera). In this case, we need to choose the Optimised Relaxed Clock, which should be available if we previously installed the ORC package in Beauti (see Load packages in Beauti above). The relaxed clock model is widely used in phylogenetics and allows the estimation of different substitution rates among lineages.
Tree Priors
In Bayesian inference, we must specify prior probability distributions (priors) for the model parameters, including the tree (tree priors). In this example, we will use a Birth-Death model as the tree prior; this model is widely used to study how species (or individuals within a population) change over time (speciation and extinction processes). We choose this model over the Yule model (pure birth), which is generally used when there is only a single sequence per species in the alignment and is also frequently employed in cases where extinction rates are expected to be negligible. For this exercise, we can leave the remaining parameters unchanged.
Configuring MCMC Options
The final tab in Beauti is MCMC, which provides settings to control the operation of BEAST. First is the ‘Chain Length’ setting. Here, we specify the number of steps the MCMC chain will execute before terminating. This number depends on the size of our data set, the complexity of the model, and the desired quality of the results. The default value of 10,000,000 is entirely arbitrary and should therefore be adjusted based on the size of our specific data set. For instance, for this particular A. crepitans data set, I used 50 million. We will leave the ‘Store Every’, ‘Pre Burnin’, and ‘Num Initialization Attempts’ settings at their default values.
Below these general settings, we will find the logging settings. Each specific option can be viewed in detail by clicking the arrow on the left. We can control the names of the log files and the frequency with which values are stored in each file.
We will begin with the tracelog. The ‘Log Every’ parameter for the log file should be set relative to the total length of the chain. Sampling too frequently (e.g., every 100) will result in very large files with little additional benefit in terms of analytical precision. Sampling too sparsely (e.g., > 1000) means that the log file will not capture sufficient information regarding parameter distributions. For this analysis, we will configure BEAST2 to write to the log file every 1000 samples. We will leave the File Name unchanged.
Next, we expand the options to view the screenlog settings. We leave the default values unchanged, that is, the ‘File Name’ field blank and ‘Log Every’ set to 1000. Finally, we can modify the treelog options by clicking the left arrow; here, we do change the name in the File Name field (to Acrepitans.trees) while leaving the remaining options at their default settings.
Generating the XML File
We are now ready to create the BEAST2 XML file. This is the final configuration file that BEAST2 can use to run the analysis. We will save the XML file as Acrepitans.xml using File > Save. We are now ready to run the file using BEAST2.
Running BEAST in our computer
Now we run BEAST2 and select the XML file (Acrepitans.xml) from the location where we saved it. Optionally, we may also change the seed number, in the Random number seed option for the run. Optional: Check the Beagle Info if BEAGLE has been installed previously, this will allow the analysis to run more quickly. We launch BEAST2 by clicking the Run button.
BEAST2 will run until it reaches the specified number of steps in the chain. While running, it will print screenlog values to a console and store tracelog and treelog values in files located in the same folder as the XML file.
Running BEAST in CARC
If our alignment is very large (more than 200 sequences), we can run BEAST2 on CARC by using a Slurm script, which is a batch script used to define, submit, and manage computational tasks on a high-performance computing (HPC) cluster, such as CARC.
First, we need to copy our XML file generated in Beauti from local (our computer) to remote (CARC).
In our computer:
$ scp Acrepitans.xml lamador@hopper.alliance.unm.edu:/users/lamador/Documents/global_amph/BEAST2We can use Easley or Hopper clusters to run BEAST2. Xena cluster was retired on February 28, 2026.
$ ssh username@easley.alliance.unm.eduOnce in CARC, the first thing we do is navigate to the folder where we saved the XML file. Then, we open nano, and copy the script below. To saved it as a Slurm script, we do Ctrl + O, named the script (for example: AcrepitansBEAST.sh), enter, and Ctrl + X to exit nano.
#!/bin/bash
#SBATCH --partition=general
#SBATCH --job-name=AcrepitansBEAST # Job name
#SBATCH --output=_out_%j # Output file
#SBATCH --error=_err_%j
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=16
#SBATCH --time=24:00:00 # Time limit hrs:min:sec
#SBATCH --account=2016221
#SBATCH --mail-user=laamador@unm.edu
module load beast
beast Acrepitans.xmlRemember to use your own email and change the name of the species and XML file. Finally, we can submit the script using the sbatch command:
$ sbatch AcrepitansBEAST.shSee https://github.com/UNM-CARC/QuickBytes/blob/master/Beast_at_CARC.md for more information.
When BEAST finishes running in CARC, we need to copy two important files from remote to local. The following code will copy the files in the current directory of our computer, the Documents folder in the example.
In our computer:
~/Documents$ scp lamador@hopper.alliance.unm.edu:/users/lamador/Documents/global_amph/BEAST2/Acrepitans.log lamador@hopper.alliance.unm.edu:/users/lamador/Documents/global_amph/BEAST2/Acrepitans.trees .Checking for Convergence
To determine whether a specific chain length is adequate, the .log file resulting from the BEAST analysis can be analyzed using the program Tracer. The objective in setting the Chain Length in Beauti is to achieve a reasonable Effective Sample Size (ESS) (> 200).
Obtaining the Tree
We will now focus on obtaining a Bayesian tree, specifically, the maximum clade credibility tree. BEAST2 produces a posterior sample of phylogenetic trees (output file: .trees). These trees must be summarized before any conclusions can be drawn regarding the quality of the phylogenetic estimation. One method for summarizing these trees is to use the TreeAnnotator program, which is included with BEAST. This tool processes the set of trees to identify the tree with the strongest support. It also calculates the posterior clade probability for each node. This resulting tree (.tree or .tre) is referred to as the maximum clade credibility tree.
In TreeAnnotator we set the Burn-in percentage to 10% to discard the first 10% of the trees in our .trees file. We leave the Posterior probability limit unchanged, that is, at zero, which means that TreeAnnotator will annotate all nodes. We leave the next option at its default setting, and under Node heights, we select Mean heights. Finally, we must select the input file (the Acrepitans.trees file from our BEAST analysis), specify an output file named Acrepitans_MCC.tree, and run the program.
Tree Visualization
We can visualize the tree using one of the available software programs, such as FigTree. We open FigTree and then locate the Acrepitans_MCC.tree file. We can edit the appearance of the tree, check the node support, or save it in Newick or Nexus format for further analysis.
Is it ultrametric?
We can view our tree, but we do not yet know whether or not it is ultrametric. To determine if the tree we obtained in BEAST is ultrametric (a main requirement for species delimitation in GMYC), we can use R, specifically the ape package. We load our tree using the read.nexus function and then query R to see if our tree is ultrametric using the is.ultrametric function; if the response is ‘TRUE’, it means our tree is indeed ultrametric, as demonstrated in this line of code.
#install.packages("ape")
library(ape)
tree <- read.nexus(file = "Acrepitans_MCC.tree")
is.ultrametric(tree)
## [1] TRUEFollowing this tutorial we will obtain a decent Bayesian phylogenetic tree that will be used for single-locus species delimitation analyses.