This document gives you the basic steps for the ancestral state reconstruction of the FL DoH subtype B HIV dataset. To orient yourself, I'm providing a small South African cluster as an example. For the FL analyses you will need the following:

  • A .fasta file with all the sequences
  • An ML tree topology in Newick format
  • A .csv or .tsv file containing the metadata (date and locations)

NOTE! - Make sure that the IDs in the fasta file, in the tree file and in the metadata table are exactly the same
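
A mismatch between the three files is easy to miss by eye, so a quick automated check can help before running TreeTime. The following is a minimal Python sketch, not part of TreeTime: the function names are made up for illustration, the metadata ID column is assumed to be called name (as in the example output further down), and the Newick parsing is a simple regular expression that only handles plain, unquoted tip names.

```python
import csv
import io
import re


def fasta_ids(text):
    """Return the set of sequence IDs (first word after '>') in FASTA text."""
    ids = set()
    for line in text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def newick_tip_ids(text):
    """Return the set of tip names in a Newick string.

    Assumption: tip names are unquoted and contain none of ( ) , : ;
    """
    text = re.sub(r"\[[^\]]*\]", "", text)  # drop [...] annotations
    # A tip name directly follows '(' or ','; internal labels follow ')'.
    return {name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", text)}


def metadata_ids(text, column="name"):
    """Return the set of IDs found in one column of CSV text."""
    return {row[column] for row in csv.DictReader(io.StringIO(text))}


def report_mismatches(fasta_text, newick_text, metadata_text):
    """For each input, list the IDs that occur elsewhere but are missing from it."""
    id_sets = {
        "fasta": fasta_ids(fasta_text),
        "tree": newick_tip_ids(newick_text),
        "metadata": metadata_ids(metadata_text),
    }
    all_ids = set().union(*id_sets.values())
    return {label: sorted(all_ids - ids) for label, ids in id_sets.items()}
```

To use it on real files, read each one with pathlib.Path(...).read_text() and pass the three strings to report_mismatches. If the IDs agree, every list in the result is empty.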

We will perform the dating and ancestral state reconstruction in TreeTime from Richard Neher’s lab.

Installing treetime

Download a zip copy from the GitHub link. Move the directory to where you want it, and then from inside the folder you can just run:

pip install .

This should install treetime and all dependencies. Once installed you can run treetime to test.

Dating your tree

OK, now the first step is to turn your ML tree topology into a dated topology (i.e. branch lengths are in calendar time). This is done with the base treetime command. Something like this:

treetime --aln <input.fasta> --tree <input.nwk> --dates <dates.csv>

I'm using a single metadata table containing both the dates and the locations, so my example works like this:

treetime --aln 23106.fasta --tree 23106.phy.tbe.tree --dates 23106.metadata.csv 

And then you should see something like this:

Attempting to parse dates...
    Using column 'name' as name. This needs match the taxon names in the tree!!
    Using column 'date' as date.

0.00    -TreeAnc: set-up

1.95    TreeTime.reroot: with method or node: least-squares

1.96    TreeTime.reroot: rerooting will ignore covariance and shared ancestry.

2.16    TreeTime.reroot: with method or node: least-squares

2.16    TreeTime.reroot: rerooting will ignore covariance and shared ancestry.

3.14    ###TreeTime.run: INITIAL ROUND

11.21   TreeTime.reroot: with method or node: least-squares

11.21   TreeTime.reroot: rerooting will ignore covariance and shared ancestry.

11.40   ###TreeTime.run: ITERATION 1 out of 2 iterations

20.41   ###TreeTime.run: ITERATION 2 out of 2 iterations

28.75   ###TreeTime.run: CONVERGED

Inferred GTR model:
Substitution rate (mu): 1.0

Equilibrium frequencies (pi_i):
  A: 0.347
  C: 0.1842
  G: 0.2239
  T: 0.2349
  -: 0.01

Symmetrized rates from j->i (W_ij):
    A   C   G   T   -
  A 0   0.8202  3.0221  0.4694  0.6551
  C 0.8202  0   0.4392  2.9248  0.8487
  G 3.0221  0.4392  0   0.4261  0.7991
  T 0.4694  2.9248  0.4261  0   0.7764
  - 0.6551  0.8487  0.7991  0.7764  0

Actual rates from j->i (Q_ij):
    A   C   G   T   -
  A 0   0.2846  1.0487  0.1629  0.2273
  C 0.1511  0   0.0809  0.5388  0.1563
  G 0.6766  0.0983  0   0.0954  0.1789
  T 0.1103  0.687   0.1001  0   0.1824
  - 0.0065  0.0085  0.008   0.0077  0

Root-Tip-Regression:
 --rate:    8.327e-04
 --r^2:     0.09

--- saved tree as 
     2019-04-05_treetime/timetree.pdf

--- root-to-tip plot saved to  
    2019-04-05_treetime/root_to_tip_regression.pdf

--- alignment including ancestral nodes saved as  
     2019-04-05_treetime/ancestral_sequences.fasta

--- saved divergence times in 
     2019-04-05_treetime/dates.tsv

--- tree saved in nexus format as  
     2019-04-05_treetime/timetree.nexus

--- divergence tree saved in nexus format as  
     2019-04-05_treetime/divergence_tree.nexus

You can see that this runs pretty quickly on a small dataset of 110 pol sequences. Looking at the output you might think wow, this is bad!!! An R^2 of 0.09 and a rate of 8.327e-4, which is way off the ~2.0e-3 substitutions per site per year commonly used for HIV-1 pol subtype C. But looking at the tree in timetree.pdf, I'm happy! A tMRCA of ~1978 corresponds with what I expect for this cluster based on the last paper Marco and I published in Sci Reports. We might have to play around with some of the parameters if we find a weird tree.
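
To make the two Root-Tip-Regression numbers concrete: the rate is the slope of a straight-line fit of each tip's root-to-tip distance against its sampling date, r^2 measures how tightly the tips follow that line, and the date at which the line crosses zero distance gives a rough root date. The sketch below is a plain least-squares fit in Python on invented toy numbers (not the 23106 data). It is not TreeTime's own code, and the function name is hypothetical.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    Returns (rate, r_squared, root_date), where rate is the slope and
    root_date is where the fitted line crosses zero distance.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    syy = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_d - rate * mean_t
    r_squared = sxy ** 2 / (sxx * syy)
    root_date = -intercept / rate
    return rate, r_squared, root_date


# Toy tips lying exactly on a line (illustrative values only).
toy_dates = [2000.0, 2005.0, 2010.0, 2015.0]
toy_distances = [0.04, 0.05, 0.06, 0.07]
rate, r_squared, root_date = root_to_tip_regression(toy_dates, toy_distances)
print(rate, r_squared, root_date)
```

With real data the tips scatter around the line, so r^2 falls below 1, as with the 0.09 above.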

Convert output to nwk format

So the dating produced a folder named after the run date followed by _treetime (2019-04-05_treetime in my example). TreeTime writes all its output into this directory. There are two tree files: timetree.nexus and divergence_tree.nexus. You will be using the one called timetree.nexus. As you can see from the name, it is in Nexus format, whereas TreeTime takes its input tree in Newick format. So we need to convert this Nexus file into a Newick file. I do this in a simple text editor. Open the file in TextWrangler. You should see something like this:

#NEXUS
Begin Taxa;
 Dimensions NTax=104;
 TaxLabels ZA_DR_GT1853_2011.663 ZA_DR_GT2320_2012.375 ZA_DR_WC404_2014.121... 
End;
Begin Trees;
 Tree tree1=(((ZA_DR_GT1853_2011.663:30.53422... ...0.10000[&date=1978.05];
End;

Copy everything after Tree tree1= through to the end of that line into a new file and save it as, let's say, 23106.treetime.nwk in the working directory.
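
If you would rather not do this by hand, the same copy step can be scripted. This is a small Python sketch that assumes the layout shown above (one line that starts with Tree and holds the whole tree). The function name and the strip_comments option are hypothetical and not part of TreeTime.

```python
import re


def nexus_to_newick(nexus_text, strip_comments=False):
    """Return the Newick string from the first 'Tree <name>=...' line.

    By default the [&date=...] annotations are kept, exactly as when
    copying the line by hand. Set strip_comments=True to remove them.
    """
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("tree ") and "=" in stripped:
            # Everything after the first '=' is the tree itself.
            newick = stripped.split("=", 1)[1].strip()
            if strip_comments:
                newick = re.sub(r"\[[^\]]*\]", "", newick)
            return newick
    raise ValueError("no 'Tree <name>=' line found in the Nexus text")
```

For example, read 2019-04-05_treetime/timetree.nexus with pathlib.Path(...).read_text(), pass the text to nexus_to_newick, and write the result to 23106.treetime.nwk with Path(...).write_text().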

Mugration analyses

OK! Now we are ready for the ancestral state reconstruction. Here we are interested in a basic mugration analysis between two or more discrete states. In my example these states are different provinces of South Africa (e.g. KZN, GT, WC, etc.). For you it will be only two states (FL vs rest/outside). We do this with the mugration command of treetime like so:

treetime mugration --tree <input.nwk> --states <states.csv> --attribute <field>

So in my example the attribute field is region:

treetime mugration --tree 23106.treetime.nwk --states 23106.metadata.csv --attribute region

which produces this:

Completed mugration model inference of attribute 'region' for 23106.treetime.nwk

Saved inferred mugration model as: 2019-04-05_mugration/GTR.txt
Saved annotated tree as: 2019-04-05_mugration/annotated_tree.nexus

Now you have a dated tree with inferred ancestral states based on the locations of the external tips of your tree.

Inferring the number of transitions

OK! So David and I are working on a Python script that will read in the tree and write the time of each transition event from one state to another to a file. Should be done very shortly.
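
Until that script is ready, the basic idea can be sketched as follows: walk the annotated tree and, wherever a child node has a different inferred state from its parent, record the two states and the dates of the two nodes (the transition happened somewhere along that branch). This Python sketch is only an illustration and is not the script mentioned above. It works on a hand-built toy tree of nested dictionaries with made-up states and dates. Reading annotated_tree.nexus into that structure is left out, and all names are hypothetical.

```python
def find_transitions(root):
    """List (from_state, to_state, parent_date, child_date) for every
    branch whose two ends carry different states."""
    transitions = []
    stack = [root]  # explicit stack avoids recursion limits on deep trees
    while stack:
        node = stack.pop()
        children = node.get("children", [])
        for child in children:
            if child["state"] != node["state"]:
                transitions.append(
                    (node["state"], child["state"], node["date"], child["date"])
                )
        # Reverse so that children are visited in their listed order.
        stack.extend(reversed(children))
    return transitions


# Toy tree (illustrative values only, not real results).
toy_tree = {
    "name": "root", "state": "KZN", "date": 1978.0,
    "children": [
        {"name": "tipA", "state": "GT", "date": 2011.6},
        {
            "name": "node1", "state": "KZN", "date": 1990.0,
            "children": [
                {"name": "tipB", "state": "KZN", "date": 2012.3},
                {"name": "tipC", "state": "WC", "date": 2014.1},
            ],
        },
    ],
}

for from_state, to_state, start, end in find_transitions(toy_tree):
    print(f"{from_state} -> {to_state} between {start} and {end}")
```

Counting the returned tuples per (from_state, to_state) pair then gives the number of transitions in each direction.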