TutorialScript

Tutorial

# Step 0: Activate the claatu environment
conda activate claatu

#Step 1A: Prepare the Tree
python ../bin/prep_tree.py bacteria.tre -mid -up_bi -nbs
head new_prepped_tree.tre

#Step 1B: Count the CTU Matrix
python ../bin/count_tree.py otu.txt new_prepped_tree.tre ctus.txt
head ctus.txt

#Step 2A: Get Taxonomy
python ../bin/clade_stat.py new_prepped_tree.tre tax.txt . -p cladeStat
head cladeStat_nodes2tax.txt

#Step 2B: Get Taxonomy
python ../bin/tax_parser.py cladeStat_nodes2tax.txt tax_clades.txt
head tax_clades.txt

#Step 3: Get Clade Stats
python ../bin/node_info.py new_prepped_tree.tre . -p node_info
head cladeStat_clade_size.txt
head node_info_dist_median.txt
head node_info_dist.txt
head node_info_levels.txt

#Step 4A: Get Clade Signficance across all samples
python ../bin/ptest_tree.py otu.txt new_prepped_tree.tre ptest.txt -p 100
head ptest.txt
head ptest.txt_stats.txt

#Step 4B: Get Clade Significance across groups
python ../bin/ptest_tree.py otu.txt new_prepped_tree.tre groupPtest.txt -p 100 -g groupMap.txt
head groupPtest.txt
head groupPtest.txt_stats.txt

Visualization of ClaaTU output with R

ggtree is a phylogenetic tree viewer for different types of annotations. Here we will perform a few different visualizations of the data that we just got out of ClaaTU. Below, I’ve written comments to annotate our code. These start with a pound sign “#”. Note that blocks of code surrounded by a beige color are code that you type in, while text surrounded by a white box is example output of R.

# ggtree
#I should have alread installed ggtree for you, but if it wasn't installed for some reason, you will need to type something like this: 
#source("https://bioconductor.org/biocLite.R") 
#try http:// if https:// URLs are not supported
#biocLite("ggtree")

#ape
# I should have also installed this already for you, but if it isn't for some reason type: 
#install.packages(“ape”)

#Note you will have to remove the pound sign on these lines of code to install.

Next, load the libraries that we will use.

library("ape")
library("ggplot2")
library("ggtree")

## Warning: package 'ggtree' was built under R version 3.5.1

## ggtree v1.12.7  For help: https://guangchuangyu.github.io/software/ggtree
## 
## If you use ggtree in published research, please cite:
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628

## 
## Attaching package: 'ggtree'

## The following object is masked from 'package:ape':
## 
##     rotate

Plotting basics with ggtree

ggtree extends ggplot if you are familiar with it. Here are two ways to plot a basic tree using ggtree, and some basic parameters to manipulate to get different plots.

## Plot Trees
#Read the tree into R
t = read.tree('bacteria.tre')

# ggtree extends ggplot2 by adding a geom_tree layer. If you are familiar with ggplot, then maybe you prefer to plot a ggtree this way. If not, then this is more complicated and you can ignore it. 
ggplot(t, aes(x,y)) + geom_tree() + theme_tree()

#You can plot the tree the easy way by just typing:
ggtree(t)

#maybe add some color?
ggtree(t, color = "darkseagreen", size = 1, linetype = "dotted")

#Maybe we'd like to get rid of branch lengths
ggtree(t, branch.length="none")

# Change the layout
ggtree(t, layout="circular", branch.length="none")

# Display some nodes and tip labels
ggtree(t, layout = "circular", branch.length="none") + geom_point(aes(shape=isTip, color=isTip), size=2)

Labeling Nodes: Label CTUs significantly conserved across mammals.

Ok, we have plotted a basic ggtree. Now, let’s place some data that we’ve gotten from ClaaTU with it. This first bit of code will take all the significant clades of microbes conserved across mammalian samples, and label only these conserved CTUs with a dot.

# Read in the pValue table
pVals = read.table("ptest.txt_stats.txt", row.names = 1)

# Label the columns
colnames(pVals) = c("obsCore", "meanShuffCore", "sdShuffCore", "Z", "p")

# Select significant nodes. I've chosen a p value of 0.05 here. For the purposes of the tutorial, we didn't conduct for muliple hypothesis testing. You may want to consider this for your own data.
allSig = rownames(pVals)[which(pVals$p < 0.05)]
N = length(t$tip.label) + length(t$node.label)

# Make a dataframe that we will attach to our tree
d = data.frame(node = 1:N,  nodeName = c(t$tip.label, t$node.label),  pValue = rep (NA, N))
rownames(d) = d$nodeName
d[allSig, 'pValue'] = 'sig'

# Attach the data frame to the tree
p = ggtree(t, layout = 'circular', branch.length = 'none') %<+% d

#Plot Tree. Note because we only ran 100 permutations for the tutorial, and since I only gave you a subset of the data, there are only a few CTUs that were conserved across all animals.
p + geom_point2(aes(subset = (pValue == "sig"), color = pValue), size = 2)

To recap, we have annotated the bacterial phylogenetic tree with our CTUs. These CTUs are clades of microbes which are sigificantly more conserved to mammals than expected by chance.

Labeling Nodes: Annotate CTUs conserved across host taxonomic groups

Next, we might also want to visualize those CTUs which were conserved across mammalian taxonomic groups. Remember, we had three different taxonomic orders of animals. These taxonomic orders were used for our our group permutation test. Lets visualize where we find these on the bacterial tree.

# Read in the groups permutation test, and label the columns
gps = read.table('groupPtest.txt_stats.txt', row.names = 1)
colnames(gps) = c('obsPrev', 'meanShuffPrev', 'sdShuffPrev', 'Z', 'p')
head(gps)

##                       obsPrev meanShuffPrev sdShuffPrev        Z      p
## node5256_Carnivora      0.333         0.727      0.1640  -2.4000 0.9920
## node6295_Artiodactyla   0.667         0.297      0.1920   1.9200 0.0272
## node3545_Primates       0.222         0.231      0.1500  -0.0591 0.5240
## node5119_Primates       0.444         0.316      0.1600   0.8050 0.2100
## node149_Carnivora       0.167         0.655      0.1720  -2.8400 0.9980
## node13_Carnivora        0.167         0.972      0.0669 -12.0000 1.0000

## a couple of helper functions so that we can split our rownames into nodeID and animal taxonomy
# Return the roup of the rownames
returnGroup = function(v){  
  group = strsplit(x = v, split = '_', 2)[[1]][2]  
  return(group)
}

# Return the nodeName of the rownames
returnNode = function(v){  
  name = strsplit(x = v, split = "_", 2)[[1]][1]
}

# Now, split our rownames, to make a group and node column
gps$node = sapply(rownames(gps), returnNode)
gps$group = sapply(rownames(gps), returnGroup)
head(gps)

##                       obsPrev meanShuffPrev sdShuffPrev        Z      p
## node5256_Carnivora      0.333         0.727      0.1640  -2.4000 0.9920
## node6295_Artiodactyla   0.667         0.297      0.1920   1.9200 0.0272
## node3545_Primates       0.222         0.231      0.1500  -0.0591 0.5240
## node5119_Primates       0.444         0.316      0.1600   0.8050 0.2100
## node149_Carnivora       0.167         0.655      0.1720  -2.8400 0.9980
## node13_Carnivora        0.167         0.972      0.0669 -12.0000 1.0000
##                           node        group
## node5256_Carnivora    node5256    Carnivora
## node6295_Artiodactyla node6295 Artiodactyla
## node3545_Primates     node3545     Primates
## node5119_Primates     node5119     Primates
## node149_Carnivora      node149    Carnivora
## node13_Carnivora        node13    Carnivora

# Get the significant nodes for each group. Note for the purposes of this tutorial I've chosen a p value of 0.05. For your own data I would consider doing multiple hypothesis testing correction.
car = gps[which(gps$p<0.05 & gps$group == "Carnivora"), "node"]
pri = gps[which(gps$p<0.05 & gps$group == "Primates"), "node"]
art = gps[which(gps$p<0.05 & gps$group == "Artiodactyla"), "node"]

# Make a data frame to attach to the tree.
d = data.frame(node = 1:N, nodeName = c(t$tip.label, t$node.label),
               car = rep (NA, N), 
               pri = rep(NA, N),
               art = rep(NA, N))
rownames(d) = d$nodeName

# Assign a value to significant nodes. Each CTU will be asigned a three letter string
d[car, "car" ] = "car"
d[pri, "pri"] = "pri"
d[art, "art"] = "art"

#Plot the tree and color the significant nodes
p = ggtree(t, layout = "circular", branch.length = "none") %<+% d
p = p + geom_point2(aes(subset = (car == "car"), color = car), size = 4, alpha = .7)
p = p + geom_point2(aes(subset = (pri == "pri"), color = pri), size = 4, alpha = .7)
p = p + geom_point2(aes(subset = (art == "art"), color = art), size = 4, alpha = .7)
p

To recap here: we have annotated the bacterial phylogenetic tree with CTUs which are conserved to animal orders.

Annotating Branches: Color each branch by clade coreness value

So far we have annotated nodes using ggtree. Next, here is an example on how we can annotate a phylogeny edges with information. For each node we have an associated coreness value (how many samples that node occurs in). Lets color each branch with the coreness of the clade. A phylogeny heat map if you will. Note that if you have an OTU table output from ClaaTU you will have to remove the “#OTU ID” in front of it to be able to read the table in correctly to R.

#!!!! MAKE SURE YOU GET RID OF THE "#OTU ID" in front of the OTU table

# We will need a coreness value for each otu as well, so lets read that in and calculate OTU coreness.
otu = read.table("otu.txt", header = T)
otu[otu>0] = 1/ncol(otu)
otu = apply(otu, 1, sum)

# now, make a dataframe with coreness values for each clade and tip to attach to the tree
d = data.frame(nodeName = c(t$tip.label, t$node.label), core = c(otu[t$tip.label], pVals[t$node.label, "obsCore"]))

# now, plot the tree.
p = ggtree(t, layout = "circular", branch.length = "none") %<+% d
p = p + aes(color = core) + scale_color_gradientn(name = "Clade Coreness", colours = rainbow(7)) 
p = p + theme(legend.position = "right")
p