building methanogen ref tree

Custom tree described in this document:

Placing N20-repressed methanogens into methanogen reference tree

General strategy: Place our 16S gene fragments identified as coming from methanogenic taxa into a 16S methanogen reference tree. The purpose is to demonstrate to what extent we have demonstrated repression of methanogens during exposure to N20 1. gather 16S genes from demonstrated methnogenic taxa 2. cluster using CD-hit; not sure what level, to build a reference library 3. place fragments into the reference tree using RaxML 4. visualize using IToL and Illustrator

Gathering methanogen reference sequences

Source literature: https://www.nature.com/articles/s41579-018-0136-7 https://link.springer.com/referenceworkentry/10.1007%2F978-3-540-77587-4_42

Since the nature paper is more recent, I will focus on it. They cover all methane metabolism, not just methanogenesis. They present this tree

In this tree, Most of the highlighted taxa are at least potential methanogens, i.e. they have members with mcrA. The different colors indicate differnet putative metabolic systems, generally similar though. They say that this is build on all publically available Archaeal genomes. It appears they are labelling Orders and Families somewhat interchangably, perhaps even phyla. In the end, it is not overly important. I will gather all 16S Archaeal genes from IMG, build a reference set, build a proper tree, and label accordingly.

Gather sequences from IMG

There are 1852 Archaeal genomes in IMG system which are publically accessible. Many of these are MAGs and SAGs and thus lack any detected 16S genes. I gathered all 16S genes from these genomes, 1421 in total, and saved.

Two downloads result from this, a gene table, which tracks genomed ID and taxonomy

For the tree, I will need to keep clean track of taxonomy. This can be accomplished by using turning the gene table into several metadata files for IToL, colored ranges are easiest.

Processing the Archaeal SSU set

First the 1421 genes are filtered based on the IMG site listing of DNA sequence length. 16S genes are often subject to truncation/partial assembly in all sorts of genomes due to sequence conservation. What is the “true” size distribution of Archaeal 16S genes? Visual inspections of assembly context of 16S genes in the full gene set reveals that around 1300 bp, there is a mix of genes on the periphery/edge of a contig, which can indicate truncation, and intra-contig location, indicating full-length.

How long are Archaeal 16S genes really??

Investigating the Silva SSU database reveals two things: 1. most sequences in the archives are partial 2. the longest genes are ~1400bp or slightly less.

To build a strong tree, a 1300bp size cutoff is reasonable.

Inspecting Methanosarcina in Silva also shows ~1420bp full genes.

apply size filter to IMG SSU set.

using a 1300 bp lower limit leaves 806 sequences.

filter out fully

Clustering the Archaeal SSU set down to a representative set

Clustering executed at 99%, 97%, 94%, 90%, 80%. The first three levels represent approximate clustering by strain, species, and genus.

These lead to:

cluster.percent = c(99,97,94,90,80)
n = c(385,291,237,173,79)
cluster.plot = data.frame(cluster.percent,n)
cluster.plot

##   cluster.percent   n
## 1              99 385
## 2              97 291
## 3              94 237
## 4              90 173
## 5              80  79

The 97% set is reasonable, no justification for going down to strain level.

cleaning the fasta headers

the fasta headers in the cluster set are still long and will cause problems downstream in the pipeline. Simplest thing to do is to remove everything after the first space. This leaves the IMG gene ID, which can then be populated more reasonably.

header cleaning can be handled easily using a terminal-based “sed” command:

$ sed -i ’s/.*$//g’ SSU_97

I will then import this, import the gene data, build better fasta headers, and export

library(Biostrings)

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## Loading required package: XVector

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Biostrings':
## 
##     collapse, intersect, setdiff, setequal, union

## The following object is masked from 'package:XVector':
## 
##     slice

## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union

## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union

## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

refset <- Biostrings::readDNAStringSet("trees/SSU_ref_seqs/SSU_97")
refset.names <- names(refset)
refset.seqs <- paste(refset)
refset.l <- width(refset)
refset.template <- data.frame(refset.names, refset.seqs, stringsAsFactors = F)
colnames(refset.template) <- c("ID", "seq")
gene.data <- read.csv("trees/SSU_ref_seqs/SSU_full_set_genedata.csv", sep = "\t", stringsAsFactors = F)
gene.data <- data.frame(as.character(gene.data$gene_oid), gene.data$Genome.Name, stringsAsFactors = F)
colnames(gene.data) <- c("ID", "genome.name")
new.names.build <- dplyr::left_join(refset.template, gene.data, by = "ID")

#cleaning up, removing special characters
new.names.build$name <- paste(new.names.build$ID, new.names.build$genome.name, sep = "_")
new.names.build$name <- gsub(" ", "_", new.names.build$name)
new.names.build$name <- gsub("\\.", "_", new.names.build$name)
new.names.build$name <- gsub(",", "_", new.names.build$name)
new.names.build$name <- gsub(";", "_", new.names.build$name)
new.names.build$name <- gsub(":", "_", new.names.build$name)
new.names.build$name <- gsub("\\(", "", new.names.build$name)
new.names.build$name <- gsub("\\)", "", new.names.build$name)
cleaned.ref.set <- data.frame(new.names.build$name, new.names.build$seq)
colnames(cleaned.ref.set) <- c("name","seq")

I now have a data table which will be turned into a multifasta (cleaned.ref.set) AND a data frame which will be used to build IToL formatted metadata files down the road, which have both the new seq header and the old IMG ID, which can link to the genome database and provide taxonomic metadata if required.

Writing the new fasta

#custom function for writing data table to fasta:
writeFasta<-function(data, filename){
  fastaLines = c()
  for (rowNum in 1:nrow(data)){
    fastaLines = c(fastaLines, as.character(paste(">", data[rowNum,"name"], sep = "")))
    fastaLines = c(fastaLines,as.character(data[rowNum,"seq"]))
  }
  fileConn<-file(filename)
  writeLines(fastaLines, fileConn)
  close(fileConn)
}

writeFasta(cleaned.ref.set, "trees/SSU_ref_seqs/SSU_97_clean")

Aligning, degapping and building an ML tree

$ mafft –auto SSU_97_clean > SSU_97.align $ /home/rmurdoch/Tools/trimAl/source/trimal -in SSU_97.align -out SSU_97.trim -strict # the gaps that remain might bias the tree, keep an eye out ###$ iqtree -s SSU_97.trim -st DNA -m TEST –bb 1000 –alrt 1000#this runs a substitution model test and applies it, but does it bootstrap? ##IW-tree is incompatible with the pipeline, switching to RaxML $ raxmlHPC -f a -x 12345 -p 12345 -# autoMRE -m GTRGAMMAI[X]
-s SSU_97.trim -n SSU_97_reftree

This yields a fairly reasonable tree, although there are some odd placements; A group of Methanosarcinales sequences places as an outgroup. The Heimdall phyla do not group together. Overall I don’t care!

I need to make metadata files which will help to discriminate now; for each sequence ID, link to; 1. group 2. phylum 3. class 4. order 5. family

That just means 1. doing a left-join to “new.names.build” from the gene data table to add in genome ID 2. left-joining in taxonomic info from genome data table

library(dplyr)
gene.data <- gene.data <- read.csv("trees/SSU_ref_seqs/SSU_full_set_genedata.csv", sep = "\t", stringsAsFactors = F)
gene.to.genome <- data.frame(as.character(gene.data$gene_oid), as.character(gene.data$Genome.ID))
colnames(gene.to.genome) <- c("ID","Genome.ID")
gene.to.leaf <- new.names.build[c(1,4)]
leaf.to.genome <- dplyr::left_join(gene.to.leaf, gene.to.genome, by="ID")

## Warning: Column `ID` joining character vector and factor, coercing into
## character vector

head(leaf.to.genome)

##           ID
## 1 2264947523
## 2 2265091464
## 3 2265093822
## 4 2524426154
## 5 2524425879
## 6 2612179386
##                                                                         name
## 1 2264947523_Aenigmarchaeota_archaeon_SCGC_AAA011-F07_contamination_screened
## 2    2265091464_Aigarchaeota_archaeon_SCGC_AAA471-G05_contamination_screened
## 3    2265093822_Aigarchaeota_archaeon_JGI_0000106-J15_contamination_screened
## 4                                                    2524426154_NAG2_mk4-r01
## 5                                                    2524425879_NAG2_mk4-r01
## 6                                           2612179386_Metallosphaera_sedula
##    Genome.ID
## 1 2264867064
## 2 2264867224
## 3 2264867229
## 4 2524023200
## 5 2524023200
## 6 2609460254

Then to this information, I do step 2, join in the taxonomy using the Genome.ID

library(dplyr)
genome.data <- read.csv("trees/SSU_ref_seqs/IMG_genome_data.csv", sep = "\t", stringsAsFactors = F)
genome.data <- data.frame(as.character(genome.data$taxon_oid), genome.data$Phylum, genome.data$Class, genome.data$Order, genome.data$Family)
colnames(genome.data) <- c("Genome.ID", "Phylum", "Class", "Order", "Family")
leaf.to.tax <- dplyr::left_join(leaf.to.genome, genome.data, by="Genome.ID")
write.csv(leaf.to.tax, "trees/SSU_ref_seqs/leaf.to.taxonomy.csv")
head(leaf.to.tax)

##           ID
## 1 2264947523
## 2 2265091464
## 3 2265093822
## 4 2524426154
## 5 2524425879
## 6 2612179386
##                                                                         name
## 1 2264947523_Aenigmarchaeota_archaeon_SCGC_AAA011-F07_contamination_screened
## 2    2265091464_Aigarchaeota_archaeon_SCGC_AAA471-G05_contamination_screened
## 3    2265093822_Aigarchaeota_archaeon_JGI_0000106-J15_contamination_screened
## 4                                                    2524426154_NAG2_mk4-r01
## 5                                                    2524425879_NAG2_mk4-r01
## 6                                           2612179386_Metallosphaera_sedula
##    Genome.ID          Phylum        Class        Order        Family
## 1 2264867064 Aenigmarchaeota unclassified unclassified  unclassified
## 2 2264867224    Aigarchaeota unclassified unclassified  unclassified
## 3 2264867229    Aigarchaeota unclassified unclassified  unclassified
## 4 2524023200    unclassified unclassified unclassified  unclassified
## 5 2524023200    unclassified unclassified unclassified  unclassified
## 6 2609460254   Crenarchaeota Thermoprotei Sulfolobales Sulfolobaceae

At this stage it is best to move into Excel and take care of metadata files myself.

Initial reference Tree

Exported from IToL, colors and text added in Illustrator

Overall topology is same as the Nature pub, however, the Asgardia (purple) and a subset of Euryarchaeota (on the far left) are mis-placed. I have no solution for this, aside from simply removing the troublesome sequences.

How to add sub-clade information?

These could be added as colored ranges and the larger divisions changed to a neutral color. The fragments which are added to the tree will be given a brigt red color and thick line, hopefully that will make them more visible.

Extracting relevant fragments from the 16S amplicon library dataset

This can be done elegantly with Phyloseq package, but I will just use the full OTU table that I build via a custom R script in every Qiime2 run.

These will be fed down the placement pipeline that I used for the NosZ ROCKER fragment visuazlation that I used ages ago. I can split them by treatment… or something, this will be node-based metadata ultimately… but first I have to place the fragments.

All archaeal sequences are subsetted, placed into a csv, and a fasta file is built:

KUB.arch <- read.csv("trees/KUB_archaea/all.KUB.Archaea.csv")
KUB.arch <- KUB.arch[c(2,ncol(KUB.arch))]
colnames(KUB.arch) <- c("name","seq")
writeFasta(KUB.arch, "trees/Fragment_placement/KUB_arch.fa")

Adding Archaeal SSU gene fragments to the Arhcaeal reference tree

This is accomplished using a mafft / raxml -f v / pplacer pipeline described in Kostas’s ROCker paper.

$ mafft –addfragments trees/Fragment_placement/KUB_arch.fa –reorder –thread -1 trees/SSU_ref_seqs/SSU_97.trim > trees/Fragment_placement/SSU97_addfrag.align $ raxmlHPC -f v -m -m GTRGAMMAI -p 23 -t trees/SSU_ref_seqs/IQTree_jan30/SSU_97.trim.treefile -s trees/Fragment_placement/SSU97_addfrag.align -n trees/Fragment_placement/SSU97_fragree

These two commands place the fragments into the reference tree, yielding a jplace file. The jplace can be visualized immediately via IToL, however there is no way to incorporate metadata by this route (reads per sequence, reads per treatment)

This jplace file contains the reference tree and placement locations and volumes. This can ultimately be parsed out using a pplacer pipeline or by manually parsing the jplace, but this is adequate for now.

This was exported from IToL mostly unmodifed and then annoated using Illustrator. Blue-shaded clades represent major groups of methanogens (proven or speculated). A handful of minor methanogenic taxa were excluded (methanofastidiosa and versta… something). These groups are very under-represented in the databases, yet if there were there and are actually their own Phyla/Classes, they should have yielded reference sequences following clustering. Red circles represent placement numbers and locations of the Archaeal fragments from the methanogeneic reactor project; all were repressed by N2O addition.