Lokesh, Christa, et al. have identified a set of 12 putative RDases from Archaeal taxa. Up until this point:
1. I demonstrated that one of these, from a member of the Thorarchaeota, nests phylogenetically among bacterial RDases, both within a large repository / functional annotation system (EggNOG) and within a set of RDases with proven function.
2. Lokesh et al. have demonstrated that the 12 Archaeal sequences appear closely related to one another (vs. a set of ~300 bacterial RDases).
The goal of the present project is to incorporate all 12 Archaeal RDases into the larger project in order to produce informative phylogenetic trees.
Frank has suggested that epoxyqueosine reductase (QueG) might be a close relative of the putative RDases; this might suggest an alternative role for these iron-sulfur proteins. Lokesh has proposed using QueG as an outgroup for tree construction. First we should inspect an alignment of QueG representatives with the relevant RDase sequences (both the repository sequences and the 12 Archaeal sequences) to see whether this is appropriate.
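KEGG maintains a queG KO, K18979. 1808 protein sequences in the UniProt database are linked to this KO: https://www.genome.jp/dbget-bin/get_linkdb?-t+uniprot+ko:K18979
1794 of these could be downloaded from UniProt; they were clustered with CD-HIT at 50% identity. The table below summarizes the resulting clusters by size (number of sequences per cluster); the X10.149 through X300.999999 columns appear to bin the clusters by sequence length.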
##          Size No..seq No..clstr X10.149 X150.199 X200.299 X300.999999
## 1           1      73        73       1        0        9          63
## 2         2-9     172        49       1        0        3          45
## 3       10-49     189         8       0        0        0           8
## 4      50-299    1360         9       0        0        0           9
## 5     300-999       0         0       0        0        0           0
## 6   1000-4999       0         0       0        0        0           0
## 7 5000-999999       0         0       0        0        0           0
## 8       Total    1794       139       2        0       12         125
The set is dominated by singleton clusters (73 of the 139 clusters), but as there is no way to rapidly remove these I will just move forward in the hope that they cluster appropriately in the tree.
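If the singletons do need to be removed later, the CD-HIT `.clstr` file can be parsed directly. The following is a minimal base-R sketch, assuming the usual `.clstr` layout (a `>Cluster N` line followed by one line per member, with the representative marked by a trailing `*`). The function names and the example sequence ids are hypothetical and are not part of the analysis above.

```r
# Parse CD-HIT .clstr lines into one row per sequence.
# Assumption: member lines look like "0\t350aa, >seqA... *".
parse_clstr <- function(lines) {
  is_header <- grepl("^>Cluster", lines)
  members <- lines[!is_header]
  data.frame(
    # running count of ">Cluster" lines, shifted to CD-HIT's zero-based ids
    cluster = cumsum(is_header)[!is_header] - 1L,
    # keep the text between ", >" and the "..." that ends the id
    id = sub("\\.\\.\\..*$", "", sub("^.*, >", "", members)),
    representative = grepl("\\*[[:space:]]*$", members),
    stringsAsFactors = FALSE
  )
}

# Ids of representatives whose cluster holds at least two sequences.
non_singleton_reps <- function(clstr) {
  sizes <- table(clstr$cluster)
  keep <- as.integer(names(sizes)[as.vector(sizes) >= 2])
  clstr$id[clstr$representative & clstr$cluster %in% keep]
}

# Hypothetical example input (not real data)
example_lines <- c(
  ">Cluster 0",
  "0\t350aa, >seqA... *",
  "1\t340aa, >seqB... at 85.00%",
  ">Cluster 1",
  "0\t300aa, >seqC... *"
)
clstr <- parse_clstr(example_lines)
reps <- non_singleton_reps(clstr)
```

The resulting id vector could then be used to subset the representative FASTA, for example by matching against `names()` of an `AAStringSet`.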
Combine all three sets (EggNOG families, 12 Archaeal sequences, and ~120 QueG representatives) into a single multi-FASTA, align via MAFFT --auto, and visualize via Pixel.
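The combining step can be sketched in base R as follows. This is only an illustration: `combine_fasta` is a hypothetical helper, the file names in the comments are placeholders, and the MAFFT call is left commented out because it assumes `mafft` is installed and on the PATH.

```r
# Concatenate several FASTA files into one multi-FASTA and stop if any
# sequence id (header text up to the first whitespace) occurs twice.
combine_fasta <- function(paths, out_path) {
  lines <- unlist(lapply(paths, readLines), use.names = FALSE)
  lines <- lines[nzchar(trimws(lines))]   # drop blank lines between files
  headers <- sub("^>", "", lines[startsWith(lines, ">")])
  ids <- sub("[[:space:]].*$", "", headers)
  dup <- unique(ids[duplicated(ids)])
  if (length(dup) > 0) {
    stop("duplicate sequence ids: ", paste(dup, collapse = ", "))
  }
  writeLines(lines, out_path)
  invisible(length(ids))                  # number of sequences written
}

# Placeholder file names, for illustration only:
# combine_fasta(c("eggnog.fasta", "archaeal.fasta", "queG.reps.fasta"),
#               "combined.fasta")
# MAFFT writes the alignment to stdout, so it is redirected to a file:
# system2("mafft", c("--auto", "combined.fasta"), stdout = "combined.aln.fasta")
```

Checking for duplicate ids at this stage is a design choice: duplicated names tend to cause trouble later, at the tree-building step, where they are harder to trace back to a source file.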
The Archaeal sequences are at the very top, then RDases, then EggNOG iron sulfur proteins, then QueG reps (with the long names)
It is not clear how appropriate this is, but it appears to not be TERRIBLE.
Degap the alignment using the trimAl -gappyout algorithm:
Looks like a fairly nice alignment (QueG on the bottom)
Then build a tree via RAxML -m PROTGAMMALG (after some sequence header cleanup to remove illegal characters).
The initial tree (single iteration, no bootstrapping) is consistent with the impression from the alignment: there is no clear relationship among the three larger groups, i.e. the three groups are equally dissimilar to one another. Overall, however, this is reasonable. I will now work towards a proper bootstrapped tree.
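The header cleanup might look like the sketch below. It assumes that the characters RAxML rejects in taxon names are whitespace, `:`, `,`, `;`, parentheses, square brackets, and single quotes. `clean_taxon_names` and the example headers are hypothetical and do not reproduce the cleanup that was actually applied.

```r
# Replace characters assumed to be illegal in RAxML taxon names with "_",
# collapse runs of "_", and make the resulting names unique.
clean_taxon_names <- function(x) {
  cleaned <- gsub("[\\s:,;()\\[\\]']", "_", x, perl = TRUE)
  cleaned <- gsub("_+", "_", cleaned)
  cleaned <- sub("_$", "", cleaned)
  make.unique(cleaned, sep = "_")   # duplicates get _1, _2, ... appended
}

# Hypothetical headers (not from the real data set)
raw <- c("seq1 protein (strain: K12)", "seq2 [Fe-S] 'putative'", "seq3", "seq3")
cleaned <- clean_taxon_names(raw)
```

Keeping a two-column table of `raw` and `cleaned` names makes it possible to map tree tip labels back to the original headers when building the metadata files.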
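First, the FASTA headers should be extracted, cleaned, and turned into metadata files; a FASTA with cleaner headers should also be written.
library(Biostrings)
# Read the combined protein multi-FASTA (Archaeal RDases, EggNOG families, QueG representatives)
RDase.QueG <- readAAStringSet("/home/rmurdoch/Dropbox/Projects/Thorarchaeota/queG/RDases.and.QueG.and.Arch.fasta")
# Split the set into a header vector and a plain sequence vector
RDase.QueG.names <- names(RDase.QueG)
RDase.QueG.seqs <- unname(as.character(RDase.QueG))
RDase.QueG.df <- data.frame(RDase.QueG.names, RDase.QueG.seqs)
# Export the headers for manual metadata curation
write.csv(RDase.QueG.names, "names.all.txt")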
Metadata files were created using Excel; some misplacement is evident:
Some of the smaller misplacements will likely be corrected in the true ML tree. However, the misplaced QueG sequences likely reflect misannotations embedded in the KEGG database.
… STILL BOOTSTRAPPING
First, consider the possibility of adding substrate annotations to the large tree
Currently I only have UniProt headers, while the large tree uses gene names. Strip off the headers and retrieve the gene names from UniProt.
library(Biostrings)
# Read the RDases with proven function (UniProt-derived, headers already cleaned)
RDase.proven <- readAAStringSet("/home/rmurdoch/Dropbox/Projects/Thorarchaeota/queG/uniprot_v2/proven_rdh_uniprot_cleaned_header_v2.fasta")
# Split the set into a header vector and a plain sequence vector
RDase.proven.names <- names(RDase.proven)
RDase.proven.seqs <- unname(as.character(RDase.proven))
RDase.proven.df <- data.frame(RDase.proven.names, RDase.proven.seqs)
# Export the headers so gene names can be looked up in UniProt
write.csv(RDase.proven.names, "/home/rmurdoch/Dropbox/Projects/Thorarchaeota/queG/uniprot_v2/names.proven.txt")
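Where the headers still follow the standard UniProt FASTA layout (`db|accession|entry_name description OS=... OX=... GN=... PE=... SV=...`), the gene names can be pulled out of the `GN=` field directly. This is a sketch under that assumption; the cleaned headers in the file above may no longer match this layout. `parse_uniprot_header` and the example entries are hypothetical.

```r
# Extract accession, gene name (GN=) and organism (OS=) from UniProt-style
# FASTA headers; fields that are absent are returned as NA.
parse_uniprot_header <- function(h) {
  h <- sub("^>", "", h)
  field <- function(key) {
    # capture the text after "KEY=" up to the next " XX=" tag or end of line
    pattern <- paste0("^.*\\b", key, "=(.*?)(\\s+[A-Z]{2}=.*)?$")
    has_key <- grepl(paste0("\\b", key, "="), h, perl = TRUE)
    ifelse(has_key, sub(pattern, "\\1", h, perl = TRUE), NA_character_)
  }
  has_acc <- grepl("^[a-z]{2}\\|[^|]+\\|", h, perl = TRUE)
  data.frame(
    accession = ifelse(has_acc,
                       sub("^[a-z]{2}\\|([^|]+)\\|.*$", "\\1", h, perl = TRUE),
                       NA_character_),
    gene = field("GN"),
    organism = field("OS"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical headers (made-up accessions, not real UniProt entries)
hdr <- c(
  ">tr|X00001|X00001_EXAMP Example reductive dehalogenase OS=Example organism OX=0 GN=rdhA PE=4 SV=1",
  ">tr|X00002|X00002_EXAMP Uncharacterized protein OS=Example organism OX=0 PE=4 SV=1"
)
meta <- parse_uniprot_header(hdr)
```

Entries without a `GN=` field come back as `NA` and would still need a manual lookup.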
Unfortunately, the majority of these are NOT in the EggNOG sets
Align and trim->
The purple/pink seqs in the middle are the novel RDases, grey is QueG, yellow is archaeal 4Fe-4S, blue/green are bacterial 4Fe-4S proteins/RDases.
The bootstrap supports are very poor at most of the basal nodes, which makes the tree very difficult to discuss. I will attempt to use an updated bootstrap support method, the transfer bootstrap expectation (TBE), described here https://www.nature.com/articles/s41586-018-0043-0#Fig1 and implemented here: https://booster.pasteur.fr/new/
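For reference, the support measure reported by BOOSTER can be summarized as below. This is my reading of the definition in the linked paper rather than anything derived in this notebook, so the notation should be treated as an assumption: $b$ is a branch of the reference tree, $p$ is the number of taxa on the smaller side of $b$, $\delta(b, b^{*})$ is the number of taxa that must be moved to turn the bipartition of $b$ into that of $b^{*}$, and the average runs over the bootstrap trees $T^{*}$.

```latex
\phi(b, T^{*}) = \min_{b^{*} \in T^{*}} \delta(b, b^{*}),
\qquad
\mathrm{TBE}(b) = 1 - \frac{\overline{\phi(b, T^{*})}}{p - 1}
```

A bootstrap tree in which the branch is almost recovered still contributes support, instead of counting as a complete miss as in the classical bootstrap proportion. This may be why the values on the deep nodes rise.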
This makes the bootstrap values much more reasonable:
After some work in iTOL and PowerPoint, I have worked up three variations of the large EggNOG RDase tree.
Maximum likelihood tree of the Archaeal RDases described in the present study (indicated by red names and branch lines), orthologous groups of 4Fe-4S ferredoxin and RDase proteins as defined by the EggNOG 4.5.1 database, and representative epoxyqueosine reductase (QueG) sequences derived from KEGG ortholog family K18979. (For Images 2 and 3) Inset displays taxonomic origin of the most closely related protein sequences in the EggNOG 4.5.1 database. (For Image 1 only) Inset displays taxonomic origin and names of the most closely related proteins in the EggNOG 4.5.1 database. Bootstrap values above 0.7 are indicated by circles, the size of which is proportional to the value.
BAJATHOR (insert appropriate gene locus/name here) was queried against the EggNOG 4.5.1 database (http://eggnogdb.embl.de)(REF). Proteins contained in NOGs with similarity to the query at an e-value below 1e-40 were downloaded and retained for analysis; these included orthologous groups 08ZS0, 07ZRG, 07Z2F, 07XD4, 06K29, 0KU9K, and arCOG02740. The KEGG ortholog family for epoxyqueosine reductase (K18979), which was selected as an outgroup for phylogenetic analysis, is linked to 1808 proteins indexed in the UniProt database, of which 1794 could be retrieved (REF,REF). These 1794 proteins were clustered at 50% identity using CD-HIT (REF), yielding a representative set of 139 proteins.
The Archaeal RDases, the proteins contained in the EggNOG orthologous groups, and the QueG representatives were aligned using MAFFT (REF), trimmed using trimAl via the automated gappyout algorithm (REF), and subjected to tree building using the RAxML combined rapid bootstrap and maximum likelihood search (-f a). RAxML analysis was conducted using the CIPRES Science Gateway server (REF). Bootstrap values were determined using the transfer bootstrap expectation (TBE) algorithm (REF: https://www.nature.com/articles/s41586-018-0043-0#Fig1), conducted using the BOOSTER web server (https://booster.pasteur.fr/new/).