Hadesarchaeota

1 Hadesarchaeota in Saudi Arabian inland sabkha (hypersaline sandflat)
2 Gathering reference 16S set
3 reference tree construction
- 3.1 Building reference tree
4 extracting Hadesarchaeota from the data set
5 adding new fragments
- 5.1 resulting tree
- 5.2 What is L’Atalante brine?
6 Conclusions

1 Hadesarchaeota in Saudi Arabian inland sabkha (hypersaline sandflat)

The Acetothermia and Hadesarchaeaeota phyla were occasionally present at notable abundances in the inland sabkha libraries. Hadesarchaeota is a particularly understudied archaeal phylum.

Populations of Hadesarchaeota have been found in various parts of the world, such as a South African gold mine, the White Oak River estuary and Yellowstone National Park. These are all extreme habitats, with temperatures of about 70 °C and highly alkaline conditions. The organisms are strongly related to other anaerobic bacteria, can oxidize CO to CO2, liberate molecular hydrogen and seem to represent an important taxonomical unit of cosmopolitan subsurface archaea. (from https://link.springer.com/article/10.1007/s42452-019-0874-9)

This seems to be the first reported example of Hadesarchaeota in an aerobic, surface habitat. In the paper quoted above, traces of Hades were detected in mine tailings at the surface, presumably having been transported there by mining activities.

This Nature Micro paper (https://www.nature.com/articles/nmicrobiol20162) examines this group and proposes the name “Hadesarchaeota” to replace SAGMEG (South-African Gold Mine Miscellaneous Euryarchaeal Group). The name “hades” reflects the apparent restriction to subsurface systems.

Genome content analyses have suggested a wide range of metabolic lifestyle options, including hydrogen production, CO oxidation (an anaerobic process), DNRA (dissimilatory nitrate reduction to ammonium), sugar fermentation, and use of (at least) the pentose phosphate and Wood-Ljungdahl (WL) pathways for carbon fixation. The primary lifestyle might be CO oxidation coupled to nitrate reduction. It is presumed that this is a strictly anaerobic organism, although there is no mention of, for example, oxygen stress amelioration potential.

The tree in that paper both details the types of environments that Hades have been isolated from and offers a large set of 16S genes with which to build a new phylogenetic tree. The main thrust of this rmd document is to record methods for building the Hadesarchaeota tree and populating it with our new gene fragments. Unfortunately, no list or table of these genes is provided, so it will have to be constructed manually.

2 Gathering reference 16S set

2.1 make custom database of ref seqs

library(DT) #provides datatable()
refset <- read.csv("nature.hades.ref.set.csv", stringsAsFactors = FALSE)
datatable(refset)

2.2 download

Retrieve using Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez)? No, this simply does not work; most sequence/FASTA download options fail, I assume because these are rRNA rather than CDS records.

Try the SILVA system instead: convert the SILVA accession numbers into a csv list and bulk search. No problem!

library(Biostrings)
source("/home/robert/Dropbox/R.source/RWM.functions.R")

ref.seqs <- readDNAStringSet("hades.ref.set.fasta")
name.orig <- names(ref.seqs)

#simple function to split by "." and take the first element
spcSplit <- function(x){
  strsplit(x,"\\.")[[1]][1]
}

name <- unlist(lapply(name.orig, spcSplit))

seq <- paste(ref.seqs)
ref.seq.table <- data.frame(name, seq)

#we will concatenate seq description onto the accession number so we know what we are looking at
library(dplyr)

colnames(refset) <- c("description","name","environment")

refset <- left_join(refset,ref.seq.table,by="name")

colnames(refset)[2] <- "NCBI"
refset$name <- paste(refset$NCBI,refset$description)

#replace spaces with underscores so that we don't lose descriptions during the pipeline
library(stringr)
refset$name <- str_replace_all(refset$name," ","_")

datatable(refset)

#write the fasta file; writeFasta function automatically grabs "name" and "seq" columns
hades.ref.fasta <- writeFasta(refset, "hades.ref.fasta")
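The writeFasta function used above comes from the sourced RWM.functions.R file, which is not shown in this document. For readers without that file, here is a minimal base-R sketch of what such a helper might look like. The name write_fasta_sketch and the toy data are hypothetical, and the real function may behave differently.

```r
# Hypothetical stand-in for the sourced writeFasta() helper; the real code is not shown here.
# Assumes a data frame with "name" and "seq" columns, as the comment above describes.
write_fasta_sketch <- function(df, path) {
  stopifnot(all(c("name", "seq") %in% colnames(df)))
  # interleave header and sequence lines: >name1, seq1, >name2, seq2, ...
  lines <- as.vector(rbind(paste0(">", df$name), as.character(df$seq)))
  writeLines(lines, path)
  invisible(path)
}

# toy data (made up), written to a temporary file and read back
demo <- data.frame(name = c("acc1_demo", "acc2_demo"),
                   seq = c("ACGT", "GGCC"),
                   stringsAsFactors = FALSE)
out <- write_fasta_sketch(demo, tempfile(fileext = ".fasta"))
readLines(out)
```

Because refset was built with left_join, any accession that found no match in the downloaded FASTA will carry NA in seq, so it may be worth checking sum(is.na(refset$seq)) before writing.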

2.3 Check sequence lengths

An important decision point here is whether to (1) align the new sequences together with the ref set, or (2) build the full tree with the ref set alone and then place the new sequences through an EPA (evolutionary placement algorithm) pipeline.

refset$seq <- as.character(refset$seq)
lens <- unlist(lapply(refset$seq,nchar))
hist(lens)

We can see that the minimum seq length is ~700, so the ref set is long enough to support a tree on its own; therefore use option 2. However, if the placements turn out to be very deep in the tree, then we should consider trying option 1 as well.
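The length check can also be made explicit rather than read off the histogram. The sketch below uses made-up toy sequences in place of refset$seq, and the 700 cutoff is only an assumed working threshold taken from the observation above, not an established rule.

```r
# toy sequences standing in for refset$seq (made-up data, not the real reference set)
toy.seqs <- c(strrep("A", 1450), strrep("C", 980), strrep("G", 702))

# nchar() is vectorised, so no lapply()/unlist() is needed
lens <- nchar(toy.seqs)
summary(lens)

# assumed cutoff: references shorter than this would argue against a stand-alone reference tree
min.len <- 700
short <- lens < min.len
sum(short)
```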

3 reference tree construction

3.1 Building reference tree

Use a standard pipeline, modified for DNA rather than protein (https://github.com/rwmurdoch/project.scripts/blob/master/MAFFT_trimal_raxML.bootstrap.sh)

align with auto settings (a small set like this will be run with L-INS-i: probably most accurate, very slow)
remove major gaps
run a proper bootstrap tree

mkdir align.tree

mafft --auto hades.ref.fasta > align.tree/hades.ref.align.fasta

trimal \
-in align.tree/hades.ref.align.fasta \
-out align.tree/hades.ref.align.trim.fasta \
-gappyout

cd align.tree

raxmlHPC -f a -m GTRGAMMA -p 23 -x 23 -T 6 -s hades.ref.align.trim.fasta -n hades.ref.tree -# autoMRE

Everything was run in a conda environment.

4 extracting Hadesarchaeota from the data set

Import the full OTU table, then narrow it down to all Hades phylum sequences to get a sense of count and distribution.

OTU.all <- read.csv("../combined.sabkha.analysis/exports/OTU_table.csv",stringsAsFactors = F)
#simple function to split by ";" and take the second element
spcSplit <- function(x){
  strsplit(x,";")[[1]][2]
}
OTU.all$phylum <- unlist(lapply(OTU.all$Taxon,spcSplit))

#make a subtable that only has Hadesarchaeaeota
OTU.hades <- subset(OTU.all, phylum == "D_1__Hadesarchaeaeota")
OTU.hades$sum <- rowSums(OTU.hades[c(3:31)])

#remove anything with fewer than 10 occurrences
OTU.hades.rel <- subset(OTU.hades, sum >9)

datatable(OTU.hades.rel)
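To illustrate what the phylum extraction above does, here is a small sketch on made-up taxon strings written in the same "D_<rank>__<name>" style as the label used in the subset call. The strings themselves are assumptions for illustration. The second ";"-separated field is taken as the phylum, and vapply is used here only as a stricter alternative to unlist(lapply(...)).

```r
# made-up taxon strings in the same style as the Taxon column (illustrative only)
taxa <- c("D_0__Archaea;D_1__Hadesarchaeaeota;D_2__uncultured",
          "D_0__Bacteria;D_1__Acetothermia",
          "Unassigned")

# take the second ";"-separated field; vapply() guarantees one character value per row
phylum <- vapply(strsplit(taxa, ";"), function(parts) parts[2], character(1))
phylum

# rows with no second field come back as NA
table(phylum == "D_1__Hadesarchaeaeota", useNA = "ifany")
```

Rows without a phylum field come back as NA, and subset() silently drops them, so unassigned OTUs never reach the Hades subtable.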

There are 7 of these fragments. Check the frequency distribution:

hist(OTU.hades.rel$sum, breaks =  10)

We can make a weighted placement by simply replicating each read according to its count and then feeding the result into the EPA.

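inflate <- OTU.hades.rel
#append (count - 1) extra copies of each row so that every fragment appears "sum" times
#seq_len() avoids the 1:0 trap if a count of 1 ever gets through the filter
for (x in seq_len(nrow(OTU.hades.rel))) {
  for (y in seq_len(inflate$sum[x] - 1)) {
    inflate <- rbind(inflate, inflate[x, ])
  }
}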
#make sure that the simple fragment list is written also
colnames(OTU.hades.rel)[c(2,35)] <- c("name","seq")
writeFasta(OTU.hades.rel,"hades.frags.10.fasta")
write.csv(OTU.hades.rel, "OTU.hades.rel.csv")

#prepare for write fasta
colnames(inflate)[c(2,35)] <-c("name.temp","seq")

#make sure that seq names are unique
inflate$row <- 1:nrow(inflate)
inflate$name <- paste(inflate$name.temp,inflate$row,sep = "_")

writeFasta(inflate,"hades.frag.inflate.fasta")
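The replication step above grows the table one rbind at a time, which is fine for 7 fragments but gets slow for larger sets. A vectorised alternative is sketched below on a made-up toy table. The column names and counts are assumptions, not the real OTU data.

```r
# toy table standing in for OTU.hades.rel (made-up names, sequences and counts)
toy <- data.frame(name = c("otuA", "otuB"),
                  seq = c("ACGT", "TTGA"),
                  sum = c(3, 2),
                  stringsAsFactors = FALSE)

# repeat each row index "sum" times, then pull all of those rows in one step
idx <- rep(seq_len(nrow(toy)), times = toy$sum)
inflate.toy <- toy[idx, ]

# make the sequence names unique, as in the loop version
inflate.toy$name <- paste(inflate.toy$name, seq_len(nrow(inflate.toy)), sep = "_")
rownames(inflate.toy) <- NULL
inflate.toy
```

The only difference from the loop is row order: copies of each fragment sit together instead of being appended at the end. This should not matter for placement, since every name is unique either way.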

5 adding new fragments

MAFFT --addfragments (https://mafft.cbrc.jp/alignment/software/addsequences.html)
RAxML EPA

mafft --addfragments \
hades.frag.inflate.fasta \
--reorder --thread -1 \
align.tree/hades.ref.align.trim.fasta > align.tree/hades.ref.and.frags.align.fasta 

cd align.tree

raxmlHPC -f v -m GTRGAMMAI -p 23 \
--epa-keep-placements=4 \
-t RAxML_bipartitions.hades.ref.tree \
-s hades.ref.and.frags.align.fasta \
-n hades.frag.tree

The initial tree resulted in a large proportion of uncertain placements, which manifest as single fragments being placed all over the tree. The second attempt with the --epa-keep-placements=1 setting yielded…

You can also use the --epa-prob-threshold setting to tune which placements are kept.

5.1 resulting tree

The small red dot is the most likely placement. No information is available about the “closest” relative, but the other relative has an excellent paper describing its origin.

AY226363 comes from a 2005 study (https://science.sciencemag.org/content/307/5706/121.full). It was the most dominant 16S clone in a hypersaline anoxic basin called “L’Atalante”, and is part of candidate division MSBL1 (Mediterranean Sea Brine Lakes group 1). The authors state that “Our results indicate that microbial metabolism can proceed at significant levels in some of the most extreme terrestrial hyper-saline environments and lend further support to the possibility of extraterrestrial life.”

5.2 What is L’Atalante brine?

Based on info from the Science paper, this was:
- 30-60 m deep
- 13.6 °C
- 91% of Archaea belonged to the MSBL1 division
- notable methanogenesis (16.93 µM / day)
- hypersaline, mostly due to NaCl, although all major ions were at least 10x more concentrated than in seawater. This seems to reflect concentrated seawater rather than mineral influence.

The most notable difference is that these brines were anoxic, whereas we were looking at aerobic systems.

6 Conclusions

The Hadesarchaeaeota detected in both systems are most closely related to MSBL1. This group has previously been observed in anaerobic salty groundwaters, but not in aerobic, dry, high-temperature settings.