nif_reconcile_pipeline

ABOUT

This project aims at understanding the evolutionary processes that lead to the prescence and abscence of the potential to fix nitrogen in bacterial symbionts of lucinid clams. For this, a pipeline was developed to reconcile the symbiont species tree with their respective nitrogenase gene tree.

SPECIES TREE

Because the concept of species is controversial for bacteria and common methods used for sexual reproducing organisms can not be applied to them, we had to decide on a method that allowed us to get to an equivalent of a species tree. We decided on applying the GTDB workflow, that uses genome trees inferred from an aligned concatenated set of 120 single copy marker proteins for Bacteria.

Dereplicate Metagenome Assembled Genomes (MAGs) with >99% similarity to simplify tree and make it easier to read and interpret

Selection of symbiont MAGs from sampled lucinids and from free-living relatives were stored in input directory
Run dRep v3.2.2

dRep dereplicate drep_bins -g input/*.fa -sa 0.99 -p 32

Obtain alignment of single copy marker proteins from dereplicated MAGs

Run GTDB-Tk classify v1.3.0

gtdbtk classify_wf --genome_dir drep_bins --out_dir gtdb_classify -x fa --cpus 48

Infer the tree

Run IQ-TREE v2.1.1 to obtain a maximum likelihood tree from the GTDB protein alignment

mkdir spp_tree
cd spp_tree
iqtree2 -m TEST -s ../gtdb_classify/gtdbtk.bac120.user_msa.fasta  -B 10000 --alrt 10000 -T AUTO --threads-max 64
cd ..

Manually rooted the tree at outgroup MAG in FigTree and exported in Newick format: spp_tree/gtdbtk.bac120.user_msa.newick

GENE TREES

We obtained gene trees from the nitrogenase operon (nifHDKT) and from recA (single copy core gene), wich was used as a control. First we annotated the MAGs to identify and extract the genes of interest.

Annotate MAGs

Run DRAM v1.2.2 on dereplicated MAGs

mkdir gene_trees
cd gene_trees
conda activate DRAM
DRAM.py annotate -i 'drep_bins/*.fa' -o annotation
conda deactivate

Nitrogenase tree

We built the tree from a concatenated alignment of four genes (nifH, nifD, nifK and nifT) to obtain a stronger phylogenetic signal. Before concatenating the alignments, we compared the trees of the individual genes. We checked that the branches with high support were shared between the gene trees, which would indicate that the genes share an evolutionary history.

Extract nif genes from DRAM annotation and save the extractions for each gene separately

mkdir nif
grep -A1 'nitrogenase' annotation/genes.fna | grep -A1 'NifH' | sed '/^--/d' >> nif/nifH_extraction.fna
grep -A1 'NifT' annotation/genes.fna | sed '/^--/d' >> nif/nifT_extraction_drep.fna
grep -A1 'nitrogenase' annotation/genes.fna | grep -A1 'alpha' | sed '/^--/d' >> nif/nifD_extraction.fna
grep -A1 'nitrogenase' annotation/genes.fna | grep -A1 'beta' | sed '/^--/d' >> nif/nifK_extraction.fna

Rename headers

Create a text file with the names of the MAGs

ls ../drep_bins/*.fa | sed 's/.fa//g' >> drep_mags.txt

Rename headers from the original DRAM output, to the name of the MAGs they were annotated from with FastA.rename.pl from Enveomics

mkdir nif/rename
cd nif/rename
cp ../../drep_mags.txt .

for i in H D K T;
do mv ../nif${i}_extraction.fna . ;
grep '>' nif${i}_extraction.fna | sed 's/>//g' | sort >> headers_nif${i}.txt ;
cat drep_mags.txt | while read line; do grep -o $line nif${i}_extraction.fna; done > drep_mags_nif${i}.txt;
awk '{print $0"_nif"}' drep_mags_nif${i}.txt >> new_headers_nif${i}.txt;
paste -d '\t' <(sort headers_nif${i}.txt) <(sort new_headers_nif${i}.txt) >> rename_nif${i}.txt ;
FastA.rename.pl rename_nif${i}.txt nif${i}_extraction.fna > nif${i}_renamed.fna;
done

Remove duplicated sequences coming from the same MAG with SeqKit v2.3.0 and align with FSA v1.15.9

mkdir ../alig
cd ../align

for i in H D K T;
do cp ../rename/nif${i}_renamed.fna;
seqkit rmdup -n nif${i}_renamed.fna -o nif${i}_renamed_uniq.fna;
time fsa --fast nif${i}_renamed_uniq.fna > nif${i}_fsa_renamed_uniq.fa;
done

Concatenate alignments with SeqKit v2.3.0

seqkit concat nif*_fsa_renamed_uniq.fa > nifHDKT_fsa_renamed_uniq.fna

Infer trees for individual genes

run IQ-TREE v2.1.1 on each alignment

mkdir ../tree
cd ../tree

for i in H D K T ;
do cp ../align/nif${i}_fsa_renamed_uniq.fa . ;
iqtree2 -m TEST -s nif${i}_fsa_renamed_uniq.fa -B 1000 --alrt 10000 -T AUTO --threads-max 24 ;
done

checked visually in FigTree that branches with high support values match between the four trees, meaning that the four genes probably have been transfered all together.

Infer tree from concatenated alignment of nifHDKT

Run IQ-TREE v2.1.1 on concatenated alignment

cp ../align/nifHDKT_fsa_renamed_uniq.fna
iqtree2 -m TEST -s nifHDKT_fsa_renamed_uniq.fna -B 1000 --alrt 10000 -T AUTO --threads-max 24

Manually rooted the tree ( nifHDKT_fsa_renamed_uniq.fna.treefile ) at outgroup MAG in FigTree and exported in Newick format: gene_trees/nif/tree/nifHDKT_fsa_renamed_uniq.newick

RecA tree

Extract recA gene from DRAM annotation and save the extraction

cd ../..
mkdir reca
grep -A1 'RecA' annotation/genes.fna | grep -A1 'recombination' | sed '/^--/d' >> reca/reca_extraction.fna

Rename headers

Rename headers from the original DRAM output, to the name of the MAGs they were annotated from with FastA.rename.pl from Enveomics

mkdir reca/rename
cd reca/rename

cp ../../drep_mags.txt .

grep '>' reca_extraction.fna | sed 's/^>//' | sort >> headers_reca.txt
cat drep_mags.txt | while read line; do grep -o $line reca_extraction.fna; done > drep_mags_reca.txt
awk '{print $0"_recA"}' drep_mags_reca.txt >> new_headers_reca.txt
paste -d '\t' <(sort headers_reca.txt) <(sort new_headers_reca.txt) >> rename_reca.txt
FastA.rename.pl rename_reca.txt reca_extraction.fna > reca_extraction_renamed.fna

Remove duplicated sequences coming from the same MAG with SeqKit v2.3.0 and align with FSA v1.15.9

mkdir ../alig
cd ../align

seqkit rmdup reca_extraction_renamed.fna -o reca_extraction_renamed_uniq.fna
time fsa --fast ../rename/reca_extraction_renamed_uniq.fna > reca_fsa_renamed_uniq.fna

Infer tree from alignment of recA

Run IQ-TREE v2.1.1 on alignment

mkdir ../tree
cd ../tree
cp ../align/reca_fsa_renamed_uniq.fa .
iqtree2 -m TEST -s reca_fsa_renamed_uniq.fna -B 1000 --alrt 10000 -T AUTO --threads-max 24

Manually rooted the tree ( reca_fsa_renamed_uniq.fna.treefile ) at outgroup MAG in FigTree and exported in Newick format: gene_trees/reca/tree/reca_fsa_renamed_uniq.newick

TREE RECONCILIATION

Reconciliation of species tree and gene trees with GeneRax v2.1.0

Create directory and copy rooted species tree, rooted gene trees and gene alignments to it

cd ../../..
mkdir generax

cp spp_tree/gtdbtk.bac120.user_msa.newick .

cp gene_trees/nif/align/nifHDKT_fsa_renamed_uniq.fna .
cp gene_trees/nif/tree/nifHDKT_fsa_renamed_uniq.fna.newick .

cp gene_trees/reca/align/reca_fsa_renamed_uniq.fna .
cp gene_trees/reca/tree/reca_fsa_renamed_uniq.fna.newick .

Write mapping files, containing the headers of the gene alignments with the name of the species in the species tree (they are the same except for the name of the gene at the end, since they were renamed before aligning)

grep '>' nifHDKT_fsa_renamed_uniq.fna | sed 's/>//g' >> headers_gene_nif.fa
grep '>' nifHDKT_fsa_renamed_uniq.fna | sed 's/>//g' | sed 's/_nif//g' >> headers_spp_nif.fa

grep '>' reca_fsa_renamed_uniq.fna | sed 's/>//g' >> headers_gene_reca.fa
grep '>' reca_fsa_renamed_uniq.fna | sed 's/>//g' | sed 's/_recA//g' >> headers_spp_reca.fa

paste -d ':' headers_spp_nif.fa headers_gene_nif.fa >> mapping_nif.link
paste -d ':' headers_spp_reca.fa headers_gene_reca.fa >> mapping_reca.link

Write the families.txt file, which contains the path to the needed files

echo '[FAMILIES]
- NifHDKT
starting_gene_tree = ./nifHDKT_fsa_renamed_uniq.fna.newick
alignment = ./nifHDKT_fsa_renamed_uniq.fna
mapping = ./mapping_nif.link
subst_model = GTR+G+I
- RecA
starting_gene_tree = ./reca_fsa_renamed_uniq.fna.treefile.newick
alignment = ./reca_fsa_renamed_uniq.fna
mapping = ./mapping_reca.link
subst_model = GTR+G+I' >> families.txt

Run GeneRax with gene tree correction strategy (SPR)

generax --families families.txt --species-tree gtdbtk.bac120.user_msa.fasta.treefile.newick  --rec-model UndatedDTL --prefix correction --strategy SPR

Transform the GeneRax reconciliation to svg with ThirdKind for visualization

cd correction/reconciliations
ls *.xml >> list_files
thirdkind -f list_files -m -o reca_nif_reconciliated_polish2.0.svg