NifHunt: Soil metagenome N fixation analysis from BioBead greenhouse study

Renee Davis

Overview

Nitrogen (N) fixation is a vital microbial process in soils. N-fixing microbes fix atmospheric N2 to a plant-usable form (NH3) via nitrogenase enzyme complex formed from genes nifH, nifD, nifK.

I analyzed the metagenomes from 2 (of 48) soil samples that were used a greenhouse study prior to BioBead treatment.

F1B - bulk field soil
F2R - rhizosphere (root space) soil

The goal is to identify the microbial communities, analyze taxa, screen for nitrogenase genes, and establish a pipeline.

BioBead project

Challenges in soil biodiversity and inoculation

Recent meta analyses indicate poor overall biofertilizer performance,
- Questions about plant pathogenicity and invasive potential if scaled.
How do soil inoculations interact with microbes in soils in which they’re applied?
Can we scale soil ecotechnologies like BioBead? Where is the sweet spot between ecology and engineering?
Welcome to my PhD 😀

Methods

Confirm data integrity with checksums
Quality control: Fast QC and MultiQC
Preprocessing: trimming via Trimmomatic, merging with PEAR
Assembly via Megahit
Phylogenetic tree construction via Megan
Taxonomic identification via taxize
Prodigal for gene prediction (protein coding genes)
HMMER for nif gene screening

Trimming

Trimming removes adapters and low-quality bases.

java -jar /home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 4
  F1B-KM40_R1_001.fastq.gz F1B-KM40_R2_001.fastq.gz \
  F1B-KM40_trimmed_R1_paired.fastq.gz F1B-KM40_trimmed_R1_unpaired.fastq.gz \
  F1B-KM40_trimmed_R2_paired.fastq.gz F1B-KM40_trimmed_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50  #keeps bps equal or above 50

This produces unpaired and paired file outputs. Paired reads are those which both forward and reverse survived trimming. These are used for downstream analysis like merging and assembly.

Merging

These are R1/R2 (forward and reverse reads) and will have to be merged. This is the last component of pre-processing as we work towards metagenome assembly.

/home/shared/fastp-v0.24.0/fastp \
  -i F1B-KM40_trimmed_R1_paired.fastq.gz \
  -I F1B-KM40_trimmed_R2_paired.fastq.gz \
  --merge \
  --merged_out F1B-KM40_merged.fastq.gz \ 
  
/home/shared/fastp-v0.24.0/fastp \
  -i F2R-KM41_trimmed_R1_paired.fastq.gz \
  -I F2R-KM41_trimmed_R2_paired.fastq.gz \
  --merge \
  --merged_out F2R-KM41_merged.fastq.gz

Assembly

To assemble the metagenome files, MEGAHIT was used.

./megahit 
  -r ../F1B-KM40_merged.fastq.gz  #specifying input file
  -o megahit_F1B_KM40_out   #output directory
  --min-contig-len 500  #over 500 bps
  -t 8  #8 threads

Like other steps, this is done with both files.

./megahit \
  -r ../F2R-KM41_merged.fastq.gz \
  -o megahit_F2R_KM41_out \
  --min-contig-len 500 \
  -t 8

Phylogenetic tree construction via MEGAN

MEGAN is a program that graphs the taxonomical content from metagenomic data sets.

First we query Diamond BLAST.

    /home/shared/diamond-2.1.8 blastx \
    --db /home/shared/16TB_HDD_01/sam/databases/blastdbs/ncbi-nr-20250429.dmnd \
    --query /home/shared/16TB_HDD_01/fish546/renee/F2R-KM41_merged.fastq.gz \
    --out output/sr-blastx.daa \
    --outfmt 100 \
    --top 5 \
    --block-size 15.0 \
    --index-chunks 4 \
    --threads 32

Phylogenetic tree construction via MEGAN

Then we take the .daa files and put it through the meganizer. This is where we get the map.

    /home/shared/megan-6.24.20/tools/daa-meganizer \
    --in output/sr-blastx.daa \
    --threads 32 \
    --mapDB /home/shared/16TB_HDD_01/sr320/github/renee-myco/project/data/megan-map-Feb2022.db

Phylogenetic tree construction via MEGAN

The final step is to convert the .daa files to .rma ones to use in the MEGAN GUI.

/home/shared/megan-6.24.20/tools/daa2rma \
--in output/sr-blastx.daa \
--mapDB /home/shared/16TB_HDD_01/sr320/github/renee-myco/project/data/megan-map-Feb2022.db \
--out output/sr-blastx-meganized.rma6 \
--threads 40 2>&1 | tee --append output/daa2rma_log.txt

Rhizosphere soil phylogenetic tree

Rhizosphere soil top hits

Table 1: Rhizosphere soil top hits (out of 1002 total taxa)
Organism	Classification
Acidobacteria bacterium	Bacteria
Alphaproteobacteria bacterium	Bacteria
Betaproteobacteria bacterium	Bacteria
Verrucomicrobia bacterium	Bacteria
Actinobacteria bacterium	Bacteria

Bulk soil top hits

Table 2: Bulk soil top hits (out of 933 total taxa)
Organism	Classification
Acidobacteria bacterium	Bacteria
Alphaproteobacteria bacterium	Bacteria
Betaproteobacteria bacterium	Bacteria
Deltaproteobacteria bacterium	Bacteria
Verrucomicrobia bacterium	Bacteria

Shared vs unique taxa

Step 1: generate a manageable species list in a .txt file.

/home/shared/megan-6.24.20/tools/rma2info \
    -i output/sr-blastx-f2r.meganized.rma6 \
    -s Taxonomy \
    -o output/f2r_species.txt

This gives the numeric taxon IDs, not species names. We can see species ID in the Megan GUI, but it’s not being extracted with rma2info. We can use R to read the files, map NCBI Taxonomy IDs to names, and summarize counts by species or genus.

Cleanup and taxonomy in R

library(taxize)
library(dplyr)
library(readr)
library(tidyr)

Sys.setenv(ENTREZ_KEY = "d228d1d99f7f95247415e561fdb7a2bdd408") #tells taxize your NCBI ke

map_taxa <- function(file_path, label) {
  df <- read_tsv(file_path, col_names = c("read_id", "tax_id")) %>%
    mutate(
      source = label,
      tax_id = as.character(tax_id) 
    )
  
  unique_ids <- unique(df$tax_id)
  
  # Query NCBI with taxize
  tax_lookup <- taxize::classification(unique_ids, db = "ncbi", messages = FALSE)
  
  # create data frame
  tax_df <- bind_rows(lapply(tax_lookup, function(x) {
    if (inherits(x, "data.frame")) {
      x %>% 
        filter(rank %in% c("genus", "species")) %>%
        select(rank, name) %>%
        pivot_wider(names_from = rank, values_from = name)
    } else {
      tibble(genus = NA_character_, species = NA_character_)
    }
  }), .id = "tax_id")
  
  # Coerce tax_id to character again to match df
  tax_df <- tax_df %>% mutate(tax_id = as.character(tax_id))
  
  # Join and return selected columns
  df %>%
    left_join(tax_df, by = "tax_id") %>%
    select(source, tax_id, genus, species)
}

# load in files
f1b <- map_taxa("output/f1b_species.txt", "f1b")
f2r <- map_taxa("output/f2r_species.txt", "f2r")

# Combine/summarize
combined <- bind_rows(f1b, f2r) %>%
  group_by(source, genus, species) %>%
  summarise(n_reads = n(), .groups = "drop") %>%
  arrange(desc(n_reads))

# pivot for easier comparison
pivoted <- combined %>%
  unite("taxon", genus, species, sep = " ", na.rm = TRUE) %>%
  pivot_wider(names_from = source, values_from = n_reads, values_fill = 0)

print(pivoted)

# save results to .csv for sharing
write_csv(pivoted, "output/taxon_comparison.csv")

Venn diagram of shared and unique taxa

Scatterplot of shared and unique taxa

nif gene analysis

Rubio LM, Ludden PW. 2008. Biosynthesis of the Iron-Molybdenum Cofactor of Nitrogenase. doi:10.1146/annurev.micro.62.081307.162737.

get HMM profiles of 3 nif genes from NCBI
use Prodigal to generate protein coding genes from final assembled contigs
use HMMER to screen files for HMMs

Prodigal to parse protein-coding genes

Nitrogenase is protein coding, so we need to filter out protein coding genes from the assembled metagenomes to avoid false negatives and unreliable data.

/home/shared/16TB_HDD_01/fish546/renee/Prodigal/prodigal \
  -i /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/final.contigs.fa \
  -a /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa \
  -d /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_nucleotides.fna \
  -o /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/prodigal_output.gbk \
  -f gbk \
  -p meta

Now we repeat with the second file (F1B, bulk soil). Now we have 2 protein fasta files that we can use for a HMMER search.

HMMER for nif gene screening

Before the search starts, the HMM files need to be uploaded to directory and indexed.

hmmsearch --tblout output/nif_results/F2R/nifH.tbl \
  --cpu 4 \
  /home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifH.hmm \
  /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa

hmmsearch --tblout output//nif_results/F2R/nifD.tbl \
  --cpu 4 \
  /home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifD.hmm \
  /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa

hmmsearch --tblout output//nif_results/F2R/nifK.tbl \
  --cpu 4 \
  /home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifK.hmm \
  /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa

nifH.tbl preview

#                                                                --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession     E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- --------------------  ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
k141_2395_1          -          nifH                 TIGR01287.1   1.7e-08   33.9   0.1   2.5e-08   33.4   0.1   1.2   1   0   0   1   1   1   1 # 81 # 650 # 1 # ID=7481_1;partial=01;start_type=GTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.630
k141_11338_2         -          nifH                 TIGR01287.1   2.2e-07   30.3   0.0   2.5e-07   30.1   0.0   1.1   1   0   0   1   1   1   1 # 273 # 569 # 1 # ID=9864_2;partial=01;start_type=GTG;rbs_motif=GGA/GAG/AGG;rbs_spacer=11-12bp;gc_cont=0.687
k141_19658_2

nif genes in metagenome samples

Conclusions

nifH detected but no nifD or nifK
- could be present but not assembled
- nifH is a marker gene and more abundant; also subject to HGT, operons or dormant
- could screen for alternative nitrogenase genes: vnfDHK, anfDHK
- HMM profiles may miss divergent versions of nifD/K if they’re evolutionarily distant from canonical forms
slightly more nif genes in the rhizosphere than bulk soil
detected taxa/top hits are known diazotrophs

Next steps: scaling up

Trim on Trimmomatic
Merge with fastp
Assemble with Megahit
Phylogeny with MEGAN
Functional analysis with Prodigal and HMMER
Get into PCAs/other analyses TBD

Run on Hyak and utilizing bash scripting to manage the large number of files.