NifHunt: Soil metagenome N fixation analysis from BioBead greenhouse study

Introduction

Soil microbial ecology is complex and its composition depends on several factors. Soil microbial communities are found in higher concentrations in the rhizosphere, or root space. Plant roots provide carbohydrates and niches for microbes to flourish. Some of these microbes fix nitrogen, solubilize phosphorous, and perform other necessary activities. Additionally, mutualistic mycorrhizal fungi provide structures for microbial residence and travel. Together, these underground dynamics boost soil carbon and nitrogen, further supporting plant growth and driving these biophilic cycles. However, anthropogenic disturbances like tilling and synthetic fertilizer use create soil dysbiosis and harm this ecological balance.

BioBead is an innovative biodegradable hydrogel format containing microbial consortia, addressing multiple factors in nutrient cycling and soil health. Greenhouse studies with BioBead have demonstrated improved growth for tomatoes and wheat plants, increases in soil organic matter, and promotion of closed nutrient cycling.

For this study, I analyzed the metagenomes from 2 (out of 48) soil samples from one of the recent BioBead greenhouse trials with wheat plants (led by Korena Mafune PhD). These soils were sourced from a nearby farm, with half of the treatments utilizing sterilized soil, and half with untreated soil. The reason for this split design is to understand how soil inoculations perform with native soil microbes compared to sterile soil environments. Analyzing these metagenomes will provide a sense of resident microbial (fungal and bacterial) communities prior to treatment BioBead, conventional fertilizer, and other interventions.

  1. F1B - bulk field soil
  2. F2R - rhizosphere (root space) soil

The soil samples were collected from each treatment at the end of experiment. The DNA was extracted with Qiagen DNAEasy PowerSoil kits. Shotgun metagenome sequencing was provided by Azenta Life Sciences (GENEWIZ).

The purpose of this exercise is to understand shared vs unique taxa in each sample, and screen for the presence of nitrogenase genes (specifically, the 3 subunits. nifH, nifD, nifK that form the nitrogenase complex). These are important, as organisms that fix nitrogen to the soil reduce the need for synthetic fertilizer input. I expect there to be a more diverse microbial community in the rhizosphere soil sample sample over the bulk one. Additionally, I expect to find more nif gene abundance in the rhizosphere sample. If this pipeline is effective, it will be scaled to analyze the entire 48 metagenome data set.

Data upload

Raw fastq files were uploaded to Raven server using rsync.

rsync -av --progress \
    /Volumes/CRUZER16GB/*.fastq.gz \
    rendavis@raven.fish.washington.edu:/home/shared/16TB_HDD_01/fish546/renee/

Data integrity and quality control

Checksums

I ran a MD5 check on the 4 metagenome fastq files.

find . -type f -exec md5sum-lite {} \; >> output.txt  #found all files in the working directory, calculated its MD5 checksum, save to an output txt file

This prints the results to a text file. Here is the output:

b24ae50750964ef8d01a8808051c73c7  ./F2R-KM41_R2_001.fastq.gz
a48a5ad0bc4ce07f641c0c647fe0dbaf  ./F1B-KM40_R2_001.fastq.gz
a8be224ab035ea641ccd5a8c9d46c024  ./F2R-KM41_R1_001.fastq.gz
e34585c824e429f54ff34d919719f1c4  ./F1B-KM40_R1_001.fastq.gz
mv output.txt md5checksums_metagenomes_2025.txt #renamed the file to something more descriptive

To validate file integrity later, the following code could be run to ensure a match.

md5sum -c md5checksums_metagenomes_2025.txt

Quality control

FastQC

I have been told that QC has been run on these, but will verify this using FastQC.

# run FastQC on the fastq files
/home/shared/FastQC-0.12.1/fastqc -t 36 -o output /home/shared/16TB_HDD_01/fish546/renee/*.fastq.gz
Analysis complete for F2R-KM41_R1_001.fastq.gz
Analysis complete for F2R-KM41_R2_001.fastq.gz
Analysis complete for F1B-KM40_R1_001.fastq.gz
Analysis complete for F1B-KM40_R2_001.fastq.gz

MultiQC

eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
conda activate
which multiqc

cd output

multiqc .

The html report was saved in the output folder.

Preprocessing

We have to preprocess raw data to enable clean, high-quality reads ready for assembly. This involves 2 steps: trimming and merging forward and reverse reads.

Trimming using Trimmomatic

I need to trim raw data to remove adapters and low-quality bases. It’s important to do this prior to merging, as merging reads relies on the overlapping region between forward and reverse reads. If I try to merge before trimming, I could end up with incorrect overlaps, low quality merges, and related issues.

For this I used Trimmomatic, which required installation.

scp -r ~/Downloads/Trimmomatic-0.39 rendavis@raven.fish.washington.edu:/home/shared/16TB_HDD_01/fish546/renee  #from the local terminal

Now I will run it to trim adapters and low quality bases. I’ll stick with defaults of greater than or equal to 50 base pairs (bp), and 3 leading and trailing bp. I’ll do this in 2 steps with F1B R1/R2 and then F2R R1/R2.

cd /home/shared/16TB_HDD_01/fish546/renee
java -jar /home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/trimmomatic-0.39.jar PE #PE = Paired-End mode
-threads 4 \
  F1B-KM40_R1_001.fastq.gz F1B-KM40_R2_001.fastq.gz \
  F1B-KM40_trimmed_R1_paired.fastq.gz F1B-KM40_trimmed_R1_unpaired.fastq.gz \
  F1B-KM40_trimmed_R2_paired.fastq.gz F1B-KM40_trimmed_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50  #keeps bps equal or above 50 
Quality encoding detected as phred33
Input Read Pairs: 31944124 
Both Surviving: 29010263 (90.82%) 
Forward Only Surviving: 1518244 (4.75%) 
Reverse Only Surviving: 921543 (2.88%) 
Dropped: 494074 (1.55%)
TrimmomaticPE: Completed successfully

These are good numbers. Generally, anything over 80% both surviving is considered good. And 1.5% is a very low drop rate.

cd /home/shared/16TB_HDD_01/fish546/renee/raw-data  
java -jar /home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 4 \
  F2R-KM41_R1_001.fastq.gz F2R-KM41_R2_001.fastq.gz \
  F2R-KM41_trimmed_R1_paired.fastq.gz F2R-KM41_trimmed_R1_unpaired.fastq.gz \
  F2R-KM41_trimmed_R2_paired.fastq.gz F2R-KM41_trimmed_R2_unpaired.fastq.gz \
  ILLUMINACLIP:/home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10  #had to use the whole file path here
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50  #keeps bps equal or above 50 
Quality encoding detected as phred33
Input Read Pairs: 27015389 
Both Surviving: 26807327 (99.23%) 
Forward Only Surviving: 208057 (0.77%) 
Reverse Only Surviving: 0 (0.00%) 
Dropped: 5 (0.00%)
TrimmomaticPE: Completed successfully

Trimmomatic produces unpaired and paired file outputs. Paired reads are those which both forward and reverse survived trimming. Unpaired reads indicate where only one of the pair survived (the other was discarded due to low quality or short length).

I’ll use the paired files for most downstream steps like merging and assembly, but will keep the unpaired files around just in case we need to embark on deeper analysis.

Merging forward and reverse reads using fastp

These are R1/R2 (forward and reverse reads) and will have to be merged. This is the last component of pre-processing as we work towards metagenome assembly. For this we will use fastp which is already installed on Raven.

cd /home/shared/16TB_HDD_01/fish546/renee

# Merge F1B-KM40
/home/shared/fastp-v0.24.0/fastp \
  -i F1B-KM40_trimmed_R1_paired.fastq.gz \
  -I F1B-KM40_trimmed_R2_paired.fastq.gz \
  --merge \
  --merged_out F1B-KM40_merged.fastq.gz \
  
# Merge F2R-KM41
/home/shared/fastp-v0.24.0/fastp \
  -i F2R-KM41_trimmed_R1_paired.fastq.gz \
  -I F2R-KM41_trimmed_R2_paired.fastq.gz \
  --merge \
  --merged_out F2R-KM41_merged.fastq.gz \
Read1 before filtering:
total reads: 29010263
total bases: 4140614008
Q20 bases: 4050447746(97.8224%)
Q30 bases: 3845738339(92.8785%)

Read2 before filtering:
total reads: 29010263
total bases: 4154017583
Q20 bases: 4065704832(97.874%)
Q30 bases: 3866115895(93.0693%)

Merged and filtered:
total reads: 7474936
total bases: 1650421008
Q20 bases: 1630687369(98.8043%)
Q30 bases: 1578354375(95.6334%)

Filtering result:
reads passed filter: 58018466
reads failed due to low quality: 0
reads failed due to too many N: 2060
reads failed due to too short: 0
reads with adapter trimmed: 635320
bases trimmed due to adapters: 11744183
reads corrected by overlap analysis: 2160886
bases corrected by overlap analysis: 2804599

Duplication rate: 7.13807%

Insert size peak (evaluated by paired-end reads): 268

Read pairs merged: 7474936
% of original read pairs: 25.7665%
% in reads after filtering: 100%


JSON report: fastp.json
HTML report: fastp.html

/home/shared/fastp-v0.24.0/fastp -i F1B-KM40_trimmed_R1_paired.fastq.gz -I F1B-KM40_trimmed_R2_paired.fastq.gz --merge --merged_out F1B-KM40_merged.fastq.gz 
fastp v0.24.0, time used: 178 seconds
Read1 before filtering:
total reads: 26807327
total bases: 4021099050
Q20 bases: 3808188117(94.7052%)
Q30 bases: 3500492825(87.0531%)

Read2 before filtering:
total reads: 26807327
total bases: 4021099050
Q20 bases: 3793251312(94.3337%)
Q30 bases: 3477357474(86.4778%)

Merged and filtered:
total reads: 6724249
total bases: 1525795473
Q20 bases: 1489384743(97.6137%)
Q30 bases: 1418180834(92.947%)

Filtering result:
reads passed filter: 51640358
reads failed due to low quality: 1969720
reads failed due to too many N: 4576
reads failed due to too short: 0
reads with adapter trimmed: 170138
bases trimmed due to adapters: 2613154
reads corrected by overlap analysis: 2552667
bases corrected by overlap analysis: 5056209

Duplication rate: 6.67899%

Insert size peak (evaluated by paired-end reads): 268

Read pairs merged: 6724249
% of original read pairs: 25.0836%
% in reads after filtering: 100%

Around 25% of original read pairs got merged, which appears to be typical for paired-end read merging.

As a final step, we’ll preview the file and do a quality check on the merged files using FastQC.

# run FastQC on the fastq files
/home/shared/FastQC-0.12.1/fastqc -t 36 -o output /home/shared/16TB_HDD_01/fish546/renee/*merged.fastq.gz

Here are the HTML report results for F1B and F2R. F1B has higher per base sequence quality, but both look good.

Assembly using MEGAHIT

Installation

First I clone the repository into our working folder.

cd /home/shared/16TB_HDD_01/fish546/renee/
git clone https://github.com/voutcn/megahit.git #this brings the installation files into our working directory
cd megahit #navigating to the folder

Then I have to compile MEGAHIT.

cd /home/shared/16TB_HDD_01/fish546/renee/
mkdir build
cd build
cmake ../megahit
make

./megahit --version # tests the installation; MEGAHIT v1.2.9

Assembly

Once this is done, I ran each file independently to produce two assemblies.

cd /home/shared/16TB_HDD_01/fish546/renee/build  #pointing wd to the binary files
./megahit -r ../F1B-KM40_merged.fastq.gz  #specifying input file
-o megahit_F1B_KM40_out   #output directory
--min-contig-len 500  #over 500 bps
-t 8  #8 threads
cd /home/shared/16TB_HDD_01/fish546/renee/build

./megahit \
  -r ../F2R-KM41_merged.fastq.gz \
  -o megahit_F2R_KM41_out \
  --min-contig-len 500 \
  -t 8

File check

I want to see a text file when we check files using the head command.

#check F1B_KM40
cd /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F1B_KM40_out
head final.contigs.fa
#checkF2R_KM41
cd /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out
head final.contigs.fa

The output should like this:

>k141_8240 flag=1 multi=2.0000 len=511
GGGGAACGCGCGCGGACGCGGGAAGGCCGCCTTCAGGTCGAAC

> = start of a header.For the meta data: k141 refers to the k-mer size (k=141) used when assembling that contig. 8240 is the ID or serial number assigned to that contig. It’s the name of the contig. Then there is the internal MEGAHIT flag for the contig, usually about whether the contig is trusted/high-confidence (flag=1 usually means it’s “normal” and usable). Then there is the multiplicity (or coverage estimate), which refers to how many times this contig appeared in the assembly graph. Finally, we have the contig length.

Phylogenetic trees using Megan

F2R - rhizosphere soil

MEGAN is a program that graphs the taxonomical content from metagenomic data sets. To start, I’ll query Diamond BLAST with the rhizospheric soil metagenome fastq file.

    /home/shared/diamond-2.1.8 blastx \
    --db /home/shared/16TB_HDD_01/sam/databases/blastdbs/ncbi-nr-20250429.dmnd \
    --query /home/shared/16TB_HDD_01/fish546/renee/F2R-KM41_merged.fastq.gz \
    --out data/sr-blastx-f2r.daa \
    --outfmt 100 \
    --top 5 \
    --block-size 15.0 \
    --index-chunks 4 \
    --threads 32
    /home/shared/megan-6.24.20/tools/daa-meganizer \
    --in data/sr-blastx-f2r.daa \
    --threads 32 \
    --mapDB /home/shared/16TB_HDD_01/sr320/github/renee-myco/project/data/megan-map-Feb2022.db #this is the taxonomy database file, updated Feb 2022
/home/shared/megan-6.24.20/tools/daa2rma \
--in output/sr-blastx-f2r.daa \
--mapDB /home/shared/16TB_HDD_01/sr320/github/renee-myco/project/data/megan-map-Feb2022.db \
--out data/sr-blastx-meganized.rma6 \
--threads 40 2>&1 | tee --append output/daa2rma_log.txt

The tree can be viewed here.

A total of 1002 taxa were detected. The top hits include Acidobacteria bacterium, Actinobacteria bacterium, Chloroflexi bacterium, Verrucomicrobia bacterium. (All are bacteria.) Genera of fungi detected include: Mortierellaceae, Diversisporales, Paraglomus, Rhizophagus, Metarhizium.

F1B - bulk field soil

Now we will perform the same functions using the second fasta file. First we run the fastq file against the NCBI database.

    /home/shared/diamond-2.1.8 blastx \
    --db /home/shared/16TB_HDD_01/sam/databases/blastdbs/ncbi-nr-20250429.dmnd \
    --query /home/shared/16TB_HDD_01/fish546/renee/F1B-KM40_merged.fastq.gz \
    --out output/sr-blastx-f1b.daa \
    --outfmt 100 \
    --top 5 \
    --block-size 15.0 \
    --index-chunks 4 \
    --threads 32

Then we start to get the data in MEGAN.

    /home/shared/megan-6.24.20/tools/daa-meganizer \
    --in output/sr-blastx-f1b.daa \
    --threads 32 \
    --mapDB /home/shared/16TB_HDD_01/sr320/github/renee-myco/project/data/megan-map-Feb2022.db 

This produces a .rma file that can be opened into MEGAN and the metadata used to create a phylogenetic tree.

/home/shared/megan-6.24.20/tools/daa2rma \
--in output/sr-blastx-f1b.daa \
--mapDB /home/shared/16TB_HDD_01/sr320/github/renee-myco/project/data/megan-map-Feb2022.db \
--out output/sr-blastx-f1b.meganized.rma6 \
--threads 40 2>&1 | tee --append output/daa2rma_log2.txt

The tree can be viewed here.

A total of 933 taxa were detected. The top hits include Acidobacteria bacterium, Alphaproteobacteria bacterium, Betaproteobacteria bacterium, Deltaproteobacteria bacterium. (All are bacteria.) Groups of fungi detected include: Mortierellaceae, Diversisporales, Paraglomus, Rhizophagus, Aspergillaceae.

While there are fewer taxa than in rhizosphere soil, the general composition seems to be similar. Most soil microbes are located in the rhizosphere, so this is consistent with our expectations.

Metagenome data comparison

Now that I have assembled metagenomes and taxonomy for each sample, I want to know how they compare. Specifically, I’m interested in what taxa they have in common and which are unique.

I could export the data in Megan (.rma6 file) to a .txt, then convert the data into a common format like .xlsx or Google Sheets for comparison. But the .txt file sizes can be too large (>600 MB) for these programs. Instead I can use some command-line tools to get a more manageable species list as a .txt in the output folder.

/home/shared/megan-6.24.20/tools/rma2info \
    -i output/sr-blastx-f2r.meganized.rma6 \
    -s Taxonomy \
    -o output/f2r_species.txt
    
/home/shared/megan-6.24.20/tools/rma2info \
    -i output/sr-blastx-f1b.meganized.rma6 \
    -r2c Taxonomy \
    -o output/f1b_species.txt

Unfortunately, we only have the numeric taxon IDs, not species names. We can see species ID in the Megan GUI, but it’s not being extracted with rma2info.

So I’ll use R to read the files, map NCBI Taxonomy IDs to names, and summarize counts by species or genus.

R studio cleanup

library(taxize)
library(dplyr)
library(readr)
library(tidyr)

Sys.setenv(ENTREZ_KEY = "d228d1d99f7f95247415e561fdb7a2bdd408") #tells taxize your NCBI ke

map_taxa <- function(file_path, label) {
  df <- read_tsv(file_path, col_names = c("read_id", "tax_id")) %>%
    mutate(
      source = label,
      tax_id = as.character(tax_id) 
    )
  
  unique_ids <- unique(df$tax_id)
  
  # Query NCBI with taxize
  tax_lookup <- taxize::classification(unique_ids, db = "ncbi", messages = FALSE)
  
  # create data frame
  tax_df <- bind_rows(lapply(tax_lookup, function(x) {
    if (inherits(x, "data.frame")) {
      x %>% 
        filter(rank %in% c("genus", "species")) %>%
        select(rank, name) %>%
        pivot_wider(names_from = rank, values_from = name)
    } else {
      tibble(genus = NA_character_, species = NA_character_)
    }
  }), .id = "tax_id")
  
  # Coerce tax_id to character again to match df
  tax_df <- tax_df %>% mutate(tax_id = as.character(tax_id))
  
  # Join and return selected columns
  df %>%
    left_join(tax_df, by = "tax_id") %>%
    select(source, tax_id, genus, species)
}

# load in files
f1b <- map_taxa("output/f1b_species.txt", "f1b")
f2r <- map_taxa("output/f2r_species.txt", "f2r")

# Combine/summarize
combined <- bind_rows(f1b, f2r) %>%
  group_by(source, genus, species) %>%
  summarise(n_reads = n(), .groups = "drop") %>%
  arrange(desc(n_reads))

# pivot for easier comparison
pivoted <- combined %>%
  unite("taxon", genus, species, sep = " ", na.rm = TRUE) %>%
  pivot_wider(names_from = source, values_from = n_reads, values_fill = 0)

print(pivoted)

# save results to .csv for sharing
write_csv(pivoted, "output/taxon_comparison.csv")

The .csv file was imported to a GoogleSheet for cleanup, and exported as a new metagenome_comparison.csv for further analysis and visualization.

Venn diagram and scatterplot of shared and unique taxa

# Load required packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(VennDiagram)
Loading required package: grid
Loading required package: futile.logger
# load data
data <- read_csv("output/metagenome_comparison.csv")
Rows: 433 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): taxon
dbl (2): F1B, F2R

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Prepare presence/absence
data_presence <- data %>%
  mutate(
    F1B_present = F1B > 0,
    F2R_present = F2R > 0
  )

# Venn diagram counts
only_F1B <- sum(data_presence$F1B_present & !data_presence$F2R_present)
only_F2R <- sum(!data_presence$F1B_present & data_presence$F2R_present)
both <- sum(data_presence$F1B_present & data_presence$F2R_present)

# Draw Venn diagram
venn.plot <- draw.pairwise.venn(
  area1 = only_F1B + both,
  area2 = only_F2R + both,
  cross.area = both,
  category = c("F1B", "F2R"),
  fill = c("skyblue", "orange"),
  alpha = 0.5,
  cat.pos = c(-20, 20),
  scaled = FALSE
)

# save to PDF
pdf("images/taxa_venn_diagram.pdf")
grid.draw(venn.plot)
dev.off()
png 
  2 
# abundance scatterplot data rep
scatter_data <- data_presence %>%
  mutate(
    shared_status = case_when(
      F1B_present & F2R_present ~ "Shared",
      F1B_present & !F2R_present ~ "F1B only",
      !F1B_present & F2R_present ~ "F2R only"
    )
  )

# Scatter plot comparing abundances
ggplot(scatter_data, aes(x = F1B, y = F2R, color = shared_status)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_manual(values = c("Shared" = "purple", "F1B only" = "blue", "F2R only" = "red")) +
  labs(
    title = "Comparison of Taxon Abundance Between Metagenomes",
    x = "F1B Read Counts (log10 scale)",
    y = "F2R Read Counts (log10 scale)",
    color = "Taxon Status"
  ) +
  theme_minimal()
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.

# save scatter plot
ggsave("images/taxa_abundance_comparison.png", width = 8, height = 6)
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
log-10 transformation introduced infinite values.

There is a discrepancy between the ~400 taxa captured by Taxize in R (and displayed in the Venn diagram and scatterplot) versus the number of taxa described by MEGAN. The most likely reasons are:

  1. differences and issues in taxonomic name resolution - Taxize may not recognize a taxon if it is informal or incomplete (e.g., "Bacteria_unclassified") or an environmental placeholder (e.g., "Candidatus Solibacter").
  2. taxonomic duplication - MEGAN counts different strains of the same species (e.g. Bradyrhizobium sp. AUGA SZCCT0282, Bradyrhizobium sp. SRL28) as 2 taxa.

Compared to Taxize, MEGAN tends to keep ambiguous entries and makes broader inferences with naming. Declaring a species count will depend on chosen parameters and assumptions. If one wants to be more conservative, stick with Taxize.

Analysis of nitrogenase gene clusters

Nitrogenase is a bacterial enzyme that fixes atmospheric nitrogen (N2) to ammonia (NH3). It consists of alpha (nifD) and beta (nifK) subunits of the molybdenum-iron protein and an ATP-binding iron-sulfur protein (nifH). We will need to use some annotation tools to screen for nif genes in our samples (e.g., nifH, nifD, nifK).

To do this, we will use a tool called HMMER to search sequence databases for sequence homologs like nif genes. It uses probabilistic models called profile Hidden Markov models (profile HMMs).

nif gene activity between samples

In theory, we could use HMMER to get the number of nitrogenase genes directly from our merged metagenome files. But since nitrogenase genes are protein-coding, and HMMER works best on predicted proteins, raw nucleotide sequences in our merged files are not ideal and could lead to false negatives.

Prodigal for gene prediction

First, I need to predict proteins from the contigs from the assembly step. Prodigal was cloned from GitHub, installed /home/shared/16TB_HDD_01/fish546/renee/ and compiled using commands similar to those in previous sections.

/home/shared/16TB_HDD_01/fish546/renee/Prodigal/prodigal \
  -i /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/final.contigs.fa \
  -a /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa \
  -d /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_nucleotides.fna \
  -o /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/prodigal_output.gbk \
  -f gbk \
  -p meta

The output confirmed a number of hits for this file. We can see predicted_proteins.faa, but unfortunately cannot preview the file using the head command. Now we’ll do this for the second metagenome.

/home/shared/16TB_HDD_01/fish546/renee/Prodigal/prodigal \
  -i /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F1B_KM40_out/final.contigs.fa \  # input FASTA file of assembled contigs
  -a /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F1B_KM40_out/predicted_proteins.faa \  # output amino acid sequences (for HMMER screening)
  -d /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F1B_KM40_out/predicted_nucleotides.fna \
  -o /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F1B_KM40_out/prodigal_output.gbk \
  -f gbk \  #format of output file, GenBank format
  -p meta. #mode optimized for metagenomic data

Now we have 2 protein fasta files that we can use for a HMMER search.

HMMER

I searched the protein fasta files for specific HMMs. To do this, I downloaded the HMM files from the NCBI accession pages in the previous section, then uploaded them to Raven via scp command. Before we run a search, I needed to index the .hmm files.

cd /home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/
hmmpress nifH.hmm
hmmpress nifD.hmm
hmmpress nifK.hmm

Then I was ready to start searching the F2R protein fasta.

hmmsearch --tblout output/nif_results/F2R/nifH.tbl \
  --cpu 4 \
  /home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifH.hmm \
  /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa

hmmsearch --tblout output//nif_results/F2R/nifD.tbl \
  --cpu 4 \
  /home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifD.hmm \
  /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa

hmmsearch --tblout output//nif_results/F2R/nifK.tbl \
  --cpu 4 \
  /home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifK.hmm \
  /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa

Now we will repeat the same commands for F1B.

hmmsearch --tblout output/nif_results/F1B/nifH.tbl \
  --cpu 4 \
  /home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifH.hmm \
  /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F1B_KM40_out/predicted_proteins.faa

hmmsearch --tblout output//nif_results/F1B/nifD.tbl \
  --cpu 4 \
  /home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifD.hmm \
  /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F1B_KM40_out/predicted_proteins.faa

hmmsearch --tblout output//nif_results/F1B/nifK.tbl \
  --cpu 4 \
  /home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifK.hmm \
  /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F1B_KM40_out/predicted_proteins.faa

Visualizing nif gene hits

We can visualize the number of hits of nif genes across the samples using a simple bar graph. We will keep the 0 counts for nifD and nifK in the interest of a complete data presentation.

# Load libraries
library(tidyverse)
library(ggplot2)

# paths to .tbl files
tbl_files <- list(
  F1B_nifH = "output/nif_results/F1B/nifH.tbl",
  F1B_nifD = "output/nif_results/F1B/nifD.tbl",
  F1B_nifK = "output/nif_results/F1B/nifK.tbl",
  F2R_nifH = "output/nif_results/F2R/nifH.tbl",
  F2R_nifD = "output/nif_results/F2R/nifD.tbl",
  F2R_nifK = "output/nif_results/F2R/nifK.tbl"
)

# parse tblout and count hits
parse_tbl <- function(file, sample, gene) {
  lines <- readLines(file)
  data_lines <- lines[!grepl("^#", lines)]  # Remove comment lines
  n_hits <- length(data_lines)  # Count non-comment lines
  tibble(sample = sample, gene = gene, hits = n_hits)
}

# making a data frame
results <- bind_rows(
  parse_tbl(tbl_files$F1B_nifH, "F1B", "nifH"),
  parse_tbl(tbl_files$F1B_nifD, "F1B", "nifD"),
  parse_tbl(tbl_files$F1B_nifK, "F1B", "nifK"),
  parse_tbl(tbl_files$F2R_nifH, "F2R", "nifH"),
  parse_tbl(tbl_files$F2R_nifD, "F2R", "nifD"),
  parse_tbl(tbl_files$F2R_nifK, "F2R", "nifK")
)

# simple barplot of gene hits per sample
ggplot(results, aes(x = gene, y = hits, fill = sample)) +
  geom_col(position = position_dodge()) +
  labs(title = "Nitrogenase Gene Hits per Metagenome",
       x = "Gene",
       y = "Number of Hits") +
  theme_minimal()

Conclusions

In this dataset, nifH was detected, but nifD and nifK were not. For a microbe to fix N, all 3 need to be present and active. This could reflect fragmented operons that were present in the community, but not fully assembled due to sequencing or assembly limitations. As a commonly used marker gene for N fixation, nifH tends to be more abundant and is also more prone to horizontal gene transfer, making it possible to detect it even in the absence of a complete nitrogenase operon. (It’s also possible that nifH is retained in dormant or non-functional contexts.) For a fuller picture, screening for alternative nitrogenase genes such as vnfDHK and anfDHK may reveal additional diazotrophic potential. Also, current HMM profiles may fail to detect divergent nifD/K sequences, particularly if they differ substantially from their canonical forms. Interestingly, there were slightly more nif genes in rhizosphere samples than in bulk soil, and many of the top hits mapped to known diazotrophic taxa, supporting the ecological relevance of these findings.

With all of that being said, the number of nifH hits are very low for an agricultural soil. This could occur for several reasons:

1. HMM thresholds are too strict - check e-value and scores, maybe lower them them more lenient,

2. HMM database mismatch - double check to see if others should be used.

I could benchmark by downloading a few soil metagenomes from MG-RAST or NCBI SRA, running the same pipeline, and seeing if they yield higher nifH counts.

Next steps: scaling up

Following the establishment of this initial pipeline, I am planning a scale up of the metagenome processing efforts from this greenhouse study.

  • Trim on Trimmomatic

  • Merge with fastp

  • Assemble with Megahit

  • Phylogeny with MEGAN

  • Functional analysis with Prodigal and HMMER

  • PCAs/other analyses (e.g. heat maps) as needed

This would take place on the Hyak HPC cluter and utilize bash scripting (e.g. xargs) to support batch Unix commands.

DOI

References

Bolger, A. M., Lohse, M., & Usadel, B.(2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.

F. Meyer, D. Paarmann, M. D’Souza, R. Olson , E.M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R. A. Edwards. The Metagenomics RAST server — A public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 2008, 9:386. http://www.biomedcentral.com/1471-2105/ 9/386

Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007 Mar;17(3):377-86. doi: 10.1101/gr.5969107. Epub 2007 Jan 25. PMID: 17255551; PMCID: PMC1800929.

Hyatt, D., Chen, GL., LoCascio, P.F. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010). https://doi.org/10.1186/1471-2105-11-119

NFixDB (Nitrogen Fixation DataBase) - A Comprehensive Integrated Database for Robust ’Omics Analysis of Diazotrophs. (preprint)

Masuda Y, Mise K, Xu Z, Zhang Z, Shiratori Y, Senoo K, Itoh H. 2024. Global soil metagenomics reveals distribution and predominance of Deltaproteobacteria in nitrogen-fixing microbiome. Microbiome. 12(1):95. doi:10.1186/s40168-024-01812-1.

Song Z-Q, Wang L, Liang F, Zhou Q, Pei D, Jiang H, Li W-J. 2022. nifH gene expression and diversity in geothermal springs of Tengchong, China. Front Microbiol. 13:980924. doi:10.3389/fmicb.2022.980924.

Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis. Microb Inform Exp. 2012 Feb 9;2(1):3. doi: 10.1186/2042-5783-2-3.

Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: A fast and accurate Illimuna Paired-End reAd mergeR.

Vollmers J, Wiegand S, Kaster AK. Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist’s Perspective - Not Only Size Matters! PLoS One. 2017 Jan 18;12(1):e0169662. doi: 10.1371/journal.pone.0169662. PMID: 28099457; PMCID: PMC5242441.