Nitrogen (N) fixation is a vital microbial process in soils. N-fixing microbes fix atmospheric N2 to a plant-usable form (NH3) via nitrogenase enzyme complex formed from genes nifH, nifD, nifK.
I analyzed the metagenomes from 2 (of 48) soil samples that were used a greenhouse study prior to BioBead treatment.
The goal is to identify the microbial communities, analyze taxa, screen for nitrogenase genes, and establish a pipeline.
Recent meta analyses indicate poor overall biofertilizer performance,
How do soil inoculations interact with microbes in soils in which they’re applied?
Can we scale soil ecotechnologies like BioBead? Where is the sweet spot between ecology and engineering?
Welcome to my PhD 😀
Trimming removes adapters and low-quality bases.
java -jar /home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 4
F1B-KM40_R1_001.fastq.gz F1B-KM40_R2_001.fastq.gz \
F1B-KM40_trimmed_R1_paired.fastq.gz F1B-KM40_trimmed_R1_unpaired.fastq.gz \
F1B-KM40_trimmed_R2_paired.fastq.gz F1B-KM40_trimmed_R2_unpaired.fastq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50 #keeps bps equal or above 50 This produces unpaired and paired file outputs. Paired reads are those which both forward and reverse survived trimming. These are used for downstream analysis like merging and assembly.
These are R1/R2 (forward and reverse reads) and will have to be merged. This is the last component of pre-processing as we work towards metagenome assembly.
/home/shared/fastp-v0.24.0/fastp \
-i F1B-KM40_trimmed_R1_paired.fastq.gz \
-I F1B-KM40_trimmed_R2_paired.fastq.gz \
--merge \
--merged_out F1B-KM40_merged.fastq.gz \
/home/shared/fastp-v0.24.0/fastp \
-i F2R-KM41_trimmed_R1_paired.fastq.gz \
-I F2R-KM41_trimmed_R2_paired.fastq.gz \
--merge \
--merged_out F2R-KM41_merged.fastq.gz To assemble the metagenome files, MEGAHIT was used.
./megahit
-r ../F1B-KM40_merged.fastq.gz #specifying input file
-o megahit_F1B_KM40_out #output directory
--min-contig-len 500 #over 500 bps
-t 8 #8 threadsLike other steps, this is done with both files.
MEGAN is a program that graphs the taxonomical content from metagenomic data sets.
First we query Diamond BLAST.
Then we take the .daa files and put it through the meganizer. This is where we get the map.
The final step is to convert the .daa files to .rma ones to use in the MEGAN GUI.
| Organism | Classification |
|---|---|
| Acidobacteria bacterium | Bacteria |
| Alphaproteobacteria bacterium | Bacteria |
| Betaproteobacteria bacterium | Bacteria |
| Verrucomicrobia bacterium | Bacteria |
| Actinobacteria bacterium | Bacteria |
| Organism | Classification |
|---|---|
| Acidobacteria bacterium | Bacteria |
| Alphaproteobacteria bacterium | Bacteria |
| Betaproteobacteria bacterium | Bacteria |
| Deltaproteobacteria bacterium | Bacteria |
| Verrucomicrobia bacterium | Bacteria |
Step 1: generate a manageable species list in a .txt file.
/home/shared/megan-6.24.20/tools/rma2info \
-i output/sr-blastx-f2r.meganized.rma6 \
-s Taxonomy \
-o output/f2r_species.txtThis gives the numeric taxon IDs, not species names. We can see species ID in the Megan GUI, but it’s not being extracted with rma2info. We can use R to read the files, map NCBI Taxonomy IDs to names, and summarize counts by species or genus.
library(taxize)
library(dplyr)
library(readr)
library(tidyr)
Sys.setenv(ENTREZ_KEY = "d228d1d99f7f95247415e561fdb7a2bdd408") #tells taxize your NCBI ke
map_taxa <- function(file_path, label) {
df <- read_tsv(file_path, col_names = c("read_id", "tax_id")) %>%
mutate(
source = label,
tax_id = as.character(tax_id)
)
unique_ids <- unique(df$tax_id)
# Query NCBI with taxize
tax_lookup <- taxize::classification(unique_ids, db = "ncbi", messages = FALSE)
# create data frame
tax_df <- bind_rows(lapply(tax_lookup, function(x) {
if (inherits(x, "data.frame")) {
x %>%
filter(rank %in% c("genus", "species")) %>%
select(rank, name) %>%
pivot_wider(names_from = rank, values_from = name)
} else {
tibble(genus = NA_character_, species = NA_character_)
}
}), .id = "tax_id")
# Coerce tax_id to character again to match df
tax_df <- tax_df %>% mutate(tax_id = as.character(tax_id))
# Join and return selected columns
df %>%
left_join(tax_df, by = "tax_id") %>%
select(source, tax_id, genus, species)
}
# load in files
f1b <- map_taxa("output/f1b_species.txt", "f1b")
f2r <- map_taxa("output/f2r_species.txt", "f2r")
# Combine/summarize
combined <- bind_rows(f1b, f2r) %>%
group_by(source, genus, species) %>%
summarise(n_reads = n(), .groups = "drop") %>%
arrange(desc(n_reads))
# pivot for easier comparison
pivoted <- combined %>%
unite("taxon", genus, species, sep = " ", na.rm = TRUE) %>%
pivot_wider(names_from = source, values_from = n_reads, values_fill = 0)
print(pivoted)
# save results to .csv for sharing
write_csv(pivoted, "output/taxon_comparison.csv")Rubio LM, Ludden PW. 2008. Biosynthesis of the Iron-Molybdenum Cofactor of Nitrogenase. doi:10.1146/annurev.micro.62.081307.162737.
get HMM profiles of 3 nif genes from NCBI
use Prodigal to generate protein coding genes from final assembled contigs
use HMMER to screen files for HMMs
Nitrogenase is protein coding, so we need to filter out protein coding genes from the assembled metagenomes to avoid false negatives and unreliable data.
/home/shared/16TB_HDD_01/fish546/renee/Prodigal/prodigal \
-i /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/final.contigs.fa \
-a /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa \
-d /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_nucleotides.fna \
-o /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/prodigal_output.gbk \
-f gbk \
-p metaNow we repeat with the second file (F1B, bulk soil). Now we have 2 protein fasta files that we can use for a HMMER search.
Before the search starts, the HMM files need to be uploaded to directory and indexed.
hmmsearch --tblout output/nif_results/F2R/nifH.tbl \
--cpu 4 \
/home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifH.hmm \
/home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa
hmmsearch --tblout output//nif_results/F2R/nifD.tbl \
--cpu 4 \
/home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifD.hmm \
/home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa
hmmsearch --tblout output//nif_results/F2R/nifK.tbl \
--cpu 4 \
/home/shared/16TB_HDD_01/fish546/renee/hmm_profiles/nifK.hmm \
/home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out/predicted_proteins.faa# --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name accession query name accession E-value score bias E-value score bias exp reg clu ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ ----- --- --- --- --- --- --- --- --- ---------------------
k141_2395_1 - nifH TIGR01287.1 1.7e-08 33.9 0.1 2.5e-08 33.4 0.1 1.2 1 0 0 1 1 1 1 # 81 # 650 # 1 # ID=7481_1;partial=01;start_type=GTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.630
k141_11338_2 - nifH TIGR01287.1 2.2e-07 30.3 0.0 2.5e-07 30.1 0.0 1.1 1 0 0 1 1 1 1 # 273 # 569 # 1 # ID=9864_2;partial=01;start_type=GTG;rbs_motif=GGA/GAG/AGG;rbs_spacer=11-12bp;gc_cont=0.687
k141_19658_2
nifH detected but no nifD or nifK
could be present but not assembled
nifH is a marker gene and more abundant; also subject to HGT, operons or dormant
could screen for alternative nitrogenase genes: vnfDHK, anfDHK
HMM profiles may miss divergent versions of nifD/K if they’re evolutionarily distant from canonical forms
slightly more nif genes in the rhizosphere than bulk soil
detected taxa/top hits are known diazotrophs
Trim on Trimmomatic
Merge with fastp
Assemble with Megahit
Phylogeny with MEGAN
Functional analysis with Prodigal and HMMER
Get into PCAs/other analyses TBD
Run on Hyak and utilizing bash scripting to manage the large number of files.