Fungal profiling in gut metagenomic data

Step 1 Survey of fungal databases & fungal genome collection

RefSeq: RefSeq-bf, which contains only the complete and curated genomes of fungi and bacteria.
JGI/MycoCosm: MycoCosm offers the largest available collection of fungal genomes, for comparative genomics across phylo- and eco-groups.
Ensembl Fungi: most of its genomes are taken from the INSDC databases.
MycoBank: an online database that serves the mycological and scientific community by documenting nomenclatural novelties and associated data.
FungiDB: genomes are sourced mainly from INSDC; it includes many fungal (and oomycete) species, including non-pathogens.
Broad Institute: genomes are sourced from NCBI.
UNITE: fungal ITS sequences.
Fungorum: genome download not available.
Note: INSDC is the name DDBJ, EMBL-EBI and GenBank agreed on for their collaboration.

An in-house shell script gathered all representative and reference genomes listed in the GenBank assembly_summary.txt file.
For the JGI database, we used the command-line tool [jgi-query](https://github.com/glarue/jgi-query). After downloading, we verified the MD5 checksum of every fungal genome.
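
The in-house script itself is not reproduced here. As a minimal sketch of the selection and checksum steps, assuming the usual tab-separated assembly_summary.txt layout (refseq_category in column 5, ftp_path in column 20) and `md5sum` from GNU coreutils, something like the following could be used. The function names `list_genome_urls` and `verify_md5` are hypothetical.

```shell
# Print the download URL of every reference or representative genome
# listed in an assembly_summary.txt file ($1).
# Assumes column 5 = refseq_category and column 20 = ftp_path.
list_genome_urls() {
    awk -F '\t' '
        /^#/ { next }                                   # skip header lines
        ($5 == "reference genome" || $5 == "representative genome") && $20 != "na" {
            n = split($20, parts, "/")                  # last path element names the files
            print $20 "/" parts[n] "_genomic.fna.gz"
        }' "$1"
}

# Succeed only if the MD5 digest of file $1 equals the expected digest $2.
verify_md5() {
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}
```

The printed URLs can be passed to any downloader, and `verify_md5 genome.fna.gz <digest>` can then be run against the digest published alongside each file.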

337 fungal genomes were downloaded from JGI.
By cross-referencing the metadata files of Ensembl and RefSeq, we downloaded their complementary (non-overlapping) set of fungal genomes.
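
A minimal sketch of how such a complementary set could be derived, assuming each metadata file has first been reduced to one assembly identifier per line in a form comparable between the two sources (an assumption; the actual mapping used is not shown). The function name `complement_ids` is hypothetical.

```shell
# Print the identifiers in candidate list $2 that are absent from list $1.
# Both files hold one identifier per line (first whitespace-separated field).
complement_ids() {
    awk '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # remember ids already held
        !($1 in seen)                                # keep only the new ones
    ' "$1" "$2"
}
```

For example, `complement_ids refseq_ids.txt ensembl_ids.txt` would list the Ensembl entries that still need downloading.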

Step 2 Principles of available software

FindFungi: downloads all fungal genomes from GenBank; 949 fungal genomes (32.4 GB).
MiCoP: uses the full NCBI RefSeq viral and fungal databases.
hmmscan: the database is based on the complete fungal genomes available at the NCBI website: 1,213 entries from 66 fungal genomes, or 38,000 entries (including 'non-completed' genomes) corresponding to 265 different fungal genomes.
EukDetect: uses a database of 521,824 universal marker genes from 241 conserved gene families, curated from 3,713 fungal, protist, non-vertebrate metazoan, and non-streptophyte archaeplastida genomes and transcriptomes.
CCMetagen: uses the NCBI taxonomy database (taxids) to produce ranked and updated taxonomic classifications [ready-to-use reference databases: NCBI nt and RefSeq].

Step 3 Fungal genome quality assessment - BUSCO

Fungal sequences shorter than 1,100 nucleotides were discarded, amounting to 656 kb or 2% of the total. This conservative cut-off was used to avoid biases from poorly assembled short genomic sequences.
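
The length cut-off could be applied with a small awk filter such as the sketch below. It is illustrative rather than the script actually used; the function name `filter_fasta_by_length` is hypothetical, and the output writes each retained sequence on a single line.

```shell
# Print only the FASTA records of file $1 whose sequence is at least
# $2 nucleotides long (default 1100). Sequences are emitted unwrapped.
filter_fasta_by_length() {
    awk -v min="${2:-1100}" '
        function emit() {
            if (header != "" && length(seq) >= min) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # join wrapped sequence lines
        END { emit() }                                 # do not forget the last record
    ' "$1"
}
```

Typical use would be `filter_fasta_by_length scaffolds.fasta 1100 > scaffolds.filtered.fasta`.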

busco -i scaffolds.fasta -l fungi_odb10 -o SPEC1 -m genome -c 5

BUSCO categories: I. Complete and single-copy; II. Complete and duplicated; III. Fragmented; IV. Missing.
RefSeq: complete: mean > 88%, variance < 5%
Ensembl: complete: mean = 79.2%, standard variance = 7.90709%
JGI: complete: mean = 78.337%, standard variance = 11.26374%
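
Per-database summaries of this kind could be computed from the BUSCO short summary files roughly as sketched below. The sketch assumes each summary contains the usual one-line score string of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`, and it reports the sample standard deviation, which may not be the exact spread statistic used above. The function names are hypothetical.

```shell
# Print the "Complete" percentage found in each BUSCO short summary file.
busco_complete() {
    for f in "$@"; do
        sed -n 's/.*C:\([0-9.]*\)%.*/\1/p' "$f" | head -n 1
    done
}

# Read one percentage per line on stdin; print "mean sd" (sample SD).
# Fails when fewer than two values are supplied.
mean_sd() {
    awk '{ s += $1; ss += $1 * $1; n++ }
         END {
             if (n < 2) exit 1
             m = s / n
             printf "%.3f %.3f\n", m, sqrt((ss - n * m * m) / (n - 1))
         }'
}
```

For example, `busco_complete */short_summary*.txt | mean_sd` would summarise all genomes of one database.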

Step 4 Integration of fungal genomes with the standard Kraken database

Why Kraken: Kraken is compatible with custom-made reference databases [KrakenUniq gold standard].
Kraken standard database: archaea, bacteria, human, plasmid, UniVec_Core, fungi, viral, protozoa.
Fungal genomes added: sequence headers were modified to append the Kraken taxid identifier (the NCBI taxonomy identification number).
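
As an illustration of the header modification, the sketch below appends `|kraken:taxid|<taxid>` to the sequence ID of every FASTA header, which is the tag Kraken 2 recognises for custom sequences. The function name `add_kraken_taxid` is hypothetical, and the taxid of each genome is assumed to be known beforehand.

```shell
# Rewrite FASTA file $1 so that every header carries the Kraken taxid tag
# for taxid $2; any description after the sequence ID is preserved.
add_kraken_taxid() {
    awk -v taxid="$2" '
        /^>/ {
            rest = substr($0, length($1) + 1)          # description, if any
            print $1 "|kraken:taxid|" taxid rest
            next
        }
        { print }                                      # sequence lines unchanged
    ' "$1"
}
```

For instance, `add_kraken_taxid genome.fa 4932 > chr_genome.fa` would tag a Saccharomyces cerevisiae assembly (NCBI taxid 4932).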

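kraken2-build --download-taxonomy --db $DBNAME
for lib in archaea bacteria human plasmid UniVec_Core fungi viral protozoa
do
    kraken2-build --download-library $lib --db $DBNAME
done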

for file in chr*.fa
do
    kraken2-build --add-to-library $file --db $DBNAME
done

kraken2-build --build --db $DBNAME && bracken-build -d $DBNAME

Step 5 Mock community construction based on CAMISIM to benchmark our procedure

We benchmarked Kraken with the enlarged database on two batches of mock communities that differed in size. We also include two case studies with real biological data to demonstrate that our procedure produces a comprehensive overview of the eukaryotic members of microbial communities. In addition, we used the standard dataset from the mOTU paper.

Kraken species-level identification of fungi

+---------+-------------+-------------+-------------+-----------+---------------+
| Setting | FP rate     | Precision   | Recall      | F1 score  | Sample type   |
+=========+=============+=============+=============+===========+===============+
| -10     | 0.529166667 | 0.23742779  | 0.299736081 | 0.2649682 | Small (~200M) |
| -100    | 0.250813953 | 0.162867647 | 0.03968254  | 0.063816  |               |
+---------+-------------+-------------+-------------+-----------+---------------+
| -10     | 0.912792336 | 0.045138571 | 0.311926606 | 0.07886   | Big (~7.5G)   |
| -100    | 0.67927116  | 0.164476934 | 0.311926606 | 0.21538   |               |
+---------+-------------+-------------+-------------+-----------+---------------+
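
For reference, precision, recall and F1 relate to the species-level counts as sketched below, using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = their harmonic mean). How the FP rate was defined for this benchmark is not stated, so it is left out. Both function names are hypothetical.

```shell
# Print "precision recall F1" from counts of true positives ($1),
# false positives ($2) and false negatives ($3).
classification_metrics() {
    awk -v tp="$1" -v fpos="$2" -v fneg="$3" 'BEGIN {
        p = (tp + fpos > 0) ? tp / (tp + fpos) : 0
        r = (tp + fneg > 0) ? tp / (tp + fneg) : 0
        f = (p + r > 0) ? 2 * p * r / (p + r) : 0
        printf "%.3f %.3f %.3f\n", p, r, f
    }'
}

# Print the F1 score for a given precision ($1) and recall ($2).
f1_from_pr() {
    awk -v p="$1" -v r="$2" 'BEGIN {
        printf "%.4f\n", (p + r > 0) ? 2 * p * r / (p + r) : 0
    }'
}
```

As a consistency check, `f1_from_pr 0.23742779 0.299736081` reproduces the F1 score reported for the small community at the -10 setting, up to rounding.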

Step 6 Manual

kraken2 --db $full_db --report $data.full_report $data && bracken -d $full_db -i $data.full_report -o $data.full_bracken_file
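
To process many samples, the same command can be generated once per input file. The sketch below only prints the commands so they can be inspected before being run. It assumes one read file path per line on standard input and paths without spaces; the function name `print_classify_cmds` is hypothetical.

```shell
# Print (do not run) the Kraken 2 + Bracken command for each sample path
# read from stdin, using the database directory given as $1.
print_classify_cmds() {
    db=$1
    while IFS= read -r data; do
        [ -n "$data" ] || continue                     # skip blank lines
        printf 'kraken2 --db %s --report %s.full_report %s && bracken -d %s -i %s.full_report -o %s.full_bracken_file\n' \
            "$db" "$data" "$data" "$db" "$data" "$data"
    done
}
```

Once the printed commands look right, they can be executed with `print_classify_cmds "$full_db" < samples.txt | sh`.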