R Notebook: Provides reproducible analysis for metagenome (MG) data in the following manuscript:
Citation: Romanowicz, KJ, Crump, BC, Kling, GW. (2021) Rainfall alters permafrost soil redox conditions, but meta-omics show divergent microbial community responses by tundra type in the arctic. Soil Systems 5(1): 17. https://doi.org/10.3390/soilsystems5010017
GitHub Repository: https://github.com/kromanowicz/2021-Romanowicz-SoilSystems
NCBI BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA666429
Accepted for Publication: Soil Systems 10 March 2021
This R Notebook provides complete reproducibility of the data analysis in “Rainfall alters permafrost soil redox conditions, but meta-omics show divergent microbial community responses by tundra type in the Arctic” by Romanowicz, Crump, and Kling. In this experiment, mesocosms containing soil from the active layer of two dominant tundra types were subjected to simulated rainfall to alter redox conditions. The microbial functional potential (metagenomics) and gene expression (metatranscriptomics) patterns were measured during saturated anoxic redox conditions prior to rainfall and at multiple time points following the simulated rainfall event. Other measurements include soil properties as well as microbial respiration (CO2) and methane (CH4) production from soil subsamples collected at each sampling time point. The purpose was to determine if rainfall, as a form of soil oxidation, is sufficient to alter the anoxic redox conditions in arctic tundra and enhance the microbial degradation of organic carbon and CH4 to CO2.
Conceptual Figure. A total of 12 tundra mesocosms (3 replicates x 2 tundra types x 2 sets of response cores) were acclimated initially under anoxic redox conditions to mimic field conditions (T0). Dissolved oxygen was supplied to soils through the downward flow of oxygenated water during a simulated rainfall event. Dissolved oxygen will likely change the redox gradient directly following rainfall (T4) as a short-term effect. Anoxic conditions will likely be re-established after 24 hours (T24) as the pulse of oxygen is consumed through abiotic and biotic soil processes. Under anoxic redox conditions (T0), microorganisms likely degrade organic carbon through anaerobic and fermentation pathways, producing CH4 and reducing Fe(III) to Fe(II). Rainfall-induced soil oxidation (T4) should stimulate heterotrophic microorganisms that degrade organic carbon and CH4 through aerobic metabolic pathways, releasing CO2. Soil oxidation should also stimulate aerobic autotrophic iron oxidizing bacteria that oxidize Fe(II) to Fe(III) and convert CO2 into microbial biomass. The long-term response (T24) will likely be a combination of aerobic and anaerobic metabolism as well as a combination of reduction and oxidation iron reactions as dissolved oxygen is consumed. The predicted redox conditions and predicted redox reactions for coupled Fe(II)/Fe(III) cycling, as well as the microbial-induced release of CO2 or CH4 at each time point are based on the predicted availability of dissolved oxygen entering tundra soils through simulated rainfall.
Soil Sampling for Microbial Gene Expression
An initial soil sampling event for microbial activity was conducted at the end of the anoxic acclimation period (4-7 days) in all mesocosm replicates, representing sampling time point T0. Mesocosms were then flushed to simulate a rainfall event. Additional soil sampling events were conducted at T4 (4-hrs) and T24 (24-hrs) following the rainfall event to determine the temporal extent of microbial gene expression. Soil cores (2.54 cm diameter, 30 cm length) were extracted in duplicate from each mesocosm replicate at each sampling time point and homogenized by depth in 10-cm increments. The 10-20 cm soil increment, composed of organic soil in all mesocosm replicates, was chosen for microbial gene expression analysis and preserved in RNAlater Stabilization Reagent in sterile tubes at 4°C for 18 hours and then stored at -80°C until extraction.
Field Experiment. Tundra soil cores were collected from field sites in August 2017 (top left) and placed in buckets to establish the mesocosm experiment (bottom left). Tussock tundra cores were composed of an organic soil layer overlying a mineral soil layer (top middle) while wet sedge tundra cores were composed entirely of organic soil (bottom middle). Soil subsampling for microbial activity was taken from the 10-20 cm depth of duplicate soil cores in Tussock (top right) and Wet Sedge (bottom right).
# Make a vector of required packages
required.packages <- c("ape","cowplot","data.table","devtools","dplyr","DT","ggplot2","ggpubr","grid","gridExtra","kableExtra","knitr","pheatmap","png","RColorBrewer","reshape","rstatix","statmod","stringr","tibble","tidyr","tidyverse","vegan","yaml")
# Load required packages
lapply(required.packages, library, character.only = TRUE)
Extracted DNA was sequenced on the Illumina HiSeq 4000 platform (150bp paired-end reads) at the University of Michigan Advanced Genomics Core.
| ID | Sample | Concentration (ng/ul) | Volume (uL) | Fragment Length | i7 | i5 | Barcode i7 | Barcode i5 |
|---|---|---|---|---|---|---|---|---|
| S107104 | WS1_T0 | 10 | 40 | 500 | N701 | S517 | TAAGGCGA | TCTTACGC |
| S107105 | WS2_T0 | 10 | 40 | 500 | N702 | S517 | CGTACTAG | TCTTACGC |
| S107106 | WS3_T0 | 10 | 40 | 500 | N703 | S517 | AGGCAGAA | TCTTACGC |
| S107107 | Tuss1_T0 | 10 | 40 | 500 | N704 | S517 | TCCTGAGC | TCTTACGC |
| S107108 | Tuss2_T0 | 10 | 40 | 500 | N705 | S517 | GGACTCCT | TCTTACGC |
| S107110 | WS1_T4 | 10 | 40 | 500 | N701 | S502 | TAAGGCGA | ATAGAGAG |
| S107111 | WS2_T4 | 10 | 40 | 500 | N702 | S502 | CGTACTAG | ATAGAGAG |
| S107112 | WS3_T4 | 10 | 40 | 500 | N703 | S502 | AGGCAGAA | ATAGAGAG |
| S107113 | WS1_T24 | 10 | 40 | 500 | N704 | S502 | TCCTGAGC | ATAGAGAG |
| S107114 | WS2_T24 | 10 | 40 | 500 | N705 | S502 | GGACTCCT | ATAGAGAG |
| S107115 | WS3_T24 | 10 | 40 | 500 | N706 | S502 | TAGGCATG | ATAGAGAG |
| S107116 | Tuss1_T4 | 10 | 40 | 500 | N701 | S503 | TAAGGCGA | AGAGGATA |
| S107117 | Tuss2_T4 | 10 | 40 | 500 | N702 | S503 | CGTACTAG | AGAGGATA |
| S107119 | Tuss1_T24 | 10 | 40 | 500 | N704 | S503 | TCCTGAGC | AGAGGATA |
| S107121 | Tuss3_T24 | 10 | 40 | 500 | N706 | S503 | TAGGCATG | AGAGGATA |
| S107109 | Tuss3_T0 | 10 | 40 | 500 | N706 | S517 | TAGGCATG | TCTTACGC |
| S107118 | Tuss3_T4 | 10 | 40 | 500 | N703 | S503 | AGGCAGAA | AGAGGATA |
| S107120 | Tuss2_T24 | 10 | 40 | 500 | N705 | S503 | GGACTCCT | AGAGGATA |
NOTE: DNA Samples in RED were excluded from sequencing
Prep the Raw Reads (Hein: Comics – Omics Prep)
Use the comics container to prep the raw reads.
{Terminal}
cd /DNA_data/work
comics -- omics prep /DNA_data/Sample_1071*
DNA.data.raw.prep.pbs (“/DNA_data/work/”) 00:19 (hr:min)
.fastq files within each sample folder in the “DNA_data/work” directory as “Sample_107104_fwd.fastq”, “Sample_107104_rev.fastq”, etc.QC the Raw Reads (Smith: BBMap_QC)
{Terminal}
cd /DNA_data/work
bash bbmap_qc.sh > qc.log
DNA.data.bbmap.qc.pbs (“/DNA_data/work/”) 05:32 (hr:min)
Co-Assemble the QC Reads (Smith: MegaHit)
2.3.1 Concatenate the QC’d sample reads into a single fwd and rev file
{Terminal}
cd /DNA_data/work/co_assembly
cat *_adtrim_clean_qtrim_fwd.derep.fastq > 1071_DNA_data_fwd.derep.fastq
cat *_adtrim_clean_qtrim_rev.derep.fastq > 1071_DNA_data_rev.derep.fastq
DNA.data.qc.read.concatonate.pbs (“/DNA_data/work/”) 03:40 (hr:min)
2.3.2 Normalize Reads to 20X Coverage Prior to Assembly
{Terminal}
cd /DNA_data/work/co_assembly
comics bbnorm in=1071_DNA_data_fwd.derep.fastq in2=1071_DNA_data_rev.derep.fastq out=1071_DNA_data_bbnorm_fwd.derep.fastq out2=1071_DNA_data_bbnorm_rev.derep.fastq target=20 mindepth=5 threads=20 -Xmx425g &> bbnorm.log
DNA.data.bbnorm.pbs (“/DNA_data/work/”) 03:44 (hr:min)
2.3.3 Run the Assembly Module to Co-Assemble Contigs (Comics MegaHit)
MEGAHIT assembler to assemble reads into contigsMEGAHIT: k=255 is the maximum, which, with a step size of 6 or 8, is recommended unless severely limited by computational resources{Terminal}
cd /DNA_data/work/co_assembly
megahit --k-min 21 --k-max 255 --k-step 8 -1 1071_DNA_data_bbnorm_fwd.derep.fastq -2 1071_DNA_data_bbnorm_rev.derep.fastq -o ./megahit_out_bbnorm
DNA.data.bbnorm.megahit.coassembly.k255.pbs (“/DNA_data/work/”) 38:15 (hr:min)
2.3.4 Re-Format the Contigs File from the MegaHit Co-Assembly
MEGAHIT contained complex characters and spaces that were re-formatted prior to mapping
BAM files (generated from the mapper) matched those in the “final.contigs.fa” file PRIOR to importing into Anvi’o{Terminal}
cd /DNA_data/work/co_assembly/megahit_out_bbnorm
head final.contigs.fa
>k255_1 flag=1 multi=24.0000 len=301
CATGCCCTGGCGCTGCAGCGGATCTGGCAGCTGACGCCGGAACCGCAGCGGCCGGAGATCCGCGCGCGTCTGCAGGCGCTCTACCCGCGGCTGGTCAAGTGGCATCGCTATCTCGCCACCCAGCGCGACCCCGAGGCATCGGGACTGGTCACCGTATACCACCCGTGGGAGAGCACGGACAACTCCCCGCGCTGGGACCGCTCGCTGGCACGCATCGAGGTGGTCAAGCCCCGCCCGTACACCCGCCTGGACACCCGTCAGGTGCGGGACCCGTCGCAGCGGCCCACCGACTGGGACTACG
{Terminal}
cd /DNA_data/work/co_assembly/megahit_out_bbnorm
sed -e 's/\s/_/g' final.contigs.fa > final.contigs.fixed.bbnorm.fa
sed -e 's/=/_/g' final.contigs.fixed.bbnorm.fa > final.contigs.fixed2.bbnorm.fa
mv final.contigs.fixed2.bbnorm.fa final.contigs.fixed.bbnorm.fa
head final.contigs.fixed.bbnorm.fa
>k255_1_flag_1_multi_24.0000_len_301
CATGCCCTGGCGCTGCAGCGGATCTGGCAGCTGACGCCGGAACCGCAGCGGCCGGAGATCCGCGCGCGTCTGCAGGCGCTCTACCCGCGGCTGGTCAAGTGGCATCGCTATCTCGCCACCCAGCGCGACCCCGAGGCATCGGGACTGGTCACCGTATACCACCCGTGGGAGAGCACGGACAACTCCCCGCGCTGGGACCGCTCGCTGGCACGCATCGAGGTGGTCAAGCCCCGCCCGTACACCCGCCTGGACACCCGTCAGGTGCGGGACCCGTCGCAGCGGCCCACCGACTGGGACTACG
Index the Co-Assembled Contigs and Map (Hein: Omics Mapping)
{Terminal}
cd /DNA_data/work/co_assembly/megahit_out_bbnorm
comics -- omics mapping -a final.contigs.fixed.bbnorm.fa --index-only
DNA.data.bbnorm.index.contigs.pbs (“/DNA_data/work/”) 00:46 (hr:min)
2.4.1 Map Short Reads from Each Sample to the Contigs
{Terminal}
cd /DNA_data/work/co_assembly
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107104_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107104_adtrim_clean_qtrim_rev.derep.fastq -o 107104_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107105_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107105_adtrim_clean_qtrim_rev.derep.fastq -o 107105_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107106_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107106_adtrim_clean_qtrim_rev.derep.fastq -o 107106_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107107_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107107_adtrim_clean_qtrim_rev.derep.fastq -o 107107_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107108_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107108_adtrim_clean_qtrim_rev.derep.fastq -o 107108_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107110_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107110_adtrim_clean_qtrim_rev.derep.fastq -o 107110_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107111_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107111_adtrim_clean_qtrim_rev.derep.fastq -o 107111_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107112_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107112_adtrim_clean_qtrim_rev.derep.fastq -o 107112_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107113_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107113_adtrim_clean_qtrim_rev.derep.fastq -o 107113_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107114_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107114_adtrim_clean_qtrim_rev.derep.fastq -o 107114_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107115_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107115_adtrim_clean_qtrim_rev.derep.fastq -o 107115_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107116_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107116_adtrim_clean_qtrim_rev.derep.fastq -o 107116_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107117_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107117_adtrim_clean_qtrim_rev.derep.fastq -o 107117_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107119_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107119_adtrim_clean_qtrim_rev.derep.fastq -o 107119_bbnorm_mapped_reads
comics -- omics mapping --index-dir /DNA_data/work/mapping/bowtie2-index_bbnorm/ -a final.contigs.fixed.bbnorm.fa -f Sample_107121_adtrim_clean_qtrim_fwd.derep.fastq -r Sample_107121_adtrim_clean_qtrim_rev.derep.fastq -o 107121_bbnorm_mapped_reads
DNA.data.bbnorm.short.read.mapping.pbs (“/DNA_data/work/”) 11:51 (hr:min)
sorted.bam file, which contains the indexCreate Anvi’o Database and Profiles (Smith: Anvi’o)
.bam files into an Anvi’o profile{Terminal}
cd /DNA_data/work/co_assembly
anvi-gen-contigs-database -f final.contigs.fixed.bbnorm.fa -o DNA.data.bbnorm.anvio.contigs.db -n DNA_data_bbnorm --split-length 10000 --kmer-size 4
DNA.data.bbnorm.anvio.contigs.database.pbs (“/DNA_data/work/”) 05:00 (hr:min)
2.5.1 Annotate Contigs Database – HMMS
{Terminal}
cd /DNA_data/work/co_assembly
anvi-run-hmms -c DNA_data_bbnorm_anvio_contigs.db --num-threads 20
DNA.data.bbnorm.anvio.contigs.hmms.pbs (“/DNA_data/work/”) 16:52 (hr:min)
2.5.2 Annotate Contigs Database – GhostKoala
2.5.2.1 First, exporte the amino acid sequences of the Anvi’o gene calls:
{Terminal}
cd /DNA_data/work/co_assembly
anvi-get-sequences-for-gene-calls -c DNA_data_bbnorm_anvio_contigs.db -o DNA_data_bbnorm_anvio_gene_calls.faa --get-aa-sequences
DNA.data.bbnorm.anvio.export.aa.seqs.pbs (“/DNA_data/work/”) 00:10 (hr:min)
2.5.2.2 GhostKoala cannot accept gene IDs that start with numbers, so the prefix “genecall” was added to all AA sequences as follows:
{Terminal}
sed 's/>/>genecall_/g' DNA_data_bbnorm_anvio_gene_calls.faa > DNA_data_bbnorm_anvio_gene_calls_edited.faa
mv DNA_data_bbnorm_anvio_gene_calls_edited.faa DNA_data_bbnorm_anvio_gene_calls.faa
#EXAMPLE
[co_assembly]$ head DNA_data_bbnorm_anvio_gene_calls.faa
>genecall_0
HALALQRIWQLTPEPQRPEIRARLQALYPRLVKWHRYLATQRDPEASGLVTVYHPWESTDNSPRWDRSLARIEVVKPRPYTRLDTRQVRDPSQRPTDWDY
2.5.2.3 Check the “DNA_data_bbnorm_anvio_gene_calls.faa” file for empty sequences and remove them prior to submitting to GhostKoala for annotation
{Terminal}
cd /DNA_data/work/co_assembly
tail DNA_data_bbnorm_anvio_gene_calls.faa
#EXAMPLE
[co_assembly]$ tail DNA_subset_anvio_gene_calls.faa
genecall_3917613
MRVXSXLRVNAGGVDKSCKSRGPARATGRRGSNTXVRPLXSGIASRKVGXFPIVPQGXRFFALGAACVVSASWXGNGLPRLRRLGELRGXPPPMELRHGPYSYGRLQTRIFRNGRKSDGA
TPRDRXSASVRKDLLSGTKFIDGTRXIRGSXLCASRSGNTEAPSITRNHWAXRVSRRPYXSLVKSVGSTCGLRGKRYGSRAXEVHGTHGGGVKSVDIMGNTKGEGNALVRSXRSNTKARV
ANGIRYPGSPGPQRCSLVLRSIDPLXDXANAVSEPPGXYDRKIKTQRNRRGLAQAVDYAAXFVGKRRTLPRLEIQLHPTXKGRSLRGCWTGDAWSSSVRGLNCSLQWGNERNPRCLLPIW
PGETAPSRERKVGMTPDQHVPLIPWAARIIQWSVQQVAMPXGGANPYKTDPSTDXGLQPTSXSRNRXXSRVSQTGVNTFSSLVLTARQVKGVGGARRTASRSTTANSMTRTKSXQGTGSG
SCSWNIS
>genecall_3917614
>genecall_3917615
{Terminal}
sed '/^>/ {N; /\n$/d}' DNA_data_bbnorm_anvio_gene_calls.faa > DNA_data_bbnorm_anvio_gene_calls_clean.faa
mv DNA_data_bbnorm_anvio_gene_calls_clean.faa DNA_data_bbnorm_anvio_gene_calls.faa
#EXAMPLE
[co_assembly]$ tail DNA_data_bbnorm_anvio_gene_calls.faa
>genecall_2337739
AHLTRLETRTKEFNMRASRWDEKPAGRVKASHSVTLCIGEIPSLVRKHEPGASPAHHKLTWLWWSLSVHVETRKMVNYAWAGXSQGKLWWRSVAILTCKSIVXPGYRGERLIEPSSSWFP
PKFPSGXLALGNAVLPGKANDXRPWGRNDLNLFSNFKWVRSPACLNAKLKSGRWIRAPSGPLLVSRTGAVGXTERXVMAPDETLMRSQKRCWLLXTAGRWPWKSEPAKECVTTHLPKQLA
LKMDGARASGLYSAVAAVNKXNSHAATSRRVAAVASKAYGREPGWSRRRCRSWWXXQILKREPXGLKWRRVPCEQQLNMGQSILRYRRKPFRSGVDSVIYWRWLLXLASRLFRSIPPHLP
KGNRVNIPEPGIGYDAGGGRAHLRHSCRATNAATQMNSETLTRALERVVFSSXGTGSLESDHPAIGRWCPXSTAVLAVSGAHVSVLETPRESVIFVPDRTHIRSRSPRXIASRPLEACRX
GKSAKWICNFGKKIGSEGWAGRAAVGSCRGFGGGPVLWLWSRRESAVYCGRARTSAGAXQWTAVAALPRGVFSVSATSFGAXQPTQNWHGPGESDCLIKTKHCDGWXSVLTQCDFCPVLX
MSKXRNSTKRGXTAGVTMTLLRXPNASSSNXXRAXMDQRDSHCPYLLSSETTAKGTGLAESAGKEDPVELDSSLTRXGDMRGVAXVGGSASCRQXNTTTFIVSLLIQXCGVWLLVARKNR
GAATILALTRTDLFRPRIRSDDSVRRGVXLGRYICQTVTQVSXGKLSEDRNLAXSKRAKACLILIFSTNTDRESVAYRSFXLXEFXARGVRKVTTGITGLWRPSVHSDVAFXSFDVGSSY
HCDAAVAKRRIVHPLIGNVSWVXTVVRQVSFTLLM
>genecall_2337745
QANKCMWWMPWRSQAMKDVVACEKSRGASKRALIREYPNGETRPARVTPGXIHRLGRANSVNXNISVTEGTEINRDSGSSGERTRSSQQVLALLLAKRSGKAGQSGXXPRTQKQWCGTRP
ATSRAGHVKSCLKMGGPSSKAKYSXSTDSEPVPXGKGEKNPGRGVKXILKPHAYKKSEPLLYXSSRVLSRSDARAARLKRLIDEIDPLVASLTXKIRVVLGIDDRSVGDALRVAQCGNTA
SLSRASKPXGLRSEGQXSRGDGVPFVXWVSDLHSVASLSESAGAGKPSPNRANELLGVDPKPDDLLMARMKVRXHALEVRTHXCXKIRGXAVGRGERLNKSGNSWFSPKTIXVVPHVLLL
GVEHCYGXGVMATYQTIANSEYWQVXAWETVTGCXRPVSRGKQPGPPAKVPKYSXVGNEVGRRKQSGGWLRSSHPLKKAXXLTDRVVLRGRCNGAQTRHRSXGC
2.5.2.4 GhostKoala also has a max file size limit of 300 MB, so files larger than the max file size were split into smaller files and the annotations were run in batches with the results concatenated prior to being added into Anvi’o.
To split the data into smaller files:
2.5.2.5 Submit the amino acid files for both functional and taxonomic annotation in KEGG GhostKoala
2.5.3 Import Functional & Taxonomic Annotations into the Contigs Database
2.5.3.1 Parse the functional annotations into a format readable by Anvi’o
{Terminal}
cd /DNA_data/work/co_assembly
python2 KEGG-to-anvio --KeggDB KO_Orthology_ko00001.txt -i user_ko.txt -o KeggAnnotations.txt
2.5.3.2 Import the functional annotations into the contigs database:
{Terminal}
cd /DNA_data/work/co_assembly
anvi-import-functions -c DNA_data_bbnorm_anvio_contigs.db -i KeggAnnotations.txt
#Output
Gene functions ...............................: 1389518 function calls from 1 sources for 1389518 unique gene calls has been added to the contigs database.
2.5.3.3 Parse the taxonomy annotations into a format readable by Anvi’o
{Terminal}
cd /DNA_subset/work/co_assembly
python GhostKOALA-taxonomy-to-anvio user.out.top KeggTaxonomy.txt
2.5.3.4 Import the taxonomic annotations into the contigs database
{Terminal}
cd /DNA_data/work/co_assembly
anvi-import-taxonomy-for-genes -c DNA_data_bbnorm_anvio_contigs.db -p default_matrix -i KeggTaxonomy.txt
#Output
Total num hits found .........................: 3,908,828
Num gene caller ids in the db ................: 3,917,616
Num gene caller ids in the incoming data .....: 3,908,828
Taxon names table ............................: Updated with 1672 unique taxon names
Genes taxonomy table .........................: Taxonomy stored for 3908828 gene calls
Splits taxonomy ..............................: Input data from "default_matrix" annotated 2458900 of 2492445 splits (98.7%) with taxonomy.
2.5.4 Import Sample Mapping Information into an Anvi’o Profile
BAM files are already sorted and indexed (i.e., for each “.bam” file there is also a “.bam.bai” file in the same directory) – Skip the initializing step and create an Anvi’o profile for the samples2.5.4.1 Move the following files to the “/DNA_data/work/mapping” directory
2.5.4.2 Import the mapping information (per Sample) into separate (per Sample) Anvi’o profiles
{Terminal}
cd /DNA_data/work/mapping
anvi-profile -i Sample_107104.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107104_profile/ --sample-name Profile_107104 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107105.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107105_profile/ --sample-name Profile_107105 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107106.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107106_profile/ --sample-name Profile_107106 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107107.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107107_profile/ --sample-name Profile_107107 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107108.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107108_profile/ --sample-name Profile_107108 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107110.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107110_profile/ --sample-name Profile_107110 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107111.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107111_profile/ --sample-name Profile_107111 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107112.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107112_profile/ --sample-name Profile_107112 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107113.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107113_profile/ --sample-name Profile_107113 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107114.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107114_profile/ --sample-name Profile_107114 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107115.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107115_profile/ --sample-name Profile_107115 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107116.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107116_profile/ --sample-name Profile_107116 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107117.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107117_profile/ --sample-name Profile_107117 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107119.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107119_profile/ --sample-name Profile_107119 --min-contig-length 2500 --num-threads 20
anvi-profile -i Sample_107121.final.contigs.fixed.bbnorm.sorted.bam -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o anvio_107121_profile/ --sample-name Profile_107121 --min-contig-length 2500 --num-threads 20
DNA.data.bbnorm.anvio.mapping.profile.pbs (“/DNA_data/work/”) 05:16:51 (day:hr:min)
2.5.4.3 Merge the Sample profiles into a single Anvi’o profile database
{Terminal}
cd /DNA_data/work/mapping
anvi-merge anvio_107104_profile/PROFILE_107104.db anvio_107105_profile/PROFILE_107105.db anvio_107106_profile/PROFILE_107106.db anvio_107107_profile/PROFILE_107107.db anvio_107108_profile/PROFILE_107108.db anvio_107110_profile/PROFILE_107110.db anvio_107111_profile/PROFILE_107111.db anvio_107112_profile/PROFILE_107112.db anvio_107113_profile/PROFILE_107113.db anvio_107114_profile/PROFILE_107114.db anvio_107115_profile/PROFILE_107115.db anvio_107116_profile/PROFILE_107116.db anvio_107117_profile/PROFILE_107117.db anvio_107119_profile/PROFILE_107119.db anvio_107121_profile/PROFILE_107121.db -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -o merged_profile/
DNA.data.bbnorm.anvio.mapping.profile.merge.pbs (“/DNA_data/work/”) 01:45 (hr:min)
NOTE: After the profiles were merged, Anvi’o automatically started binning the contigs using CONCOCT. It also attempted to cluster the contigs using its own hierarchical clustering algorithm, but it did NOT generate bins from this clustering information; they were manually drawn later. The clustering algorithm was limited to datasets with 20,000 splits or fewer (which disqualifies most complex assemblies).
Binning Using Multiple Binning Software Algorithms (Smith: CONCOCT, MetaBAT, DASTool)
The next step is to export the split sequences and their differential coverage across all samples from the merged profile and use these splits in a series of binning algorithms external to Anvi’o.
2.6.1 Export Split Sequences and Differential Coverage Data for Binning
{Terminal}
cd /DNA_data/work/mapping
anvi-export-splits-and-coverages -p merged_profile/PROFILE.db -o merged_profile/ -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -O DNA_data_bbnorm_assembly --splits-mode
00:05 (hr:min)
2.6.2 Binning Splits with MetaBAT
{Terminal}
cd /DNA_data/work/mapping/merged_profile
comics -- metabat -i DNA_data_bbnorm_assembly-SPLITS.fa -o Bin -a DNA_data_bbnorm_assembly-COVs.txt --minContig 2500 --cvExt -t 20 --saveCls --onlyLabel -v
00:22 (hr:min)
2.6.2.1 Now make the “Metabat_binning_results.txt” file from the MetaBAT output to save it as a collection in Anvi’o
{Terminal}
cd /DNA_data/work/binning/metabat
sh /DNA_data/work/Metabat_to_anvio_parser.sh
00:01 (hr:min)
2.6.2.2 Next, import this information into Anvi’o, saving it as the MetaBAT bin collection:
{Terminal}
cd /DNA_data/work/binning/metabat
anvi-import-collection -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -p /DNA_data/work/mapping/merged_profile/PROFILE.db -C METABAT Metabat_binning_results.txt
#Output
Item names in input ..........................: 112,957
Num bins in input ............................: 165
Items in profile database ....................: 131,074
Item names shared between input and db .......: 112,957
WARNING
===============================================
18117 item(s) that were in the database, but were not in the input file, will
not be described by any bin in the collection METABAT. That is totally fine, but
anvi'o hopes that you are aware of that. This means you have more things in your
database than the number of things your input file describes. Here is an example
of something that is in your database but not in any bin in your input file:
k255_691935_flag_1_multi_13.0000_len_2643_split_00001.
Collections ..................................: The collection "METABAT" that describes 112,957 splits and 165 bins has been successfully added to the database at "/DNA_data/work/mapping/merged_profile/PROFILE.db". Here is a list of the first 50 bin names in this collection: Bin_1,Bin_10,Bin_100,Bin_101,Bin_102,Bin_103,Bin_104,Bin_105,Bin_106,Bin_107,Bin_108,Bin_109,Bin_11,Bin_110,Bin_111,Bin_112,Bin_113,Bin_114,Bin_115,Bin_116,Bin_117,Bin_118,Bin_119,Bin_12,Bin_120,Bin_121,Bin_122,Bin_123,Bin_124,Bin_125,Bin_126,Bin_127,Bin_128,Bin_129,Bin_13,Bin_130,Bin_131,Bin_132,Bin_133,Bin_134,Bin_135,Bin_136,Bin_137,Bin_138,Bin_139,Bin_14,Bin_140,Bin_141,Bin_142,Bin_143
00:01 (hr:min)
2.6.3 Enter this command to view the different collections in the profile database:
{Terminal}
cd /DNA_data/work/mapping/merged_profile
anvi-show-collections-and-bins -p PROFILE.db
#Output
Collection: "CONCOCT"
===============================================
Collection ID ................................: CONCOCT
Number of bins ...............................: 74
Number of splits described ...................: 131,074
Bin names ....................................: Bin_1, Bin_10, Bin_11, Bin_12, Bin_13, Bin_14, Bin_15, Bin_16, Bin_17, Bin_18, Bin_19, Bin_2, Bin_20, Bin_21, Bin_22, Bin_23, Bin_24, Bin_25, Bin_26, Bin_27, Bin_28, Bin_29, Bin_3, Bin_30, Bin_31, Bin_32, Bin_33, Bin_34, Bin_35, Bin_36, Bin_37, Bin_38, Bin_39, Bin_4, Bin_40, Bin_41, Bin_42, Bin_43, Bin_44, Bin_45, Bin_46, Bin_47, Bin_48, Bin_49, Bin_5, Bin_50, Bin_51, Bin_52, Bin_53, Bin_54, Bin_55, Bin_56, Bin_57, Bin_58, Bin_59, Bin_6, Bin_60, Bin_61, Bin_62, Bin_63, Bin_64, Bin_65, Bin_66, Bin_67, Bin_68, Bin_69, Bin_7, Bin_70, Bin_71, Bin_72, Bin_73, Bin_74, Bin_8, Bin_9
Collection: "METABAT"
===============================================
Collection ID ................................: METABAT
Number of bins ...............................: 165
Number of splits described ...................: 112,957
Bin names ....................................: Bin_1, Bin_10, Bin_100, Bin_101, Bin_102, Bin_103, Bin_104, Bin_105, Bin_106, Bin_107, Bin_108, Bin_109, Bin_11, Bin_110, Bin_111, Bin_112, Bin_113, Bin_114, Bin_115, Bin_116, Bin_117, Bin_118, Bin_119, Bin_12, Bin_120, Bin_121, Bin_122, Bin_123, Bin_124, Bin_125, Bin_126, Bin_127, Bin_128, Bin_129, Bin_13, Bin_130, Bin_131, Bin_132, Bin_133, Bin_134, Bin_135, Bin_136, Bin_137, Bin_138, Bin_139, Bin_14, Bin_140, Bin_141, Bin_142, Bin_143, Bin_144, Bin_145, Bin_146, Bin_147, Bin_148, Bin_149, Bin_15, Bin_150, Bin_151, Bin_152, Bin_153, Bin_154, Bin_155, Bin_156, Bin_157, Bin_158, Bin_159, Bin_16, Bin_160, Bin_161, Bin_162, Bin_163, Bin_164, Bin_165, Bin_17, Bin_18, Bin_19, Bin_2, Bin_20, Bin_21, Bin_22, Bin_23, Bin_24, Bin_25, Bin_26, Bin_27, Bin_28, Bin_29, Bin_3, Bin_30, Bin_31, Bin_32, Bin_33, Bin_34, Bin_35, Bin_36, Bin_37, Bin_38, Bin_39, Bin_4, Bin_40, Bin_41, Bin_42, Bin_43, Bin_44, Bin_45, Bin_46, Bin_47, Bin_48, Bin_49, Bin_5, Bin_50, Bin_51, Bin_52, Bin_53, Bin_54, Bin_55, Bin_56, Bin_57, Bin_58, Bin_59, Bin_6, Bin_60, Bin_61, Bin_62, Bin_63, Bin_64, Bin_65, Bin_66, Bin_67, Bin_68, Bin_69, Bin_7, Bin_70, Bin_71, Bin_72, Bin_73, Bin_74, Bin_75, Bin_76, Bin_77, Bin_78, Bin_79, Bin_8, Bin_80, Bin_81, Bin_82, Bin_83, Bin_84, Bin_85, Bin_86, Bin_87, Bin_88, Bin_89, Bin_9, Bin_90, Bin_91, Bin_92, Bin_93, Bin_94, Bin_95, Bin_96, Bin_97, Bin_98, Bin_99
2.6.4 DASTools
{Terminal}
cd /DNA_data/work/binning/concoct
anvi-export-collection -p /DNA_data/work/mapping/merged_profile/PROFILE.db -C CONCOCT -O CONCOCT_binning_results
{Terminal}
cd /DNA_data/work/binning/dastool
comics das-tool -i CONCOCT_binning_results.txt,Metabat_binning_results.txt -l concoct,metabat -c /DNA_data/work/mapping/merged_profile/DNA_data_bbnorm_assembly-SPLITS.fa -o DASTool_bins --search_engine blast --threads 20
00:20 (hr:min)
{Terminal}
sed 's/results\./results_/g' DASTool_bins_DASTool_scaffolds2bin.txt > DASTool_binning_results.txt
sed 's/updated\./updated_/g' DASTool_binning_results.txt > DASTool_binning_results2.txt
mv DASTool_binning_results2.txt DASTool_binning_results.txt
{Terminal}
cd /DNA_data/work/binning/dastool
anvi-import-collection -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -p /DNA_data/work/mapping/merged_profile/PROFILE.db -C DASTool DASTool_binning_results.txt
#Output DASTool
Item names in input ..........................: 31,479
Num bins in input ............................: 58
Items in profile database ....................: 131,074
Item names shared between input and db .......: 31,479
WARNING
===============================================
99595 item(s) that were in the database, but were not in the input file, will
not be described by any bin in the collection DASTool. That is totally fine, but
anvi'o hopes that you are aware of that. This means you have more things in your
database than the number of things your input file describes. Here is an example
of something that is in your database but not in any bin in your input file:
k255_2221878_flag_1_multi_9.0000_len_4003_split_00001.
Collections ..................................: The collection "DASTool" that describes 31,479 splits and 58 bins has been successfully added to the database at "/DNA_data/work/mapping/merged_profile/PROFILE.db". Here is a list of the first 50 bin names in this collection: Metabat_binning_results_011,Metabat_binning_results_090,Metabat_binning_results_136,Metabat_binning_results_071,Metabat_binning_results_027,Metabat_binning_results_163,Metabat_binning_results_142,Metabat_binning_results_074,Metabat_binning_results_031,Metabat_binning_results_106,Metabat_binning_results_056,Metabat_binning_results_089,Metabat_binning_results_078,CONCOCT_binning_results_070,Metabat_binning_results_046,CONCOCT_binning_results_062,Metabat_binning_results_064,Metabat_binning_results_110,Metabat_binning_results_083,Metabat_binning_results_159,Metabat_binning_results_132,Metabat_binning_results_055,Metabat_binning_results_073,Metabat_binning_results_080,Metabat_binning_results_160,Metabat_binning_results_081,CONCOCT_binning_results_023,Metabat_binning_results_044,Metabat_binning_results_153,CONCOCT_binning_results_060,Metabat_binning_results_161,Metabat_binning_results_148,Metabat_binning_results_060,Metabat_binning_results_068,Metabat_binning_results_137,Metabat_binning_results_032,Metabat_binning_results_145,Metabat_binning_results_033,Metabat_binning_results_165,CONCOCT_binning_results_053,CONCOCT_binning_results_055,Metabat_binning_results_029,Metabat_binning_results_036,Metabat_binning_results_088,Metabat_binning_results_069,CONCOCT_binning_results_056,Metabat_binning_results_014,Metabat_binning_results_024,Metabat_binning_results_023,CONCOCT_binning_results_071.
{Terminal}
cd /DNA_data/work/binning/dastool
anvi-script-get-collection-info -c /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db -p /DNA_data/work/mapping/merged_profile/PROFILE.db -C DASTool -o DNA_data_bbnorm_DASTool_summary.txt
#Output
Contigs DB ...................................: Initialized: /DNA_data/work/co_assembly/DNA_data_bbnorm_anvio_contigs.db (v. 12)
* Completion and redundancy estimates. PC: Percent completion; PR: Percent
redundancy; N: Number of splits; L: Length (total number of nucleotides); D:
Domain for single-copy core genes; C: Domain confidence.
Bins in collection "DASTool"
===============================================
Metabat_binning_results_011 :: PC: 93.53%, PR: 3.60%, N: 352, L: 3,138,665, D: bacteria (C: 0.90)
Metabat_binning_results_090 :: PC: 98.56%, PR: 6.47%, N: 1,023, L: 6,681,655, D: bacteria (C: 0.92)
Metabat_binning_results_136 :: PC: 91.36%, PR: 10.49%, N: 221, L: 1,613,215, D: archaea (C: 0.81)
Metabat_binning_results_071 :: PC: 97.12%, PR: 0.72%, N: 378, L: 2,723,137, D: bacteria (C: 0.96)
Metabat_binning_results_027 :: PC: 88.49%, PR: 3.60%, N: 382, L: 2,641,710, D: bacteria (C: 0.85)
Metabat_binning_results_163 :: PC: 89.93%, PR: 6.47%, N: 1,093, L: 5,731,451, D: bacteria (C: 0.83)
Metabat_binning_results_142 :: PC: 80.58%, PR: 3.60%, N: 575, L: 3,918,295, D: bacteria (C: 0.77)
Metabat_binning_results_074 :: PC: 89.93%, PR: 10.07%, N: 758, L: 3,672,582, D: bacteria (C: 0.80)
Metabat_binning_results_031 :: PC: 92.81%, PR: 9.35%, N: 540, L: 2,714,627, D: bacteria (C: 0.83)
Metabat_binning_results_106 :: PC: 87.77%, PR: 2.16%, N: 426, L: 2,639,223, D: bacteria (C: 0.86)
Metabat_binning_results_056 :: PC: 90.65%, PR: 3.60%, N: 295, L: 2,133,681, D: bacteria (C: 0.87)
Metabat_binning_results_089 :: PC: 95.68%, PR: 2.88%, N: 622, L: 4,758,373, D: bacteria (C: 0.93)
Metabat_binning_results_078 :: PC: 84.89%, PR: 5.76%, N: 463, L: 3,046,540, D: bacteria (C: 0.79)
CONCOCT_binning_results_070 :: PC: 82.73%, PR: 2.16%, N: 286, L: 1,851,259, D: bacteria (C: 0.81)
Metabat_binning_results_046 :: PC: 88.89%, PR: 9.26%, N: 206, L: 1,536,975, D: archaea (C: 0.80)
CONCOCT_binning_results_062 :: PC: 82.01%, PR: 0.72%, N: 160, L: 1,468,490, D: bacteria (C: 0.81)
Metabat_binning_results_064 :: PC: 94.96%, PR: 5.04%, N: 537, L: 3,258,934, D: bacteria (C: 0.90)
Metabat_binning_results_110 :: PC: 74.82%, PR: 7.91%, N: 298, L: 1,408,896, D: bacteria (C: 0.67)
Metabat_binning_results_083 :: PC: 82.01%, PR: 6.47%, N: 331, L: 2,095,666, D: bacteria (C: 0.76)
Metabat_binning_results_159 :: PC: 90.65%, PR: 9.35%, N: 967, L: 5,724,953, D: bacteria (C: 0.81)
Metabat_binning_results_132 :: PC: 85.61%, PR: 5.04%, N: 939, L: 6,046,501, D: bacteria (C: 0.81)
Metabat_binning_results_055 :: PC: 83.45%, PR: 10.07%, N: 882, L: 6,560,778, D: bacteria (C: 0.73)
Metabat_binning_results_073 :: PC: 76.98%, PR: 5.04%, N: 375, L: 2,246,106, D: bacteria (C: 0.72)
Metabat_binning_results_080 :: PC: 71.94%, PR: 3.60%, N: 425, L: 2,414,552, D: bacteria (C: 0.68)
Metabat_binning_results_160 :: PC: 81.29%, PR: 12.95%, N: 1,198, L: 6,308,300, D: bacteria (C: 0.68)
Metabat_binning_results_081 :: PC: 92.09%, PR: 28.78%, N: 410, L: 2,897,091, D: bacteria (C: 0.63)
CONCOCT_binning_results_023 :: PC: 82.01%, PR: 4.32%, N: 251, L: 2,614,541, D: bacteria (C: 0.78)
Metabat_binning_results_044 :: PC: 74.10%, PR: 5.04%, N: 921, L: 4,823,662, D: bacteria (C: 0.69)
Metabat_binning_results_153 :: PC: 95.68%, PR: 23.02%, N: 500, L: 4,519,044, D: bacteria (C: 0.73)
CONCOCT_binning_results_060 :: PC: 64.03%, PR: 3.60%, N: 1,040, L: 6,178,783, D: bacteria (C: 0.60)
Metabat_binning_results_161 :: PC: 74.82%, PR: 7.19%, N: 578, L: 2,580,286, D: bacteria (C: 0.68)
Metabat_binning_results_148 :: PC: 74.10%, PR: 2.16%, N: 629, L: 2,786,980, D: bacteria (C: 0.72)
Metabat_binning_results_060 :: PC: 74.82%, PR: 10.79%, N: 1,144, L: 7,848,862, D: bacteria (C: 0.64)
Metabat_binning_results_068 :: PC: 95.68%, PR: 20.14%, N: 315, L: 2,011,285, D: bacteria (C: 0.76)
Metabat_binning_results_137 :: PC: 78.42%, PR: 5.04%, N: 627, L: 2,870,570, D: bacteria (C: 0.73)
Metabat_binning_results_032 :: PC: 83.45%, PR: 11.51%, N: 403, L: 2,158,177, D: bacteria (C: 0.72)
Metabat_binning_results_145 :: PC: 61.15%, PR: 0.72%, N: 292, L: 1,433,337, D: bacteria (C: 0.60)
Metabat_binning_results_033 :: PC: 65.47%, PR: 12.23%, N: 447, L: 1,848,022, D: bacteria (C: 0.53)
Metabat_binning_results_165 :: PC: 84.17%, PR: 9.35%, N: 742, L: 5,587,269, D: bacteria (C: 0.75)
CONCOCT_binning_results_053 :: PC: 76.26%, PR: 5.04%, N: 690, L: 3,224,356, D: bacteria (C: 0.71)
CONCOCT_binning_results_055 :: PC: 51.08%, PR: 1.44%, N: 639, L: 2,558,984, D: bacteria (C: 0.50)
Metabat_binning_results_029 :: PC: 72.66%, PR: 5.04%, N: 431, L: 2,187,888, D: bacteria (C: 0.68)
Metabat_binning_results_036 :: PC: 97.12%, PR: 28.06%, N: 940, L: 5,554,480, D: bacteria (C: 0.69)
Metabat_binning_results_088 :: PC: 47.48%, PR: 0.72%, N: 137, L: 581,347, D: bacteria (C: 0.47)
Metabat_binning_results_069 :: PC: 55.40%, PR: 0.00%, N: 208, L: 1,986,284, D: bacteria (C: 0.55)
CONCOCT_binning_results_056 :: PC: 69.06%, PR: 5.04%, N: 1,111, L: 5,337,967, D: bacteria (C: 0.64)
Metabat_binning_results_014 :: PC: 65.47%, PR: 5.76%, N: 445, L: 1,921,245, D: bacteria (C: 0.60)
Metabat_binning_results_024 :: PC: 61.15%, PR: 4.32%, N: 623, L: 2,618,168, D: bacteria (C: 0.57)
Metabat_binning_results_023 :: PC: 56.12%, PR: 2.16%, N: 441, L: 2,195,523, D: bacteria (C: 0.54)
CONCOCT_binning_results_071 :: PC: 54.68%, PR: 4.32%, N: 228, L: 1,652,688, D: bacteria (C: 0.50)
Metabat_binning_results_070 :: PC: 62.59%, PR: 2.88%, N: 435, L: 1,788,235, D: bacteria (C: 0.60)
Metabat_binning_results_098 :: PC: 37.41%, PR: 2.88%, N: 536, L: 1,991,384, D: bacteria (C: 0.35)
Metabat_binning_results_051 :: PC: 83.45%, PR: 7.91%, N: 248, L: 1,789,108, D: bacteria (C: 0.76)
Metabat_binning_results_164 :: PC: 60.43%, PR: 1.44%, N: 338, L: 1,807,394, D: bacteria (C: 0.59)
Metabat_binning_results_012 :: PC: 43.88%, PR: 5.76%, N: 173, L: 774,389, D: bacteria (C: 0.38)
Metabat_binning_results_155 :: PC: 49.64%, PR: 3.60%, N: 671, L: 2,670,480, D: bacteria (C: 0.46)
Metabat_binning_results_043 :: PC: 38.85%, PR: 2.16%, N: 474, L: 1,863,523, D: bacteria (C: 0.37)
Metabat_binning_results_085 :: PC: 66.19%, PR: 10.79%, N: 680, L: 4,255,822, D: bacteria (C: 0.55)
Output .......................................: DNA_data_bbnorm_DASTool_summary.txt
{Terminal}
Bins in collection "DASTool" to keep (PC > 70% | PR < 10%) -- 30 Bins
==========================================================
Metabat_binning_results_011 :: PC: 93.53%, PR: 3.60%, N: 352, L: 3,138,665, D: bacteria (C: 0.90)
Metabat_binning_results_090 :: PC: 98.56%, PR: 6.47%, N: 1,023, L: 6,681,655, D: bacteria (C: 0.92)
Metabat_binning_results_071 :: PC: 97.12%, PR: 0.72%, N: 378, L: 2,723,137, D: bacteria (C: 0.96)
Metabat_binning_results_027 :: PC: 88.49%, PR: 3.60%, N: 382, L: 2,641,710, D: bacteria (C: 0.85)
Metabat_binning_results_163 :: PC: 89.93%, PR: 6.47%, N: 1,093, L: 5,731,451, D: bacteria (C: 0.83)
Metabat_binning_results_142 :: PC: 80.58%, PR: 3.60%, N: 575, L: 3,918,295, D: bacteria (C: 0.77)
Metabat_binning_results_031 :: PC: 92.81%, PR: 9.35%, N: 540, L: 2,714,627, D: bacteria (C: 0.83)
Metabat_binning_results_106 :: PC: 87.77%, PR: 2.16%, N: 426, L: 2,639,223, D: bacteria (C: 0.86)
Metabat_binning_results_056 :: PC: 90.65%, PR: 3.60%, N: 295, L: 2,133,681, D: bacteria (C: 0.87)
Metabat_binning_results_089 :: PC: 95.68%, PR: 2.88%, N: 622, L: 4,758,373, D: bacteria (C: 0.93)
Metabat_binning_results_078 :: PC: 84.89%, PR: 5.76%, N: 463, L: 3,046,540, D: bacteria (C: 0.79)
CONCOCT_binning_results_070 :: PC: 82.73%, PR: 2.16%, N: 286, L: 1,851,259, D: bacteria (C: 0.81)
Metabat_binning_results_046 :: PC: 88.89%, PR: 9.26%, N: 206, L: 1,536,975, D: archaea (C: 0.80)
CONCOCT_binning_results_062 :: PC: 82.01%, PR: 0.72%, N: 160, L: 1,468,490, D: bacteria (C: 0.81)
Metabat_binning_results_064 :: PC: 94.96%, PR: 5.04%, N: 537, L: 3,258,934, D: bacteria (C: 0.90)
Metabat_binning_results_110 :: PC: 74.82%, PR: 7.91%, N: 298, L: 1,408,896, D: bacteria (C: 0.67)
Metabat_binning_results_083 :: PC: 82.01%, PR: 6.47%, N: 331, L: 2,095,666, D: bacteria (C: 0.76)
Metabat_binning_results_159 :: PC: 90.65%, PR: 9.35%, N: 967, L: 5,724,953, D: bacteria (C: 0.81)
Metabat_binning_results_132 :: PC: 85.61%, PR: 5.04%, N: 939, L: 6,046,501, D: bacteria (C: 0.81)
Metabat_binning_results_073 :: PC: 76.98%, PR: 5.04%, N: 375, L: 2,246,106, D: bacteria (C: 0.72)
Metabat_binning_results_080 :: PC: 71.94%, PR: 3.60%, N: 425, L: 2,414,552, D: bacteria (C: 0.68)
CONCOCT_binning_results_023 :: PC: 82.01%, PR: 4.32%, N: 251, L: 2,614,541, D: bacteria (C: 0.78)
Metabat_binning_results_044 :: PC: 74.10%, PR: 5.04%, N: 921, L: 4,823,662, D: bacteria (C: 0.69)
Metabat_binning_results_161 :: PC: 74.82%, PR: 7.19%, N: 578, L: 2,580,286, D: bacteria (C: 0.68)
Metabat_binning_results_148 :: PC: 74.10%, PR: 2.16%, N: 629, L: 2,786,980, D: bacteria (C: 0.72)
Metabat_binning_results_137 :: PC: 78.42%, PR: 5.04%, N: 627, L: 2,870,570, D: bacteria (C: 0.73)
Metabat_binning_results_165 :: PC: 84.17%, PR: 9.35%, N: 742, L: 5,587,269, D: bacteria (C: 0.75)
CONCOCT_binning_results_053 :: PC: 76.26%, PR: 5.04%, N: 690, L: 3,224,356, D: bacteria (C: 0.71)
Metabat_binning_results_029 :: PC: 72.66%, PR: 5.04%, N: 431, L: 2,187,888, D: bacteria (C: 0.68)
Metabat_binning_results_051 :: PC: 83.45%, PR: 7.91%, N: 248, L: 1,789,108, D: bacteria (C: 0.76)
2.6.5 DASTools MAGs Taxonomy
{Terminal}
# Downloading taxonomy search databases (only need to do once)
anvi-setup-scg-databases
# Update contigs database from v12 to v14 (necessary for assigning taxonomy)
anvi-migrate /2017_Redox/MAGs_DNA/DNA_data_bbnorm_anvio_contigs.db
# Update profile from v30 to v31 (necessary for assigning taxonomy)
anvi-migrate /2017_Redox/MAGs_DNA/PROFILE.db
# Populate tables with HMM hits (since they will be removed when updating the contigs db from v12 to v14 -- see above)
anvi-run-hmms -c /2017_Redox/MAGs_DNA/DNA_data_bbnorm_anvio_contigs.db
# Populating contigs db with SCG taxonomy
anvi-run-scg-taxonomy -c /2017_Redox/MAGs_DNA/DNA_data_bbnorm_anvio_contigs.db
# Estimate taxonomy based on single-copy core genes (SCG) within each MAG
anvi-estimate-scg-taxonomy -c /2017_Redox/MAGs_DNA/DNA_data_bbnorm_anvio_contigs.db \
-p /2017_Redox/MAGs_DNA/PROFILE.db \
-C DASTool \
--compute-scg-coverages \
--output-file /MAGs.taxa.output.txt
### Output
Contigs DB ...................................: /gpfs/accounts/gwk_root/gwk1/kjromano/2017_Redox/MAGs_DNA/DNA_data_bbnorm_anvio_contigs.db
Profile DB ...................................: /gpfs/accounts/gwk_root/gwk1/kjromano/2017_Redox/MAGs_DNA/PROFILE.db
Metagenome mode ..............................: False
WARNING
===============================================
Please note that anvi'o found 2,464,202 splits in your contigs database. But
only 131,074 of them appeared to be in the profile database. As a result, anvi'o
will now remove the 2,333,128 splits that occur only in the contigs db from all
downstream analyses here (if you didn't use the flag `--compute-scg-coverages`
this wouldn't have been necessary, but with the current settings this is really
the best for everyone). Where is this difference coming from though? Well. This
is often the case because the 'minimum contig length parameter' set during the
`anvi-profile` step can exclude many contigs from downstream analyses (often for
good reasons, too). For instance, in your case the minimum contig length goes as
low as 256 nts in your contigs database. Yet, the minimum contig length set in
the profile databaes is 2,500 nts. Hence the difference. Anvi'o hopes that this
explaines some things.
* 31,479 split names associated with 58 bins of in collection 'DASTool' have been
successfully recovered 🎊
* Anvi'o will now attempt to recover SCG coverages in GENOME MODE from the profile
database, which contains 15 samples.
FRIENDLY REMINDER
===============================================
Anvi'o has just finished recovering SCG coverages from the profile database to
estimate the average coverage of your bins across your samples. Please note that
anvi'o SCG taxonomy framework is using only %d SCGs to estimate taxonomy. Which
means, even a highly complete bin may be missing all of them. In which case, the
coverage of that bin will be `None` across all your samples. The best way to
prevent any misleading insights is take these results with a huge grain of salt,
and use the `anvi-summarize` output for critical applications.
Estimated taxonomy for collection "DASTool"
===============================================
Output file ..................................: /MAGs.taxa.output.txt
✓ anvi-estimate-scg-taxonomy took 0:02:23.726746
The following table is a summary of the DNA reads by sample and includes QC Reads, Mapped Reads, and Average Read Length (bp). All values were derived from the “Sample_*.final.contigs.fixed.bbnorm.sorted.bam” files using samtools stats.
| ID | Tundra | Time | Replicate | QC Reads | Mapped Reads | Average Length (bp) |
|---|---|---|---|---|---|---|
| S107107 | Tussock | T0 | 1 | 65,667,254 | 19,628,380 | 141 |
| S107108 | Tussock | T0 | 2 | 43,020,594 | 12,528,861 | 141 |
| S107116 | Tussock | T4 | 1 | 12,312,448 | 3,372,508 | 141 |
| S107117 | Tussock | T4 | 2 | 36,565,234 | 10,628,986 | 141 |
| S107119 | Tussock | T24 | 1 | 47,853,114 | 11,451,168 | 141 |
| S107121 | Tussock | T24 | 3 | 27,632,452 | 4,767,927 | 141 |
| S107104 | Wet Sedge | T0 | 1 | 34,568,010 | 15,458,362 | 139 |
| S107105 | Wet Sedge | T0 | 2 | 24,408,724 | 12,419,993 | 140 |
| S107106 | Wet Sedge | T0 | 3 | 25,734,316 | 7,180,169 | 141 |
| S107110 | Wet Sedge | T4 | 1 | 29,338,238 | 13,910,613 | 140 |
| S107111 | Wet Sedge | T4 | 2 | 41,991,856 | 17,433,511 | 141 |
| S107112 | Wet Sedge | T4 | 3 | 35,302,148 | 8,871,426 | 141 |
| S107113 | Wet Sedge | T24 | 1 | 36,956,594 | 14,967,853 | 141 |
| S107114 | Wet Sedge | T24 | 2 | 33,190,654 | 15,997,334 | 140 |
| S107115 | Wet Sedge | T24 | 3 | 57,805,778 | 12,782,304 | 141 |
Following all Data Bioinformatics steps above, quality-controlled metagenome reads were extracted using BBMap to generate pile-up files (average read depth per gene), and SAMtools was used to extract counts and CDS lengths from the BBMap output.
In this section, all DNA.pileup files are imported into the R environment and combined into the first FULL dataset to be further formatted for downstream statistical analysis.
The following DNA.pileup files (average read depth per gene) for each metagenome sample were generated via the BBMap module.
#S107104 -- WS_T0_R1
pileup_107104<-read.table("Pileup.Data/DNA.Pileup/Sample_107104_DNA.pileup", header = TRUE)
pileup_107104$S107104 <- pileup_107104$Plus_reads + pileup_107104$Minus_reads
#S107105 -- WS_T0_R2
pileup_107105<-read.table("Pileup.Data/DNA.Pileup/Sample_107105_DNA.pileup", header = TRUE)
pileup_107105$S107105 <- pileup_107105$Plus_reads + pileup_107105$Minus_reads
#S107106 -- WS_T0_R3
pileup_107106<-read.table("Pileup.Data/DNA.Pileup/Sample_107106_DNA.pileup", header = TRUE)
pileup_107106$S107106 <- pileup_107106$Plus_reads + pileup_107106$Minus_reads
#S107110 -- WS_T4_R1
pileup_107110<-read.table("Pileup.Data/DNA.Pileup/Sample_107110_DNA.pileup", header = TRUE)
pileup_107110$S107110 <- pileup_107110$Plus_reads + pileup_107110$Minus_reads
#S107111 -- WS_T4_R2
pileup_107111<-read.table("Pileup.Data/DNA.Pileup/Sample_107111_DNA.pileup", header = TRUE)
pileup_107111$S107111 <- pileup_107111$Plus_reads + pileup_107111$Minus_reads
#S107112 -- WS_T4_R3
pileup_107112<-read.table("Pileup.Data/DNA.Pileup/Sample_107112_DNA.pileup", header = TRUE)
pileup_107112$S107112 <- pileup_107112$Plus_reads + pileup_107112$Minus_reads
#S107113 -- WS_T24_R1
pileup_107113<-read.table("Pileup.Data/DNA.Pileup/Sample_107113_DNA.pileup", header = TRUE)
pileup_107113$S107113 <- pileup_107113$Plus_reads + pileup_107113$Minus_reads
#S107114 -- WS_T24_R2
pileup_107114<-read.table("Pileup.Data/DNA.Pileup/Sample_107114_DNA.pileup", header = TRUE)
pileup_107114$S107114 <- pileup_107114$Plus_reads + pileup_107114$Minus_reads
#S107115 -- WS_T24_R3
pileup_107115<-read.table("Pileup.Data/DNA.Pileup/Sample_107115_DNA.pileup", header = TRUE)
pileup_107115$S107115 <- pileup_107115$Plus_reads + pileup_107115$Minus_reads
#S107107 -- Tuss_T0_R1
pileup_107107<-read.table("Pileup.Data/DNA.Pileup/Sample_107107_DNA.pileup", header = TRUE)
pileup_107107$S107107 <- pileup_107107$Plus_reads + pileup_107107$Minus_reads
#S107108 -- Tuss_T0_R2
pileup_107108<-read.table("Pileup.Data/DNA.Pileup/Sample_107108_DNA.pileup", header = TRUE)
pileup_107108$S107108 <- pileup_107108$Plus_reads + pileup_107108$Minus_reads
#S107116 -- Tuss_T4_R1
pileup_107116<-read.table("Pileup.Data/DNA.Pileup/Sample_107116_DNA.pileup", header = TRUE)
pileup_107116$S107116 <- pileup_107116$Plus_reads + pileup_107116$Minus_reads
#S107117 -- Tuss_T4_R2
pileup_107117<-read.table("Pileup.Data/DNA.Pileup/Sample_107117_DNA.pileup", header = TRUE)
pileup_107117$S107117 <- pileup_107117$Plus_reads + pileup_107117$Minus_reads
#S107119 -- Tuss_T24_R1
pileup_107119<-read.table("Pileup.Data/DNA.Pileup/Sample_107119_DNA.pileup", header = TRUE)
pileup_107119$S107119 <- pileup_107119$Plus_reads + pileup_107119$Minus_reads
#S107121 -- Tuss_T24_R3
pileup_107121<-read.table("Pileup.Data/DNA.Pileup/Sample_107121_DNA.pileup", header = TRUE)
pileup_107121$S107121 <- pileup_107121$Plus_reads + pileup_107121$Minus_reads
In the previous step, each DNA.pileup file was imported and an additional column was added that summed the “Plus_reads” and the Minus_reads. The sum column in each DNA.pileup file was named after the DNA.pileup sample name (e.g. “S108379”). All DNA.pileup files contain identical genecalls in an identical order allowing us to extract the genecall ID and genecall Length columns from a single sample (S107104), along with the summed count column of each sample, to create a new dataframe that represents the initial, FULL dataset for downstream analyses:
cts_dna_all <- data.frame(pileup_107104$ID, pileup_107104$Length, pileup_107104$S107104, pileup_107105$S107105, pileup_107106$S107106, pileup_107110$S107110, pileup_107111$S107111, pileup_107112$S107112, pileup_107113$S107113, pileup_107114$S107114, pileup_107115$S107115, pileup_107107$S107107, pileup_107108$S107108, pileup_107116$S107116, pileup_107117$S107117, pileup_107119$S107119, pileup_107121$S107121)
names(cts_dna_all) <- c("ID", "Length", "S107104", "S107105", "S107106", "S107110", "S107111", "S107112", "S107113", "S107114", "S107115", "S107107", "S107108", "S107116", "S107117", "S107119", "S107121")
The KEGG Functions file (GhostKoala) and KEGG Taxonomy file (GhostKoala) were imported and the functional and taxonomic annotations were matched to their corresponding genecall ID.
First, a KEGG reference file was imported with additional KEGG Tier pathways that provide additional, useful, functional categories for downstream analyses. The Tier II, III, and IV categories were merged to the GhostKoala KEGG functional annotations in subsequent steps.
# Import KO_Orthology.txt reference file
KO_ref<-read.table(file='Annotations.Data/KO_Orthology.txt', sep='\t', quote="", fill=TRUE, header = FALSE)
# Rename column headers
names(KO_ref)<-c("Tier_II","Tier_III","Tier_IV","KEGG")
# Separate "KEGG" column into "KO" and "Annotation" columns
KO_ref<-KO_ref %>% separate(KEGG, c("KO_Accession","Annotation"), " ", extra="merge")
# Further separatae "Annotation" column into "Symbol" and "Function" columns
KO_ref<-KO_ref %>% separate(Annotation, c("KO_Symbol","KO_Function"), "; ", extra="merge")
There are 3,917,616 total unique DNA-based genecalls in the metagenomic dataset. The GhostKOALA KEGG functional annotations file (“DNA_data_bbnorm_anvio_contigs_annotations.tsv”) contains annotations for 1,389,518 unique DNA-based genecalls. This indicates that 35.5% of the total unique DNA-based genecalls can be functionally annotated. That also means that 64.5% of unique DNA-based genecalls cannot be annotated and will not be included in downstream analyses. Here, the unique genecall functional annotations will be matched to the same unique genecalls in the DNA-based metagenomic dataset.
# Import the KEGG Functional Annotations file that was exported from the Anvi'o DNA contigs database
KeggAnvio<-read.table(file='Annotations.Data/DNA_data_bbnorm_anvio_contigs_annotations.tsv', sep='\t', quote = "", fill = TRUE, header = TRUE)
# Rename column headers
names(KeggAnvio)<-c("ID","Source","Accession","Function","e-value")
# Separate "Function" column into "Symbol" and "Function" columns
KeggAnvio<-KeggAnvio %>% separate(Function, c("Symbol","Function"), "; ", extra="merge")
# Add "genecall_" to the beginning of each gene ID value to match with downstream analyses
KeggAnvio$ID <- paste("genecall", KeggAnvio$ID, sep="_")
# Add "KO_Symbol" category to the dataset, matched by KO's between the KO_ref file and KeggAnvio
KeggAnvio$KO_Symbol = KO_ref[match(KeggAnvio$Accession, KO_ref$KO_Accession), "KO_Symbol"]
# Add "KO_Function" category to the dataset, matched by KO's between the KO_ref file and KeggAnvio
KeggAnvio$KO_Function = KO_ref[match(KeggAnvio$Accession, KO_ref$KO_Accession), "KO_Function"]
# Add "Tier_II" category to the dataset, matched by KO's between the KO_ref file and KeggAnvio
KeggAnvio$Tier_II = KO_ref[match(KeggAnvio$Accession, KO_ref$KO_Accession), "Tier_II"]
# Add "Tier_III" category to the dataset, matched by KO's between the KO_ref file and KeggAnvio
KeggAnvio$Tier_III = KO_ref[match(KeggAnvio$Accession, KO_ref$KO_Accession), "Tier_III"]
# Add "Tier_IV" category to the dataset, matched by KO's between the KO_ref file and KeggAnvio
KeggAnvio$Tier_IV = KO_ref[match(KeggAnvio$Accession, KO_ref$KO_Accession), "Tier_IV"]
# Keep the Function category from the KO Reference file to potentially fill in gaps in KeggAnvio annotations (many KO's labeled as "None", but the KO should match a known function from the reference file).
KeggFunction<-data.frame(KeggAnvio$ID,KeggAnvio$Accession,KeggAnvio$Symbol,KeggAnvio$KO_Function,KeggAnvio$Tier_II,KeggAnvio$Tier_III,KeggAnvio$Tier_IV)
names(KeggFunction)<-c("ID","KO","Symbol","Function","Tier_II","Tier_III","Tier_IV")
# Merge the KO, Symbol, Function, and all Tier columns into single "Combined" column for some downstream analyses
KeggFunction$Combined<-paste(KeggFunction$KO, ":", KeggFunction$Symbol, ":", KeggFunction$Function, ":", KeggFunction$Tier_II, ":", KeggFunction$Tier_III, ":", KeggFunction$Tier_IV)
# Keep just "ID" and "Combined" columns for use with some downstream analyses
KeggData<-data.frame(KeggFunction$ID, KeggFunction$Combined)
names(KeggData)<-c("ID","KEGG")
There are 3,917,616 total unique DNA-based genecalls in the metagenomic dataset. The GhostKoala KEGG taxonomy file (“KeggTaxonomy.txt”) contains annotations for 3,912,253 unique DNA-based genecalls. This indicates that 99.9% of the total unique DNA-based genecalls can be taxonomically annotated. Here, the unique genecall taxonomic annotations will be matched to the same unique genecalls in the DNA-based metagenomic dataset.
#Import KeggTaxonomy.txt from GhostKoala
KeggTaxa<-read.table("Annotations.Data/KeggTaxonomy.txt", header = TRUE, fill = TRUE)
names(KeggTaxa)<-c("ID","Domain","Phylum","Class","Order","Family","Genus","Species")
# Merge the taxonomic classes into a single column
KeggTaxa$Taxonomy <- paste(KeggTaxa$Domain, ":", KeggTaxa$Phylum, ":", KeggTaxa$Class, ":", KeggTaxa$Order, ":", KeggTaxa$Family, ":", KeggTaxa$Genus, ":", KeggTaxa$Species)
# Now keep just "ID" and "Taxonomy"
KeggTaxa <- data.frame(KeggTaxa$ID, KeggTaxa$Taxonomy)
names(KeggTaxa)<-c("ID","Taxonomy")
# Add "genecall_" to the beginning of each gene ID value to match with downstream analyses
KeggTaxa$ID<-paste("genecall", KeggTaxa$ID, sep="_")
The FULL DNA CTS dataset was assembled as a single dataframe (columns: [1] genecall, [2] length, [3-14] metagenome Samples; rows: 3,917,616 unique “genecalls”). Now, retain only those genecalls with a value greater than zero in any single sample. This indicates that these genes were present under at least ONE treatment in the experiment (i.e., potential expression).
Subsample the unique genecalls that have greater than zero raw read counts (“Potential Genes”) across all samples.
#cts_dna_potential
cts_dna_potential <- subset(cts_dna_all, S107104 > 0 | S107105 > 0 | S107106 > 0 | S107110 > 0 | S107111 > 0 | S107112 > 0 | S107113 > 0 | S107114 > 0 | S107115 > 0 | S107107 > 0 | S107108 > 0 | S107116 > 0 | S107117 > 0 | S107119 > 0 | S107121 > 0, select=c(ID,Length,S107104,S107105,S107106,S107110,S107111,S107112,S107113,S107114,S107115,S107107,S107108,S107116,S107117,S107119,S107121))
Add functional and taxonomic annotations to each unique genecall within the “Potential” subdata.
# Potential Genes Functional and Taxonomic Annotations
cts_dna_potential$KEGG = KeggData[match(cts_dna_potential$ID, KeggData$ID), "KEGG"]
cts_dna_potential$Taxonomy = KeggTaxa[match(cts_dna_potential$ID, KeggTaxa$ID), "Taxonomy"]
# Potential Annotated Genes
cts_dna_pot_ann<-na.omit(cts_dna_potential, cols="KEGG")
From this point, we retain only those genecalls that were potentially expressed and have a KEGG functional annotation. Start by separating the KEGG categories into unique columns. Remove any genecall whose symbol annotation is “None”. Further remove any genecall annotated as Tier II “Organismal Systems” or “Human Diseases”.
# Separate "KEGG" column into "KO", "Symbol", "Function", "Tier_II", "Tier_III", and "Tier_IV" columns
cts_dna_pot_ann<-cts_dna_pot_ann %>% separate(KEGG, c("KO","Symbol","Function","Tier_II","Tier_III","Tier_IV"), ": ", extra="merge")
# Remove KO's with "None" as the "Symbol" annotation
cts_dna_pot_ann<-cts_dna_pot_ann[!grepl("None", cts_dna_pot_ann$Symbol),]
# Remove KO's with Tier II category "Organismal Systems"
cts_dna_pot_ann<-cts_dna_pot_ann[!grepl("Organismal Systems ", cts_dna_pot_ann$Tier_II),]
# Remove KO's with Tier II category "Human Diseases"
cts_dna_pot_ann<-cts_dna_pot_ann[!grepl("Human Diseases ", cts_dna_pot_ann$Tier_II),]
Separate the Taxonomy categories into unique columns. Remove any genecall whose taxonomy is outside of “Archaea”, “Bacteria”, or “Fungi”.
# Separate "Taxonomy" column into "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", and "Species" columns
cts_dna_pot_ann<-cts_dna_pot_ann %>% separate(Taxonomy, c("Kingdom","Phylum","Class","Order","Family","Genus","Species"), ": ", extra="merge")
# Remove KO's with "ag" as the "Kingdom" annotation
cts_dna_pot_ann<-cts_dna_pot_ann[!grepl("ag", cts_dna_pot_ann$Kingdom),]
# Remove KO's with "Animals" as the "Kingdom" annotation
cts_dna_pot_ann<-cts_dna_pot_ann[!grepl("Animals", cts_dna_pot_ann$Kingdom),]
# Remove KO's with "Plants" as the "Kingdom" annotation
cts_dna_pot_ann<-cts_dna_pot_ann[!grepl("Plants", cts_dna_pot_ann$Kingdom),]
# Remove KO's with "Protists" as the "Kingdom" annotation
cts_dna_pot_ann<-cts_dna_pot_ann[!grepl("Protists", cts_dna_pot_ann$Kingdom),]
#Re-merge the Taxonomy columns for downstream analysis
cts_dna_pot_ann$Taxonomy <- paste(cts_dna_pot_ann$Kingdom, ":", cts_dna_pot_ann$Phylum, ":", cts_dna_pot_ann$Class, ":", cts_dna_pot_ann$Order, ":", cts_dna_pot_ann$Family, ":", cts_dna_pot_ann$Genus, ":", cts_dna_pot_ann$Species)
GPM Normalization: Converts raw DNA gene counts to Genes per Million. GPM follows the same methds as TPM (Transcripts per Million) described by Wagner et al. 2012 and used to normalize RNA gene counts in the metatranscriptomes section below.
Calculate Tg values by multiplying the read count of each genecall by the average read length within each sample, then divide by the gene length of each genecall.
cts_dna_pot_ann$S107104_Tg<-(cts_dna_pot_ann$S107104 * 139) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107105_Tg<-(cts_dna_pot_ann$S107105 * 140) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107106_Tg<-(cts_dna_pot_ann$S107106 * 141) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107110_Tg<-(cts_dna_pot_ann$S107110 * 140) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107111_Tg<-(cts_dna_pot_ann$S107111 * 141) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107112_Tg<-(cts_dna_pot_ann$S107112 * 141) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107113_Tg<-(cts_dna_pot_ann$S107113 * 141) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107114_Tg<-(cts_dna_pot_ann$S107114 * 140) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107115_Tg<-(cts_dna_pot_ann$S107115 * 141) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107107_Tg<-(cts_dna_pot_ann$S107107 * 141) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107108_Tg<-(cts_dna_pot_ann$S107108 * 141) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107116_Tg<-(cts_dna_pot_ann$S107116 * 141) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107117_Tg<-(cts_dna_pot_ann$S107117 * 141) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107119_Tg<-(cts_dna_pot_ann$S107119 * 141) / cts_dna_pot_ann$Length
cts_dna_pot_ann$S107121_Tg<-(cts_dna_pot_ann$S107121 * 141) / cts_dna_pot_ann$Length
Now multiple each Tg value within a sample by 1e+06 and divide by the sum of Tg values for that sample to determine GPM.
cts_dna_pot_ann$S107104_GPM<-(cts_dna_pot_ann$S107104_Tg * 1000000) / sum(cts_dna_pot_ann$S107104_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107105_GPM<-(cts_dna_pot_ann$S107105_Tg * 1000000) / sum(cts_dna_pot_ann$S107105_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107106_GPM<-(cts_dna_pot_ann$S107106_Tg * 1000000) / sum(cts_dna_pot_ann$S107106_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107110_GPM<-(cts_dna_pot_ann$S107110_Tg * 1000000) / sum(cts_dna_pot_ann$S107110_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107111_GPM<-(cts_dna_pot_ann$S107111_Tg * 1000000) / sum(cts_dna_pot_ann$S107111_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107112_GPM<-(cts_dna_pot_ann$S107112_Tg * 1000000) / sum(cts_dna_pot_ann$S107112_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107113_GPM<-(cts_dna_pot_ann$S107113_Tg * 1000000) / sum(cts_dna_pot_ann$S107113_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107114_GPM<-(cts_dna_pot_ann$S107114_Tg * 1000000) / sum(cts_dna_pot_ann$S107114_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107115_GPM<-(cts_dna_pot_ann$S107115_Tg * 1000000) / sum(cts_dna_pot_ann$S107115_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107107_GPM<-(cts_dna_pot_ann$S107107_Tg * 1000000) / sum(cts_dna_pot_ann$S107107_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107108_GPM<-(cts_dna_pot_ann$S107108_Tg * 1000000) / sum(cts_dna_pot_ann$S107108_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107116_GPM<-(cts_dna_pot_ann$S107116_Tg * 1000000) / sum(cts_dna_pot_ann$S107116_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107117_GPM<-(cts_dna_pot_ann$S107117_Tg * 1000000) / sum(cts_dna_pot_ann$S107117_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107119_GPM<-(cts_dna_pot_ann$S107119_Tg * 1000000) / sum(cts_dna_pot_ann$S107119_Tg,na.rm=TRUE)
cts_dna_pot_ann$S107121_GPM<-(cts_dna_pot_ann$S107121_Tg * 1000000) / sum(cts_dna_pot_ann$S107121_Tg,na.rm=TRUE)
Create the GPM dataframe by subsetting columns of interest from the cts_dna_pot_ann dataframe.
gpm_all<-data.frame(cts_dna_pot_ann$ID,cts_dna_pot_ann$S107104_GPM,cts_dna_pot_ann$S107105_GPM,cts_dna_pot_ann$S107106_GPM,cts_dna_pot_ann$S107110_GPM,cts_dna_pot_ann$S107111_GPM,cts_dna_pot_ann$S107112_GPM,cts_dna_pot_ann$S107113_GPM,cts_dna_pot_ann$S107114_GPM,cts_dna_pot_ann$S107115_GPM,cts_dna_pot_ann$S107107_GPM,cts_dna_pot_ann$S107108_GPM,cts_dna_pot_ann$S107116_GPM,cts_dna_pot_ann$S107117_GPM,cts_dna_pot_ann$S107119_GPM,cts_dna_pot_ann$S107121_GPM,cts_dna_pot_ann$KO,cts_dna_pot_ann$Symbol,cts_dna_pot_ann$Function,cts_dna_pot_ann$Tier_II,cts_dna_pot_ann$Tier_III,cts_dna_pot_ann$Tier_IV,cts_dna_pot_ann$Taxonomy)
names(gpm_all)<-c("ID","S107104_GPM","S107105_GPM","S107106_GPM","S107110_GPM","S107111_GPM","S107112_GPM","S107113_GPM","S107114_GPM","S107115_GPM","S107107_GPM","S107108_GPM","S107116_GPM","S107117_GPM","S107119_GPM","S107121_GPM","KO","Symbol","Function","Tier_II","Tier_III","Tier_IV","Taxonomy")
Now, we further separate the normalized potential annotated DNA subdata into their experimental treatments in order to analyze statistical differences in potential gene expression patterns between experimental treatments. Pairwise similarities among metagenomces will be calculated using Bray-Curtis similarity values, with differences between treatments assessed via PERMANOVA.
Separate the Tussock samples into their own dataset. Keep only those unique genecalls that have potential expression in at least 1 Tuss sample.
# Subset the TUSS GPM Normalized Potential Annotated subdata
gpm_tuss <- subset(gpm_all, select=c(ID,S107107_GPM,S107108_GPM,S107116_GPM,S107117_GPM,S107119_GPM,S107121_GPM,KO,Symbol,Function,Tier_II,Tier_III,Tier_IV, Taxonomy))
names(gpm_tuss)<-c("ID","Tuss1_T0","Tuss2_T0","Tuss1_T4","Tuss2_T4","Tuss1_T24","Tuss3_T24","KO","Symbol","Function","Tier_II","Tier_III","Tier_IV","Taxonomy")
# Remove any non-expressed genes from the Tussock samples
gpm_tuss_potential <- subset(gpm_tuss, Tuss1_T0 > 0 | Tuss2_T0 > 0 | Tuss1_T4 > 0 | Tuss2_T4 > 0 | Tuss1_T24 > 0 | Tuss3_T24 > 0, select=c(ID,Tuss1_T0,Tuss2_T0,Tuss1_T4,Tuss2_T4,Tuss1_T24,Tuss3_T24,KO,Symbol,Function,Tier_II,Tier_III,Tier_IV,Taxonomy))
Separate the Wet Sedge samples into their own dataset. Keep only those unique genecalls that have potential expression in at least 1 WS sample.
# Subset the WS GPM Normalized Potential Annotated subdata
gpm_ws <- subset(gpm_all, select=c(ID,S107104_GPM,S107105_GPM,S107106_GPM,S107110_GPM,S107111_GPM,S107112_GPM,S107113_GPM,S107114_GPM,S107115_GPM,KO,Symbol,Function,Tier_II,Tier_III,Tier_IV,Taxonomy))
names(gpm_ws)<-c("ID","WS1_T0","WS2_T0","WS3_T0","WS1_T4","WS2_T4","WS3_T4","WS1_T24","WS2_T24","WS3_T24","KO","Symbol","Function","Tier_II","Tier_III","Tier_IV","Taxonomy")
# Remove any non-expressed genes from the Wet Sedge samples
gpm_ws_potential <- subset(gpm_ws, WS1_T0 > 0 | WS2_T0 > 0 | WS3_T0 > 0 | WS1_T4 > 0 | WS2_T4 > 0 | WS3_T4 > 0 | WS1_T24 > 0 | WS2_T24 > 0 | WS3_T24 > 0, select=c(ID,WS1_T0,WS2_T0,WS3_T0,WS1_T4,WS2_T4,WS3_T4,WS1_T24,WS2_T24,WS3_T24,KO,Symbol,Function,Tier_II,Tier_III,Tier_IV,Taxonomy))
We also need the full TUSS and WS Potential Annotated data together to make comparisons between ecosystems within each sampling timepoint. Reformat the gpm_all table and separate the “KEGG” column into all of its functional categories for downstream use.
# Make a new object for the rpm_all table so that you don't overwrite the original
gpm_all_pot_ann<-gpm_all
names(gpm_all_pot_ann)<-c("ID","S107104","S107105","S107106","S107110","S107111","S107112","S107113","S107114","S107115","S107107","S107108","S107116","S107117","S107119","S107121","KO","Symbol","Function","Tier_II","Tier_III","Tier_IV","Taxonomy")
Here, we determine the taxonomic composition of the microbial community by analyzing the relative abundance of potential genes for KEGG tier IV “Ribosomes” (03010 Ribosome PATH: ko03010). Within the “Ribosome” category, there are annotations for “small subunit ribosomal protein” (ssu) and “large subunit ribosomal protein” (lsu). For Bacteria and Archaea, we use the ssu annotations (similar to 16S ssu rRNA targeted analysis). For Fungi, we use both the ssu (18S ssu rRNA) and lsu (28S lsu rRNA) annotations.
The first analysis pulls out all gene expression for “Ribosomes” and removes lsu annotations such that only ssu annotations are used for determining relative expression of Bacteria and Archaea at each sampling time point in both tussock and wet sedge tundra.
# Subset all unique genecalls whose annotation matches "Ribosomes"
taxa_ribo_dna_all<- gpm_all_pot_ann[ which(gpm_all_pot_ann$Tier_IV=='03010 Ribosome [PATH:ko03010]'),]
# Make copy of Ribosome dataset to manipulate
taxa_ribo_dna_ssu<-taxa_ribo_dna_all
# Remove "large subunit ribosomal protein" from dataset
taxa_ribo_dna_ssu<-taxa_ribo_dna_ssu[!grepl("large subunit ribosomal protein", taxa_ribo_dna_ssu$Function),]
# Write data to file for Alpha Diversity statistics
write.csv(taxa_ribo_dna_ssu, 'Ribo.Results/ALPHA.taxa.dna.ssu.csv')
# Remove "Fungi" from dataset
taxa_ribo_dna_ssu<-taxa_ribo_dna_ssu[!grepl("Fungi", taxa_ribo_dna_ssu$Taxonomy),]
# Separate "Taxonomy" column into all subdivisions
taxa_ribo_dna_ssu<-taxa_ribo_dna_ssu %>% separate(Taxonomy, c("Kingdom","Phylum","Class","Order","Family","Genus","Species"), ": ", extra="merge")
# Keep "Kingdom" and "Phylum"
taxa_ribo_dna_ssu <- data.frame(taxa_ribo_dna_ssu$Kingdom,taxa_ribo_dna_ssu$Phylum, taxa_ribo_dna_ssu$S107104,taxa_ribo_dna_ssu$S107105,taxa_ribo_dna_ssu$S107106,taxa_ribo_dna_ssu$S107110,taxa_ribo_dna_ssu$S107111,taxa_ribo_dna_ssu$S107112,taxa_ribo_dna_ssu$S107113,taxa_ribo_dna_ssu$S107114,taxa_ribo_dna_ssu$S107115,taxa_ribo_dna_ssu$S107107,taxa_ribo_dna_ssu$S107108,taxa_ribo_dna_ssu$S107116,taxa_ribo_dna_ssu$S107117,taxa_ribo_dna_ssu$S107119,taxa_ribo_dna_ssu$S107121)
names(taxa_ribo_dna_ssu)<-c("Kingdom","Phylum","ws1-T0","ws2-T0","ws3-T0","ws1-T4","ws2-T4","ws3-T4","ws1-T24","ws2-T24","ws3-T24","tuss1-T0","tuss2-T0","tuss1-T4","tuss2-T4","tuss1-T24","tuss3-T24")
# Remove "Unclassified" taxa otherwise it will clump unclassified Archaea with unclassified Bacteria (not what we want)
taxa_ribo_dna_ssu<-taxa_ribo_dna_ssu[!grepl("Unclassified", taxa_ribo_dna_ssu$Phylum),]
# Remove "Other" taxa otherwise it will clump "other" Archaea with "other" Bacteria (not what we want)
taxa_ribo_dna_ssu<-taxa_ribo_dna_ssu[!grepl("Other", taxa_ribo_dna_ssu$Phylum),]
# Sort data by Kingdom alphabetically
taxa_ribo_dna_ssu<-taxa_ribo_dna_ssu[order(taxa_ribo_dna_ssu$Kingdom,taxa_ribo_dna_ssu$Phylum),]
# Sum each column by unique Phylum
taxa_ribo_dna_ssu_sum<-taxa_ribo_dna_ssu %>% group_by(Phylum) %>% summarise_at(vars("ws1-T0","ws2-T0","ws3-T0","ws1-T4","ws2-T4","ws3-T4","ws1-T24","ws2-T24","ws3-T24","tuss1-T0","tuss2-T0","tuss1-T4","tuss2-T4","tuss1-T24","tuss3-T24"), sum)
taxa_ribo_dna_ssu_sum<-taxa_ribo_dna_ssu_sum %>% mutate_at(vars('ws1-T0','ws2-T0','ws3-T0','ws1-T4','ws2-T4','ws3-T4','ws1-T24','ws2-T24','ws3-T24','tuss1-T0','tuss2-T0','tuss1-T4','tuss2-T4','tuss1-T24','tuss3-T24'), funs(round(., 0)))
taxa_ribo_dna_ssu_sum<-as.data.frame(taxa_ribo_dna_ssu_sum)
taxa_ribo_dna_ssu_sum
Repeat the same “Ribosome” analysis as above (at Phylum level), but this time keep all the taxonomy information.
# Make copy of taxa_ribo_all to manipulate
taxa_ribo_dna_ssu2<-taxa_ribo_dna_all
# Keep "Taxonomy" column
taxa_ribo_dna_ssu2 <- data.frame(taxa_ribo_dna_ssu2$Function,taxa_ribo_dna_ssu2$Taxonomy,taxa_ribo_dna_ssu2$S107104,taxa_ribo_dna_ssu2$S107105,taxa_ribo_dna_ssu2$S107106,taxa_ribo_dna_ssu2$S107110,taxa_ribo_dna_ssu2$S107111,taxa_ribo_dna_ssu2$S107112,taxa_ribo_dna_ssu2$S107113,taxa_ribo_dna_ssu2$S107114,taxa_ribo_dna_ssu2$S107115,taxa_ribo_dna_ssu2$S107107,taxa_ribo_dna_ssu2$S107108,taxa_ribo_dna_ssu2$S107116,taxa_ribo_dna_ssu2$S107117,taxa_ribo_dna_ssu2$S107119,taxa_ribo_dna_ssu2$S107121)
names(taxa_ribo_dna_ssu2)<-c("Function","Taxonomy","ws1-T0","ws2-T0","ws3-T0","ws1-T4","ws2-T4","ws3-T4","ws1-T24","ws2-T24","ws3-T24","tuss1-T0","tuss2-T0","tuss1-T4","tuss2-T4","tuss1-T24","tuss3-T24")
# Remove "large subunit ribosomal protein" from dataset
taxa_ribo_dna_ssu2<-taxa_ribo_dna_ssu2[!grepl("large subunit ribosomal protein", taxa_ribo_dna_ssu2$Function),]
# Remove "Fungi" from dataset
taxa_ribo_dna_ssu2<-taxa_ribo_dna_ssu2[!grepl("Fungi", taxa_ribo_dna_ssu2$Taxonomy),]
# Sort data by Kingdom alphabetically
taxa_ribo_dna_ssu2<-taxa_ribo_dna_ssu2[order(taxa_ribo_dna_ssu2$Function,taxa_ribo_dna_ssu2$Taxonomy),]
# Sum each column by unique Taxonomy
taxa_ribo_dna_ssu2_sum<-taxa_ribo_dna_ssu2 %>% group_by(Taxonomy) %>% summarise_at(vars("ws1-T0","ws2-T0","ws3-T0","ws1-T4","ws2-T4","ws3-T4","ws1-T24","ws2-T24","ws3-T24","tuss1-T0","tuss2-T0","tuss1-T4","tuss2-T4","tuss1-T24","tuss3-T24"), sum)
taxa_ribo_dna_ssu2_sum<-taxa_ribo_dna_ssu2_sum %>% mutate_at(vars('ws1-T0','ws2-T0','ws3-T0','ws1-T4','ws2-T4','ws3-T4','ws1-T24','ws2-T24','ws3-T24','tuss1-T0','tuss2-T0','tuss1-T4','tuss2-T4','tuss1-T24','tuss3-T24'), funs(round(., 0)))
taxa_ribo_dna_ssu2_sum<-as.data.frame(taxa_ribo_dna_ssu2_sum)
taxa_ribo_dna_ssu2_sum
The second analysis pulls out all gene expression for “Ribosomes” and removes Bacteria and Archaea annotations while retaining just Fungi with both ssu and lsu annotations to determine relative expression of Fungi at each sampling time point in both tussock and wet sedge tundra.
# Make copy of Ribosome dataset to manipulate
taxa_ribo_dna_fungi<-taxa_ribo_dna_all
# Remove "Bacteria" from dataset
taxa_ribo_dna_fungi<-taxa_ribo_dna_fungi[!grepl("Bacteria", taxa_ribo_dna_fungi$Taxonomy),]
# Remove "Archaea" from dataset
taxa_ribo_dna_fungi<-taxa_ribo_dna_fungi[!grepl("Archaea", taxa_ribo_dna_fungi$Taxonomy),]
# Separate "Taxonomy" column into all subdivisions
taxa_ribo_dna_fungi<-taxa_ribo_dna_fungi %>% separate(Taxonomy, c("Kingdom","Phylum","Class","Order","Family","Genus","Species"), ": ", extra="merge")
# Keep "Kingdom" and "Phylum"
taxa_ribo_dna_fungi <- data.frame(taxa_ribo_dna_fungi$Kingdom,taxa_ribo_dna_fungi$Phylum, taxa_ribo_dna_fungi$S107104,taxa_ribo_dna_fungi$S107105,taxa_ribo_dna_fungi$S107106,taxa_ribo_dna_fungi$S107110,taxa_ribo_dna_fungi$S107111,taxa_ribo_dna_fungi$S107112,taxa_ribo_dna_fungi$S107113,taxa_ribo_dna_fungi$S107114,taxa_ribo_dna_fungi$S107115,taxa_ribo_dna_fungi$S107107,taxa_ribo_dna_fungi$S107108,taxa_ribo_dna_fungi$S107116,taxa_ribo_dna_fungi$S107117,taxa_ribo_dna_fungi$S107119,taxa_ribo_dna_fungi$S107121)
names(taxa_ribo_dna_fungi)<-c("Kingdom","Phylum","ws1-T0","ws2-T0","ws3-T0","ws1-T4","ws2-T4","ws3-T4","ws1-T24","ws2-T24","ws3-T24","tuss1-T0","tuss2-T0","tuss1-T4","tuss2-T4","tuss1-T24","tuss3-T24")
# Remove "Unclassified" taxa
taxa_ribo_dna_fungi<-taxa_ribo_dna_fungi[!grepl("Unclassified", taxa_ribo_dna_fungi$Phylum),]
# Remove "Other" taxa
taxa_ribo_dna_fungi<-taxa_ribo_dna_fungi[!grepl("Other", taxa_ribo_dna_fungi$Phylum),]
# Sort data by Kingdom alphabetically
taxa_ribo_dna_fungi<-taxa_ribo_dna_fungi[order(taxa_ribo_dna_fungi$Kingdom,taxa_ribo_dna_fungi$Phylum),]
# Sum each column by unique Phylum
taxa_ribo_dna_fungi_sum<-taxa_ribo_dna_fungi %>% group_by(Phylum) %>% summarise_at(vars("ws1-T0","ws2-T0","ws3-T0","ws1-T4","ws2-T4","ws3-T4","ws1-T24","ws2-T24","ws3-T24","tuss1-T0","tuss2-T0","tuss1-T4","tuss2-T4","tuss1-T24","tuss3-T24"), sum)
taxa_ribo_dna_fungi_sum<-taxa_ribo_dna_fungi_sum %>% mutate_at(vars('ws1-T0','ws2-T0','ws3-T0','ws1-T4','ws2-T4','ws3-T4','ws1-T24','ws2-T24','ws3-T24','tuss1-T0','tuss2-T0','tuss1-T4','tuss2-T4','tuss1-T24','tuss3-T24'), funs(round(., 0)))
taxa_ribo_dna_fungi_sum<-as.data.frame(taxa_ribo_dna_fungi_sum)
taxa_ribo_dna_fungi_sum
Repeat the same “Ribosome” analysis as above (at Phylum level), but this time keep all the taxonomy information.
# Make copy of taxa_ribo_all to manipulate
taxa_ribo_dna_fungi2<-taxa_ribo_dna_all
# Keep "Taxonomy" column
taxa_ribo_dna_fungi2 <- data.frame(taxa_ribo_dna_fungi2$Function,taxa_ribo_dna_fungi2$Taxonomy,taxa_ribo_dna_fungi2$S107104,taxa_ribo_dna_fungi2$S107105,taxa_ribo_dna_fungi2$S107106,taxa_ribo_dna_fungi2$S107110,taxa_ribo_dna_fungi2$S107111,taxa_ribo_dna_fungi2$S107112,taxa_ribo_dna_fungi2$S107113,taxa_ribo_dna_fungi2$S107114,taxa_ribo_dna_fungi2$S107115,taxa_ribo_dna_fungi2$S107107,taxa_ribo_dna_fungi2$S107108,taxa_ribo_dna_fungi2$S107116,taxa_ribo_dna_fungi2$S107117,taxa_ribo_dna_fungi2$S107119,taxa_ribo_dna_fungi2$S107121)
names(taxa_ribo_dna_fungi2)<-c("Function","Taxonomy","ws1-T0","ws2-T0","ws3-T0","ws1-T4","ws2-T4","ws3-T4","ws1-T24","ws2-T24","ws3-T24","tuss1-T0","tuss2-T0","tuss1-T4","tuss2-T4","tuss1-T24","tuss3-T24")
# Remove "Bacteria" from dataset
taxa_ribo_dna_fungi2<-taxa_ribo_dna_fungi2[!grepl("Bacteria", taxa_ribo_dna_fungi2$Taxonomy),]
# Remove "Archaea" from dataset
taxa_ribo_dna_fungi2<-taxa_ribo_dna_fungi2[!grepl("Archaea", taxa_ribo_dna_fungi2$Taxonomy),]
# Sort data by Kingdom alphabetically
taxa_ribo_dna_fungi2<-taxa_ribo_dna_fungi2[order(taxa_ribo_dna_fungi2$Function,taxa_ribo_dna_fungi2$Taxonomy),]
# Sum each column by unique Taxonomy
taxa_ribo_dna_fungi2_sum<-taxa_ribo_dna_fungi2 %>% group_by(Taxonomy) %>% summarise_at(vars("ws1-T0","ws2-T0","ws3-T0","ws1-T4","ws2-T4","ws3-T4","ws1-T24","ws2-T24","ws3-T24","tuss1-T0","tuss2-T0","tuss1-T4","tuss2-T4","tuss1-T24","tuss3-T24"), sum)
taxa_ribo_dna_fungi2_sum<-taxa_ribo_dna_fungi2_sum %>% mutate_at(vars('ws1-T0','ws2-T0','ws3-T0','ws1-T4','ws2-T4','ws3-T4','ws1-T24','ws2-T24','ws3-T24','tuss1-T0','tuss2-T0','tuss1-T4','tuss2-T4','tuss1-T24','tuss3-T24'), funs(round(., 0)))
taxa_ribo_dna_fungi2_sum<-as.data.frame(taxa_ribo_dna_fungi2_sum)
taxa_ribo_dna_fungi2_sum
Plot the relative abundance of taxa by all replicates within tussock and wet sedge tundra as a stackplot.
# Place taxa in order for plotting
taxa.MG.all.chart$Species<-factor(taxa.MG.all.chart$Species,levels = c("Acidobacteria","Actinobacteria","Alphaproteobacteria","Betaproteobacteria","Deltaproteobacteria","Gammaproteobacteria","Proteobacteria Unclassified","Bacteroidetes","Chloroflexi","Firmicutes","Planctomycetes","Verrucomicrobia","Bacteria Unclassified","Bacteria Other","Archaea","Fungi"))
taxa.MG.all.chart$Species<-fct_rev(taxa.MG.all.chart$Species)
taxa.MG.all.chart$Sample<-factor(taxa.MG.all.chart$Sample,levels = c("Tuss1-MG-T0","Tuss2-MG-T0","Tuss1-MG-T4","Tuss2-MG-T4","Tuss1-MG-T24","Tuss3-MG-T24","WS1-MG-T0","WS2-MG-T0","WS3-MG-T0","WS1-MG-T4","WS2-MG-T4","WS3-MG-T4","WS1-MG-T24","WS2-MG-T24","WS3-MG-T24"))
colourCount = length(unique(taxa.MG.all.chart$Species))
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
taxa.MG.all.plot<-ggplot(taxa.MG.all.chart, aes(fill=Species, y=Value, x=Sample)) + geom_bar(position = "stack", stat="identity", color="black") + ylab(expression(atop("Relative Taxon", paste("Expression (%)")))) + theme_minimal() + theme(axis.text=element_text(size=10),axis.title=element_text(size=12),axis.title.x=element_blank()) + theme(legend.position = "right", legend.title=element_blank(), legend.text=element_text(size=8), legend.key.size = unit(0.75,"line"), panel.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_blank(), panel.border = element_blank()) + scale_x_discrete(labels = function(Sample) str_wrap(Sample, width = 8)) + scale_size(guide=FALSE) + scale_fill_manual(values = rev(getPalette(colourCount))) + scale_x_discrete(labels = function(Sample) str_wrap(Sample, width = 15)) + theme(axis.text.x = element_text(angle = 270, hjust=0))
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing
scale.
taxa.MG.all.plot
Plot the mean relative abundance of taxa by sampling time point within tussock and wet sedge tundra as a stackplot.
# Place taxa in order for plotting
taxa.MG.time.mean.chart$Species<-factor(taxa.MG.time.mean.chart$Species,levels = c("Acidobacteria","Actinobacteria","Alphaproteobacteria","Betaproteobacteria","Deltaproteobacteria","Gammaproteobacteria","Proteobacteria Unclassified","Bacteroidetes","Chloroflexi","Firmicutes","Planctomycetes","Verrucomicrobia","Bacteria Unclassified","Bacteria Other","Archaea","Fungi"))
taxa.MG.time.mean.chart$Species<-fct_rev(taxa.MG.time.mean.chart$Species)
taxa.MG.time.mean.chart$Sample<-factor(taxa.MG.time.mean.chart$Sample,levels = c("Tuss-MG-T0","Tuss-MG-T4","Tuss-MG-T24","WS-MG-T0","WS-MG-T4","WS-MG-T24"))
colourCount = length(unique(taxa.MG.time.mean.chart$Species))
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
taxa.MG.time.mean.plot<-ggplot(taxa.MG.time.mean.chart, aes(fill=Species, y=Value, x=Sample)) + geom_bar(position = "stack", stat="identity", color="black") + ylab(expression(atop("Relative Taxon", paste("Expression (%)")))) + theme_minimal() + theme(axis.text=element_text(size=10),axis.title=element_text(size=12),axis.title.x=element_blank()) + theme(legend.position = "right", legend.title=element_blank(), legend.text=element_text(size=8), legend.key.size = unit(0.75,"line"), panel.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_blank(), panel.border = element_blank()) + scale_x_discrete(labels = function(Sample) str_wrap(Sample, width = 8)) + scale_size(guide=FALSE) + scale_fill_manual(values = rev(getPalette(colourCount))) + scale_x_discrete(labels = function(Sample) str_wrap(Sample, width = 15)) + theme(axis.text.x = element_text(angle = 270, hjust=0))
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing
scale.
taxa.MG.time.mean.plot
Plot the mean relative abundance of taxa by within tussock and wet sedge tundra as a stackplot.
# Place taxa in order for plotting
taxa.MG.tundra.mean.chart$Species<-factor(taxa.MG.tundra.mean.chart$Species,levels = c("Acidobacteria","Actinobacteria","Alphaproteobacteria","Betaproteobacteria","Deltaproteobacteria","Gammaproteobacteria","Proteobacteria Unclassified","Bacteroidetes","Chloroflexi","Firmicutes","Planctomycetes","Verrucomicrobia","Bacteria Unclassified","Bacteria Other","Archaea","Fungi"))
taxa.MG.tundra.mean.chart$Species<-fct_rev(taxa.MG.tundra.mean.chart$Species)
taxa.MG.tundra.mean.chart$Sample<-factor(taxa.MG.tundra.mean.chart$Sample,levels = c("Tuss-MG","WS-MG"))
colourCount = length(unique(taxa.MG.tundra.mean.chart$Species))
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
taxa.MG.tundra.mean.plot<-ggplot(taxa.MG.tundra.mean.chart, aes(fill=Species, y=Value, x=Sample)) + geom_bar(position = "stack", stat="identity", color="black") + ylab(expression(atop("Relative Taxon", paste("Expression (%)")))) + theme_minimal() + theme(axis.text=element_text(size=10),axis.title=element_text(size=12),axis.title.x=element_blank()) + theme(legend.position = "right", legend.title=element_blank(), legend.text=element_text(size=8), legend.key.size = unit(0.75,"line"), panel.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_blank(), panel.border = element_blank()) + scale_x_discrete(labels = function(Sample) str_wrap(Sample, width = 8)) + scale_size(guide=FALSE) + scale_fill_manual(values = rev(getPalette(colourCount))) + scale_x_discrete(labels = function(Sample) str_wrap(Sample, width = 15)) + theme(axis.text.x = element_text(angle = 270, hjust=0))
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing
scale.
taxa.MG.tundra.mean.plot
Plot the relative expression of ribosomes as a stackplot with the MEAN of samples by time point
# Place taxa in order for plotting
taxa.tuss.dna.mean.ribo.chart$Species<-factor(taxa.tuss.dna.mean.ribo.chart$Species,levels = c("Acidobacteria","Actinobacteria","Alphaproteobacteria","Betaproteobacteria","Deltaproteobacteria","Gammaproteobacteria","Proteobacteria Unclassified","Bacteroidetes","Chloroflexi","Firmicutes","Planctomycetes","Verrucomicrobia","Bacteria Unclassified","Bacteria Other","Archaea","Fungi"))
taxa.tuss.dna.mean.ribo.chart$Species<-fct_rev(taxa.tuss.dna.mean.ribo.chart$Species)
taxa.tuss.dna.mean.ribo.chart$Sample<-factor(taxa.tuss.dna.mean.ribo.chart$Sample,levels = c("Tuss-MG-T0","Tuss-MG-T4","Tuss-MG-T24"))
colourCount = length(unique(taxa.tuss.dna.mean.ribo.chart$Species))
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
taxa.tuss.dna.mean.ribo.plot<-ggplot(taxa.tuss.dna.mean.ribo.chart, aes(fill=Species, y=Value, x=Sample)) + geom_bar(position = "stack", stat="identity", color="black") + ylab(expression(atop("Community Composition", paste("Relative Abundance (%)")))) + scale_fill_manual(values = rev(getPalette(colourCount)), guide=guide_legend(reverse=FALSE)) + theme_classic() + theme(axis.title.x=element_blank(), panel.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), axis.ticks = element_blank(), axis.text.x = element_text(angle = 270, hjust=0), axis.text=element_text(size=10), axis.title=element_text(size=12), legend.position = "none", legend.title=element_blank(), legend.text=element_text(size=8)) + scale_y_continuous(expand = c(0, 0), limits = c(0, 101))
taxa.tuss.dna.mean.ribo.plot
Plot the relative expression of ribosomes as a stackplot with the MEAN of samples by time point
# Place taxa in order for plotting
taxa.ws.dna.mean.ribo.chart$Species<-factor(taxa.ws.dna.mean.ribo.chart$Species,levels = c("Acidobacteria","Actinobacteria","Alphaproteobacteria","Betaproteobacteria","Deltaproteobacteria","Gammaproteobacteria","Proteobacteria Unclassified","Bacteroidetes","Chloroflexi","Firmicutes","Planctomycetes","Verrucomicrobia","Bacteria Unclassified","Bacteria Other","Archaea","Fungi"))
taxa.ws.dna.mean.ribo.chart$Species<-fct_rev(taxa.ws.dna.mean.ribo.chart$Species)
taxa.ws.dna.mean.ribo.chart$Sample<-factor(taxa.ws.dna.mean.ribo.chart$Sample,levels = c("WS-MG-T0","WS-MG-T4","WS-MG-T24"))
colourCount = length(unique(taxa.ws.dna.mean.ribo.chart$Species))
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
taxa.ws.dna.mean.ribo.plot<-ggplot(taxa.ws.dna.mean.ribo.chart, aes(fill=Species, y=Value, x=Sample)) + geom_bar(position = "stack", stat="identity", color="black") + ylab(expression(atop("Community Composition", paste("Relative Abundance (%)")))) + scale_fill_manual(values = rev(getPalette(colourCount)), guide=guide_legend(reverse=FALSE)) + theme_classic() + theme(axis.title.x=element_blank(), panel.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), axis.ticks = element_blank(), axis.text.x = element_text(angle = 270, hjust=0), axis.text=element_text(size=10), axis.title=element_text(size=12), legend.position = "right", legend.title=element_blank(), legend.text=element_text(size=8), legend.key.size = unit(0.75,"line")) + scale_y_continuous(expand = c(0, 0), limits = c(0, 101))
taxa.ws.dna.mean.ribo.plot
To determine if there are significant differences in mean relative abundance of dominant taxa between tundra ecosystems, we calculate the mean (SD) of each phylum (dominant phylum; >1%) within tussock tundra and within wet sedge tundra throughout the experiment and compare to each other.
# Rename column headings
colnames(taxa.dna.table)<-c("Taxonomy","Tuss Relative Abundance (%)","WS Relative Abundance (%)","Mean Difference", "Paired t-test (p-value)")
kable(taxa.dna.table, caption = "Relative abundance (mean (SD)) of microbial taxa within tussock tundra and wet sedge tundra with mean difference (%) and significance (p-value).", format.args = list(big.mark = ","), align = "l") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Taxonomy | Tuss Relative Abundance (%) | WS Relative Abundance (%) | Mean Difference | Paired t-test (p-value) |
|---|---|---|---|---|
| Actinobacteria | 24 (6) | 27 (2) | 3% | N.S. |
| Acidobacteria | 34 (5) | 10 (0) | 24% | 0.011 |
| Alphaproteobacteria | 15 (4) | 8 (1) | 7% | N.S. |
| Betaproteobacteria | 2 (1) | 6 (1) | 4% | 0.002 |
| Deltaproteobacteria | 4 (0) | 13 (0) | 9% | 0.002 |
| Gammaproteobacteria | 11 (0) | 2 (0) | 9% | < 0.001 |
| Bacteroidetes | 0 (0) | 2 (1) | 2% | N.S. |
| Chloroflexi | 0 (0) | 8 (0) | 8% | < 0.001 |
| Firmicutes | 4 (1) | 15 (2) | 11% | 0.007 |
| Euryarchaeota | 0 (0) | 2 (0) | 2% | 0.013 |
| Ascomycetes | 1 (0) | 0 (0) | 1% | 0.013 |
Test for differences in mean relative abundance of dominant phyla between tundra ecosystems.
# Pairwise comparisons between tundra ecosystems for each microbial taxonomic class
taxa.dna.phylum.t.test.stats <- taxa.dna.phylum.t.test %>%
group_by(Phylum) %>%
pairwise_t_test(
Abundance ~ Tundra, paired = TRUE,
p.adjust.method = "bonferroni"
) %>%
select(-.y., -n1, -n2, -df, -statistic, -p) # Remove details
taxa.dna.phylum.t.test.stats
Alpha Diversity: We also test for significant differences in the diversity of microbial taxa between tundra ecosystems.
# Richness: Sum up the number of non-zero entries per row
alpha.dna.taxa.rich <- apply(alpha.dna.taxa>0,1,sum)
write.csv(alpha.dna.taxa.rich, 'Ribo.Results/alpha.dna.taxa.rich.csv')
alpha.dna.taxa.rich
ws1-T0 ws2-T0 ws3-T0 ws1-T4 ws2-T4 ws3-T4 ws1-T24 ws2-T24 ws3-T24 tuss1-T0
7741 6717 6841 7331 7427 7007 7810 7226 7535 5491
tuss2-T0 tuss1-T4 tuss2-T4 tuss1-T24 tuss2-T24
4729 3094 4452 5122 4494
# Abundance: sum up the number of non-zero entries per row (1)
alpha.dna.taxa.abund <- apply(alpha.dna.taxa,1,sum)
write.csv(alpha.dna.taxa.abund, 'Ribo.Results/alpha.dna.taxa.abund.csv')
alpha.dna.taxa.abund
ws1-T0 ws2-T0 ws3-T0 ws1-T4 ws2-T4 ws3-T4 ws1-T24 ws2-T24 ws3-T24 tuss1-T0
17376.57 16882.94 17885.92 16825.84 18317.26 18251.95 18054.30 16834.71 18324.59 18125.41
tuss2-T0 tuss1-T4 tuss2-T4 tuss1-T24 tuss2-T24
18296.25 18493.72 18746.65 18811.87 20381.91
# Diversity: Shannon-Wiener Index (H')
alpha.dna.taxa.div <- diversity(alpha.dna.taxa, index="shannon")
write.csv(alpha.dna.taxa.div, 'Ribo.Results/alpha.dna.taxa.div.csv')
alpha.dna.taxa.div
ws1-T0 ws2-T0 ws3-T0 ws1-T4 ws2-T4 ws3-T4 ws1-T24 ws2-T24 ws3-T24 tuss1-T0
8.575538 8.318133 8.537789 8.496218 8.447123 8.541090 8.601345 8.379233 8.521386 8.158248
tuss2-T0 tuss1-T4 tuss2-T4 tuss1-T24 tuss2-T24
7.984487 7.665960 8.030042 8.123260 7.936642
Run paired t-test to determine significance for Shannon Diversity.
t.test(alpha.dna.taxa.t.test$Tuss,alpha.dna.taxa.t.test$WS,paired=TRUE)
Paired t-test
data: alpha.dna.taxa.t.test$Tuss and alpha.dna.taxa.t.test$WS
t = -5.9292, df = 5, p-value = 0.001947
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.7208950 -0.2848555
sample estimates:
mean of the differences
-0.5028752
Beta Diversity
Calculate the Bray-Curtis dissimilarity matrix to use with adonis
# Make dna_taxa_gpm_all_groups object
beta.dna.taxa_sample<-c("ws1-T0","ws2-T0","ws3-T0","ws1-T4","ws2-T4","ws3-T4","ws1-T24","ws2-T24","ws3-T24","tuss1-T0","tuss2-T0","tuss1-T4","tuss2-T4","tuss1-T24","tuss2-T24")
beta.dna.taxa_veg<-c("WS","WS","WS","WS","WS","WS","WS","WS","WS","TUSS","TUSS","TUSS","TUSS","TUSS","TUSS")
beta.dna.taxa_time<-c("T0","T0","T0","T4","T4","T4","T24","T24","T24","T0","T0","T4","T4","T24","T24")
beta.dna.taxa_groups<-data.frame(beta.dna.taxa_sample,beta.dna.taxa_veg,beta.dna.taxa_time)
# Convert the values in Column 1 ("beta.dna.taxa_sample") into row names
rownames(beta.dna.taxa_groups)<-beta.dna.taxa_groups[,1]
beta.dna.taxa_groups<-beta.dna.taxa_groups[,-1]
# Bray-Curtis Dissimilarity Matrix for tpm_all_taxa_beta
dna.tpm.taxa.bc.dist<-vegdist(beta.dna.taxa, method = "bray")
Run PERMANOVA using adonis function of the Vegan package
PERMANOVA - MG-GPM-Taxa - Bray-Curtis
adonis(dna.tpm.taxa.bc.dist~beta.dna.taxa_veg*beta.dna.taxa_time, data=beta.dna.taxa_groups)
Call:
adonis(formula = dna.tpm.taxa.bc.dist ~ beta.dna.taxa_veg * beta.dna.taxa_time, data = beta.dna.taxa_groups)
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
beta.dna.taxa_veg 1 2.3500 2.35005 18.7557 0.61547 0.001 ***
beta.dna.taxa_time 2 0.1675 0.08376 0.6685 0.04387 0.736
beta.dna.taxa_veg:beta.dna.taxa_time 2 0.1731 0.08654 0.6907 0.04533 0.630
Residuals 9 1.1277 0.12530 0.29533
Total 14 3.8183 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Alpha diversity of each potential, annotated metagenome was assessed by the Shannon index. For beta diversity, pairwise similarities among metagenomes were calculated using Bray-Curtis similarity values and visualized with principle coordinates analysis. The difference between treatments was assessed with PERMANOVA.
We calculated alpha diversity based on raw gene counts. Measurements include Richness, Abundance, and Shannon-Wiener Diversity Index (H’).
# Subset the rpm_all_exp_ann file to prepare for alpha diversity analysis
cts_dna_all_alpha<-subset(cts_dna_pot_ann, select=c(S107104,S107105,S107106,S107110,S107111,S107112,S107113,S107114,S107115,S107107,S107108,S107116,S107117,S107119,S107121))
names(cts_dna_all_alpha)<-c("WS1-T0","WS2-T0","WS3-T0","WS1-T4","WS2-T4","WS3-T4","WS1-T24","WS2-T24","WS3-T24","Tuss1-T0","Tuss2-T0","Tuss1-T4","Tuss2-T4","Tuss1-T24","Tuss3-T24")
# Transpose all but the first column ("ID")
cts_dna_all_alpha<-as.data.frame(t(cts_dna_all_alpha))
Richness: Determine the number of unique genecalls that have more than one count recorded within each sample.
# Sum up the number of non-zero entries per row
apply(cts_dna_all_alpha>0,1,sum)
WS1-T0 WS2-T0 WS3-T0 WS1-T4 WS2-T4 WS3-T4 WS1-T24 WS2-T24 WS3-T24 Tuss1-T0
498790 449289 446055 480455 476840 449096 493473 476967 470771 353537
Tuss2-T0 Tuss1-T4 Tuss2-T4 Tuss1-T24 Tuss3-T24
303729 207538 286227 320923 266596
Abundance: Determine the total abundance of individual genecalls
# sum up the number of non-zero entries per row (1)
apply(cts_dna_all_alpha,1,sum)
WS1-T0 WS2-T0 WS3-T0 WS1-T4 WS2-T4 WS3-T4 WS1-T24 WS2-T24 WS3-T24 Tuss1-T0
4797267 3727362 2610321 4125347 5342266 3147324 4491751 4779012 4997299 5797561
Tuss2-T0 Tuss1-T4 Tuss2-T4 Tuss1-T24 Tuss3-T24
3638615 960585 3147090 3535500 1996760
Shannon-Wiener Diversity Index (H’): An information index that quantifies the uncertainty associated with predicting the identity of a new genecall given the total number of genecalls and the evenness in count abundances within each genecall.
diversity(cts_dna_all_alpha, index="shannon")
WS1-T0 WS2-T0 WS3-T0 WS1-T4 WS2-T4 WS3-T4 WS1-T24 WS2-T24 WS3-T24 Tuss1-T0
12.48649 12.28717 12.48400 12.41941 12.33807 12.47050 12.45966 12.31243 12.39459 11.98654
Tuss2-T0 Tuss1-T4 Tuss2-T4 Tuss1-T24 Tuss3-T24
11.75993 11.69615 11.89319 11.95821 11.61855
The following table is a summary of the Alpha Diversity on the raw gene counts for the Potential Annotated Genes (potential annotated genes for downstream analyses) by sample.
| ID | Tundra | Time | Replicate | Gene Richness | Gene Abundance | Gene Diversity (H') |
|---|---|---|---|---|---|---|
| S107107 | Tussock | T0 | 1 | 353,537 | 5,797,561 | 11.98654 |
| S107108 | Tussock | T0 | 2 | 303,729 | 3,638,615 | 11.75993 |
| S107116 | Tussock | T4 | 1 | 207,538 | 960,585 | 11.69615 |
| S107117 | Tussock | T4 | 2 | 286,227 | 3,147,090 | 11.89319 |
| S107119 | Tussock | T24 | 1 | 320,923 | 3,535,500 | 11.95821 |
| S107121 | Tussock | T24 | 3 | 266,596 | 1,996,760 | 11.61855 |
| S107104 | Wet Sedge | T0 | 1 | 498,790 | 4,797,267 | 12.48649 |
| S107105 | Wet Sedge | T0 | 2 | 449,289 | 3,727,362 | 12.28717 |
| S107106 | Wet Sedge | T0 | 3 | 446,055 | 2,610,321 | 12.48400 |
| S107110 | Wet Sedge | T4 | 1 | 480,455 | 4,125,347 | 12.41941 |
| S107111 | Wet Sedge | T4 | 2 | 476,840 | 5,342,266 | 12.33807 |
| S107112 | Wet Sedge | T4 | 3 | 449,096 | 3,147,324 | 12.47050 |
| S107113 | Wet Sedge | T24 | 1 | 493,473 | 4,491,751 | 12.45966 |
| S107114 | Wet Sedge | T24 | 2 | 476,967 | 4,779,012 | 12.31243 |
| S107115 | Wet Sedge | T24 | 3 | 470,771 | 4,997,299 | 12.39459 |
Run paired t-test to determine significance.
t.test(alpha.dna.t.test$Tuss,alpha.dna.t.test$WS,paired=TRUE)
Paired t-test
data: alpha.dna.t.test$Tuss and alpha.dna.t.test$WS
t = -7.9613, df = 5, p-value = 0.0005042
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.7877932 -0.4032302
sample estimates:
mean of the differences
-0.5955117
Plot the Alpha Diversity summary.
# Re-arrange samples for plotting PCoA
alpha.dna$Tundra <- factor(alpha.dna$Tundra, levels=c("Tuss-MG-T0","Tuss-MG-T4","Tuss-MG-T24","WS-MG-T0","WS-MG-T4","WS-MG-T24"))
alpha.dna.boxplot<-ggplot(alpha.dna, aes(x=Tundra, y=H, fill=Tundra)) +
geom_boxplot() + ylab(expression(atop("Shannon-Wiener", paste("Diversity Index (H')")))) + theme_classic() + theme(axis.title.x=element_blank(), axis.text=element_text(size=10), axis.text.x = element_text(angle = 270, hjust=0, vjust=0.5), axis.title=element_text(size=12), legend.position = "none", legend.title=element_blank(), legend.text=element_text(size=10)) + ylim(11.5,13) + annotate(geom="text", x=3.5, y=13, label="Paired t-test") + annotate("text", x=3.5, y=12.9, size=3, label = "Tundra~(italic(p) < 0.001)", parse = TRUE) + scale_fill_manual(values=c("gray83", "gray83", "gray83", "black", "black", "black"), breaks=c("Tuss-MG-T0", "WS-MG-T0", "Tuss-MG-T4", "WS-MG-T4", "Tuss-MG-T24", "WS-MG-T24"), labels=c("Tuss-MG-T0", "WS-MG-T0", "Tuss-MG-T4", "WS-MG-T4", "Tuss-MG-T24", "WS-MG-T24"))
alpha.dna.boxplot
We calculated beta diversity as the variation in gene potential between our two tundra ecosystems using Bray-Curtis similarity values.
First, subset the columns of interest from the gpm_all_pot_ann object (ID and all samples) to analyze the beta diversity of both ecosystems together. Create a gpm_groups object with columns for sample ID, vegetation type, and timepoint to be used for calculating pairwise similiarities among samples via PERMANOVA (adonis from Vegan package)
# Subset the gpm_all_pot_ann file to prepare for beta diversity analysis
gpm_all_beta<-subset(gpm_all_pot_ann, select=c(ID,S107104,S107105,S107106,S107110,S107111,S107112,S107113,S107114,S107115,S107107,S107108,S107116,S107117,S107119,S107121))
names(gpm_all_beta)<-c("ID","WS1-T0","WS2-T0","WS3-T0","WS1-T4","WS2-T4","WS3-T4","WS1-T24","WS2-T24","WS3-T24","Tuss1-T0","Tuss2-T0","Tuss1-T4","Tuss2-T4","Tuss1-T24","Tuss3-T24")
# Remember "ID" column as non-numeric values
gpm_all_beta_ID<-gpm_all_beta$ID
# Transpose all but the first column ("ID")
gpm_all_beta<-as.data.frame(t(gpm_all_beta[,-1]))
colnames(gpm_all_beta)<-gpm_all_beta_ID
# Make gpm_all_groups object
gpm_all_sample<-c("WS1-T0","WS2-T0","WS3-T0","WS1-T4","WS2-T4","WS3-T4","WS1-T24","WS2-T24","WS3-T24","Tuss1-T0","Tuss2-T0","Tuss1-T4","Tuss2-T4","Tuss1-T24","Tuss3-T24")
gpm_all_veg<-c("WS","WS","WS","WS","WS","WS","WS","WS","WS","TUSS","TUSS","TUSS","TUSS","TUSS","TUSS")
gpm_all_time<-c("T0","T0","T0","T4","T4","T4","T24","T24","T24","T0","T0","T4","T4","T24","T24")
gpm_all_groups<-data.frame(gpm_all_sample,gpm_all_veg,gpm_all_time)
# Convert the values in Column 1 ("gpm_sample") into row names
rownames(gpm_all_groups)<-gpm_all_groups[,1]
gpm_all_groups<-gpm_all_groups[,-1]
Second, subset the columns of interest from the gpm_all_pot_ann object (ID and all samples) to analyze the beta diversity of the Wet Sedge ecosystem
# Subset the gpm_ws_exp_ann file to prepare for beta diversity analysis
gpm_ws_beta<-subset(gpm_all_pot_ann, select=c(ID,S107104,S107105,S107106,S107110,S107111,S107112,S107113,S107114,S107115))
names(gpm_ws_beta)<-c("ID","WS1-T0","WS2-T0","WS3-T0","WS1-T4","WS2-T4","WS3-T4","WS1-T24","WS2-T24","WS3-T24")
# Remember "ID" column as non-numeric values
gpm_ws_beta_ID<-gpm_ws_beta$ID
# Transpose all but the first column ("ID")
gpm_ws_beta<-as.data.frame(t(gpm_ws_beta[,-1]))
colnames(gpm_ws_beta)<-gpm_ws_beta_ID
# Make gpm_ws_groups object
gpm_ws_sample<-c("WS1-T0","WS2-T0","WS3-T0","WS1-T4","WS2-T4","WS3-T4","WS1-T24","WS2-T24","WS3-T24")
gpm_ws_veg<-c("WS","WS","WS","WS","WS","WS","WS","WS","WS")
gpm_ws_time<-c("T0","T0","T0","T4","T4","T4","T24","T24","T24")
gpm_ws_groups<-data.frame(gpm_ws_sample,gpm_ws_veg,gpm_ws_time)
# Convert the values in Column 1 ("gpm_sample") into row names
rownames(gpm_ws_groups)<-gpm_ws_groups[,1]
gpm_ws_groups<-gpm_ws_groups[,-1]
Third, subset the columns of interest from the gpm_all_pot_ann object (ID and all samples) to analyze the beta diversity of the Tussock ecosystem
# Subset the gpm_tuss_exp_ann file to prepare for beta diversity analysis
gpm_tuss_beta<-subset(gpm_all_pot_ann, select=c(ID,S107107,S107108,S107116,S107117,S107119,S107121))
names(gpm_tuss_beta)<-c("ID","Tuss1-T0","Tuss2-T0","Tuss1-T4","Tuss2-T4","Tuss1-T24","Tuss3-T24")
# Remember "ID" column as non-numeric values
gpm_tuss_beta_ID<-gpm_tuss_beta$ID
# Transpose all but the first column ("ID")
gpm_tuss_beta<-as.data.frame(t(gpm_tuss_beta[,-1]))
colnames(gpm_tuss_beta)<-gpm_tuss_beta_ID
# Make gpm_tuss_groups object
gpm_tuss_sample<-c("Tuss1-T0","Tuss2-T0","Tuss1-T4","Tuss2-T4","Tuss1-T24","Tuss3-T24")
gpm_tuss_veg<-c("TUSS","TUSS","TUSS","TUSS","TUSS","TUSS")
gpm_tuss_time<-c("T0","T0","T4","T4","T24","T24")
gpm_tuss_groups<-data.frame(gpm_tuss_sample,gpm_tuss_veg,gpm_tuss_time)
# Convert the values in Column 1 ("gpm_sample") into row names
rownames(gpm_tuss_groups)<-gpm_tuss_groups[,1]
gpm_tuss_groups<-gpm_tuss_groups[,-1]
Calculate the Bray-Curtis dissimilarity matrix to use with adonis
# Bray-Curtis Dissimilarity Matrix for gpm_all_beta
gpm.all.bc.dist<-vegdist(gpm_all_beta, method = "bray")
# Bray-Curtis Dissimilarity Matrix for gpm_ws_beta
gpm.ws.bc.dist<-vegdist(gpm_ws_beta, method = "bray")
# Bray-Curtis Dissimilarity Matrix for gpm_tuss_beta
gpm.tuss.bc.dist<-vegdist(gpm_tuss_beta, method = "bray")
Principal coordinates analysis (PCoA, also known as metric multidimensional scaling) attempts to represent the distances between samples in a low-dimensional, Euclidean space. In particular, it maximizes the linear correlation between the distances in the distance matrix, and the distances in a space of low dimension (typically, 2 or 3 axes are selected).
The first step of a PCoA is the construction of a (dis)similarity matrix. While PCA (principal component analysis) is based on Euclidean distances, PCoA can handle (dis)similarity matrices calculated from quantitative, semi-quantitative, qualitative, and mixed variables. As always, the choice of (dis)similarity measure is critical and must be suitable to the data in question. For count data, Bray-Curtis distance is recommended. You can use Jaccard index for presence/absence data. When the distance metric is Euclidean, PCoA is equivalent to PCA.
The first PCoA compares similarity between tundra ecosystems by incorporating the combined dataset.
# Run PCoA analysis on the Bray-Curtis dissimilarity matrix
gpm.all.bc.pcoa<-pcoa(gpm.all.bc.dist)
Extract the eigenvalues and calculate the variation explained by each axis.
pcoa.dna.all<-cmdscale(gpm.all.bc.dist, eig=TRUE, k=2)
pcoa.dna.all
$points
[,1] [,2]
WS1-T0 -0.3547752 0.04884660
WS2-T0 -0.3522181 0.22900408
WS3-T0 -0.3024513 -0.22952916
WS1-T4 -0.3580691 0.05718926
WS2-T4 -0.3473847 0.23428604
WS3-T4 -0.2826346 -0.29264205
WS1-T24 -0.3549740 0.06112501
WS2-T24 -0.3450039 0.18226438
WS3-T24 -0.2466910 -0.35532682
Tuss1-T0 0.5215337 0.03092877
Tuss2-T0 0.4996903 0.03380238
Tuss1-T4 0.5016452 0.03720668
Tuss2-T4 0.5235905 0.04159448
Tuss1-T24 0.5119603 0.01237679
Tuss3-T24 0.3857819 -0.09112643
$eig
[1] 2.434476e+00 4.281970e-01 2.090461e-01 1.589313e-01 1.498846e-01 8.520537e-02 7.310245e-02
[8] 6.279407e-02 5.962779e-02 5.071524e-02 4.608667e-02 4.261096e-02 4.030625e-02 3.946897e-02
[15] 5.789029e-17
$x
NULL
$ac
[1] 0
$GOF
[1] 0.7377163 0.7377163
# Calculate the percent variation explained by Axis 1
pcoa.dna.all.exp.var1<-round(pcoa.dna.all$eig[1] / sum(pcoa.dna.all$eig), 2) * 100
pcoa.dna.all.exp.var1
[1] 63
# Calculate the percent variation explained by Axis 2
pcoa.dna.all.exp.var2<-round(pcoa.dna.all$eig[2] / sum(pcoa.dna.all$eig), 2) * 100
pcoa.dna.all.exp.var2
[1] 11
# Sum the percent variation explained by both axes
pcoa.dna.all.sum.eig<-sum(pcoa.dna.all.exp.var1, pcoa.dna.all.exp.var2)
pcoa.dna.all.sum.eig
[1] 74
Plot the PCoA.
# Extract the plot scores from first two PCoA axes
gpm.all.bc.pcoa.axes <- gpm.all.bc.pcoa$vectors[,c(1,2)]
# Create new object with PCoA axes
all.dna.bc.pcoa<-data.frame(gpm.all.bc.pcoa.axes)
# Add Timepoint variable to the dataframe
all.dna.timepoint<-c('WS-MG-T0','WS-MG-T0','WS-MG-T0','WS-MG-T4','WS-MG-T4','WS-MG-T4','WS-MG-T24','WS-MG-T24','WS-MG-T24','Tuss-MG-T0','Tuss-MG-T0','Tuss-MG-T4','Tuss-MG-T4','Tuss-MG-T24','Tuss-MG-T24')
all.dna.pcoa.1<-all.dna.bc.pcoa$Axis.1
all.dna.pcoa.2<-all.dna.bc.pcoa$Axis.2
all.dna.bc.pcoa<-data.frame(all.dna.timepoint,all.dna.pcoa.1,all.dna.pcoa.2)
# Rename column headings
colnames(all.dna.bc.pcoa)<-c("Timepoint","PCoA_1","PCoA_2")
# Re-arrange samples for plotting PCoA
all.dna.bc.pcoa$Timepoint <- factor(all.dna.bc.pcoa$Timepoint, levels=c("Tuss-MG-T0","WS-MG-T0","Tuss-MG-T4","WS-MG-T4","Tuss-MG-T24","WS-MG-T24"))
# Plot the PCoA
all.dna.pcoa<-ggplot(data=all.dna.bc.pcoa, aes(x=PCoA_1, y=PCoA_2, group=Timepoint, color=Timepoint)) + geom_point(aes(shape=Timepoint, size=0.75, fill=Timepoint)) + scale_fill_manual(values=c("gray83", "black", "gray83", "black", "gray83", "black")) + scale_shape_manual(values=c(21,21,22,22,24,24)) + scale_color_manual(values=c("black", "black", "black", "black", "black", "black")) + xlab("PCoA 1 (63%)") + ylab("PCoA 2 (11%)") + theme_classic() + theme(axis.text=element_text(size=10), axis.title=element_text(size=12), legend.position = "bottom", legend.title=element_blank(), legend.text=element_text(size=10)) + guides(shape = guide_legend(override.aes = list(size = 3))) + scale_size(guide="none") + geom_hline(aes(yintercept = 0), color="gray", linetype="dashed") + geom_vline(aes(xintercept=0), color="gray", linetype="dashed") + ylim(-0.6,0.6) + xlim(-0.6,0.6) + annotate(geom="text", x=0.0, y=0.59, label="PERMANOVA") + annotate("text", x = 0, y = 0.52, size=3, label = "Tundra~(italic(p) == 0.001)", parse = TRUE) + annotate("text", x = 0, y = 0.45, size=3, label = "Time~Point~(italic(p) == 0.755)", parse = TRUE) + annotate("text", x = 0, y = 0.38, size=3, label = "Tundra:Time~Point~(italic(p) == 0.622)", parse = TRUE)
all.dna.pcoa
The second PCoA compares similarity between sampling time points within the Tussock tundra ecosystem.
# Run PCoA analysis on the Bray-Curtis dissimilarity matrix
gpm.tuss.bc.pcoa<-pcoa(gpm.tuss.bc.dist)
Extract the eigenvalues and calculate the variation explained by each axis.
pcoa.dna.tuss<-cmdscale(gpm.tuss.bc.dist, eig=TRUE, k=2)
pcoa.dna.tuss
$points
[,1] [,2]
Tuss1-T0 -0.07908183 -0.007909558
Tuss2-T0 -0.10503550 0.223722911
Tuss1-T4 -0.08734384 -0.288993333
Tuss2-T4 -0.12372951 0.110993812
Tuss1-T24 -0.03580132 -0.059260114
Tuss3-T24 0.43099199 0.021446282
$eig
[1] 2.272602e-01 1.499230e-01 8.571933e-02 5.982460e-02 3.952645e-02 -5.141230e-18
$x
NULL
$ac
[1] 0
$GOF
[1] 0.6708418 0.6708418
# Calculate the percent variation explained by Axis 1
pcoa.dna.tuss.exp.var1<-round(pcoa.dna.tuss$eig[1] / sum(pcoa.dna.tuss$eig), 2) * 100
pcoa.dna.tuss.exp.var1
[1] 40
# Calculate the percent variation explained by Axis 2
pcoa.dna.tuss.exp.var2<-round(pcoa.dna.tuss$eig[2] / sum(pcoa.dna.tuss$eig), 2) * 100
pcoa.dna.tuss.exp.var2
[1] 27
# Sum the percent variation explained by both axes
pcoa.dna.tuss.sum.eig<-sum(pcoa.dna.tuss.exp.var1, pcoa.dna.tuss.exp.var2)
pcoa.dna.tuss.sum.eig
[1] 67
Plot the PCoA.
# Extract the plot scores from first two PCoA axes
gpm.tuss.bc.pcoa.axes <- gpm.tuss.bc.pcoa$vectors[,c(1,2)]
# Create new object with PCoA axes
tuss.dna.bc.pcoa<-data.frame(gpm.tuss.bc.pcoa.axes)
# Add Timepoint variable to the dataframe
tuss.dna.timepoint<-c('Tuss-T0','Tuss-T0','Tuss-T4','Tuss-T4','Tuss-T24','Tuss-T24')
tuss.dna.pcoa.1<-tuss.dna.bc.pcoa$Axis.1
tuss.dna.pcoa.2<-tuss.dna.bc.pcoa$Axis.2
tuss.dna.bc.pcoa<-data.frame(tuss.dna.timepoint,tuss.dna.pcoa.1,tuss.dna.pcoa.2)
# Rename column headings
colnames(tuss.dna.bc.pcoa)<-c("Timepoint","PCoA_1","PCoA_2")
#tuss.dna.bc.pcoa
# Re-arrange samples for plotting PCoA
tuss.dna.bc.pcoa$Timepoint<-factor(tuss.dna.bc.pcoa$Timepoint, levels=c("Tuss-T0","Tuss-T4","Tuss-T24"))
# Plot the PCoA
tuss.dna.pcoa<-ggplot(data=tuss.dna.bc.pcoa, aes(x=PCoA_1, y=PCoA_2, group=Timepoint, color=Timepoint)) + geom_point(aes(shape=Timepoint, size=0.75, fill=Timepoint)) + scale_fill_manual(values=c("palegreen","green3","darkgreen")) + scale_shape_manual(values=c(21,24,22)) + scale_color_manual(values=c("black","black","black")) + xlab("PCoA 1 (40%)") + ylab("PCoA 2 (27%)") + theme_minimal() + theme(axis.text=element_text(size=12),axis.title=element_text(size=14))+ theme(legend.position = "bottom", legend.title=element_blank(), legend.text=element_text(size=12), panel.background = element_blank(), panel.grid.major=element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(colour = "gray"), panel.border = element_rect(colour = "gray", fill=NA, size=1)) + scale_size(guide=FALSE) + guides(shape = guide_legend(override.aes = list(size = 3))) + geom_hline(aes(yintercept = 0), color="gray", linetype="dashed") + geom_vline(aes(xintercept=0), color="gray", linetype="dashed") + ylim(-0.5,0.5) + xlim(-0.5,0.5) + annotate(geom="text", x=0.0, y=0.49, label="PERMANOVA") + annotate("text", x = 0, y = 0.42, size=3, label = "Time~Point~(italic(p) == 0.400)", parse = TRUE)
tuss.dna.pcoa
The third PCoA compares similarity between sampling time points within the Wet Sedge tundra ecosystem.
# Run PCoA analysis on the Bray-Curtis dissimilarity matrix
gpm.ws.bc.pcoa<-pcoa(gpm.ws.bc.dist)
Extract the eigenvalues and calculate the variation explained by each axis.
pcoa.dna.ws<-cmdscale(gpm.ws.bc.dist, eig=TRUE, k=2)
pcoa.dna.ws
$points
[,1] [,2]
WS1-T0 0.06162191 0.18032456
WS2-T0 0.23783608 -0.09068283
WS3-T0 -0.22544893 -0.01858857
WS1-T4 0.06996095 0.17149145
WS2-T4 0.24251872 -0.11274994
WS3-T4 -0.29156305 -0.04368470
WS1-T24 0.07316666 0.18451007
WS2-T24 0.19088361 -0.18508676
WS3-T24 -0.35897595 -0.08553329
$eig
[1] 4.305630e-01 1.607331e-01 7.319049e-02 6.280428e-02 5.078482e-02 4.656522e-02 4.263403e-02
[8] 4.031109e-02 2.654107e-17
$x
NULL
$ac
[1] 0
$GOF
[1] 0.6515042 0.6515042
# Calculate the percent variation explained by Axis 1
pcoa.dna.ws.exp.var1<-round(pcoa.dna.ws$eig[1] / sum(pcoa.dna.ws$eig), 2) * 100
pcoa.dna.ws.exp.var1
[1] 47
# Calculate the percent variation explained by Axis 2
pcoa.dna.ws.exp.var2<-round(pcoa.dna.ws$eig[2] / sum(pcoa.dna.ws$eig), 2) * 100
pcoa.dna.ws.exp.var2
[1] 18
# Sum the percent variation explained by both axes
pcoa.dna.ws.sum.eig<-sum(pcoa.dna.ws.exp.var1, pcoa.dna.ws.exp.var2)
pcoa.dna.ws.sum.eig
[1] 65
Plot the PCoA.
# Extract the plot scores from first two PCoA axes
gpm.ws.bc.pcoa.axes <- gpm.ws.bc.pcoa$vectors[,c(1,2)]
# Create new object with PCoA axes
ws.dna.bc.pcoa<-data.frame(gpm.ws.bc.pcoa.axes)
# Add Timepoint variable to the dataframe
ws.dna.timepoint<-c('WS-T0','WS-T0','WS-T0','WS-T4','WS-T4','WS-T4','WS-T24','WS-T24','WS-T24')
ws.dna.pcoa.1<-ws.dna.bc.pcoa$Axis.1
ws.dna.pcoa.2<-ws.dna.bc.pcoa$Axis.2
ws.dna.bc.pcoa<-data.frame(ws.dna.timepoint,ws.dna.pcoa.1,ws.dna.pcoa.2)
# Rename column headings
colnames(ws.dna.bc.pcoa)<-c("Timepoint","PCoA_1","PCoA_2")
#ws.dna.bc.pcoa
# Re-arrange samples for plotting PCoA
ws.dna.bc.pcoa$Timepoint<-factor(ws.dna.bc.pcoa$Timepoint, levels=c("WS-T0","WS-T4","WS-T24"))
# Plot the PCoA
ws.dna.pcoa<-ggplot(data=ws.dna.bc.pcoa, aes(x=PCoA_1, y=PCoA_2, group=Timepoint, color=Timepoint)) + geom_point(aes(shape=Timepoint, size=0.75, fill=Timepoint)) + scale_fill_manual(values=c("black","black","black")) + scale_shape_manual(values=c(21,24,22)) + scale_color_manual(values=c("gray83","gray83","gray83")) + xlab("PCoA 1 (47%)") + ylab("PCoA 2 (18%)") + theme_minimal() + theme(axis.text=element_text(size=12),axis.title=element_text(size=14))+ theme(legend.position = "bottom", legend.title=element_blank(), legend.text=element_text(size=12), panel.background = element_blank(), panel.grid.major=element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(colour = "gray"), panel.border = element_rect(colour = "gray", fill=NA, size=1)) + scale_size(guide=FALSE) + guides(shape = guide_legend(override.aes = list(size = 3))) + geom_hline(aes(yintercept = 0), color="gray", linetype="dashed") + geom_vline(aes(xintercept=0), color="gray", linetype="dashed") + ylim(-0.5,0.5) + xlim(-0.5,0.5) + annotate(geom="text", x=0.0, y=0.49, label="PERMANOVA") + annotate("text", x = 0, y = 0.42, size=3, label = "Time~Point~(italic(p) == 0.916)", parse = TRUE)
ws.dna.pcoa
adonis function of the Vegan packagePERMANOVA - GPM-All - Bray-Curtis
adonis(gpm.all.bc.dist~gpm_all_veg*gpm_all_time, data=gpm_all_groups)
Call:
adonis(formula = gpm.all.bc.dist ~ gpm_all_veg * gpm_all_time, data = gpm_all_groups)
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
gpm_all_veg 1 2.4106 2.41061 19.2740 0.62122 0.001 ***
gpm_all_time 2 0.1639 0.08197 0.6554 0.04225 0.766
gpm_all_veg:gpm_all_time 2 0.1803 0.09014 0.7207 0.04646 0.613
Residuals 9 1.1256 0.12507 0.29008
Total 14 3.8805 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
PERMANOVA - GPM-Tuss - Bray-Curtis
adonis(gpm.tuss.bc.dist~gpm_tuss_time, data=gpm_tuss_groups)
'nperm' >= set of all permutations: complete enumeration.
Set of permutations < 'minperm'. Generating entire set.
Call:
adonis(formula = gpm.tuss.bc.dist ~ gpm_tuss_time, data = gpm_tuss_groups)
Permutation: free
Number of permutations: 719
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
gpm_tuss_time 2 0.23188 0.11594 1.0528 0.41241 0.4
Residuals 3 0.33037 0.11013 0.58759
Total 5 0.56225 1.00000
PERMANOVA - GPM-WS - Bray-Curtis
adonis(gpm.ws.bc.dist~gpm_ws_time, data=gpm_ws_groups)
Call:
adonis(formula = gpm.ws.bc.dist ~ gpm_ws_time, data = gpm_ws_groups)
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
gpm_ws_time 2 0.11233 0.056163 0.42373 0.12376 0.917
Residuals 6 0.79526 0.132543 0.87624
Total 8 0.90759 1.00000
Look for differences in gene expression at multiple KEGG tier levels (II-IV) using GPM-normalized values.
Look at differences in gene expression patterns for multiple Tier levels within the Tussock tundra between sampling time points.
# Calculate GPM-normalized gene count sums for each Tuss sample in each Tier II category
tuss_dna_tII_sum<-gpm_all_pot_ann %>% group_by(Tier_II) %>% summarise_at(vars("S107107","S107108","S107116","S107117","S107119","S107121"), sum)
tuss_dna_tII_sum<-tuss_dna_tII_sum %>% mutate_at(vars(S107107,S107108,S107116,S107117,S107119,S107121), funs(round(., 0)))
tuss_dna_tII_sum<-as.data.frame(tuss_dna_tII_sum)
tuss_dna_tII_sum<-tuss_dna_tII_sum[order(-tuss_dna_tII_sum$S107107),]
Heatmap for Tier II categories by sampling timepoints
# Create copy of "ws_tII_sum" object to prep for heatmap
tuss_dna_tII_heatmap<-tuss_dna_tII_sum
# Convert the first column (Tier categories) into rownames
rownames(tuss_dna_tII_heatmap) <- tuss_dna_tII_heatmap$Tier_II
tuss_dna_tII_heatmap<-as.data.frame(tuss_dna_tII_heatmap[-1])
# Rename columns to sample IDs
colnames(tuss_dna_tII_heatmap)<-c("Tuss1-T0","Tuss2-T0","Tuss1-T4","Tuss2-T4","Tuss1-T24","Tuss3-T24")
# Remove rows for "Organismal Systems" and "Human Diseases"
#tuss_dna_tII_heatmap <- tuss_dna_tII_heatmap[-c(5,6),]
# Convert dataframe into a matrix for heatmap
tuss_dna_tII_heatmap<-as.matrix(tuss_dna_tII_heatmap)
# Scale matrix values to generate Z-scores
tuss_dna_tII_heatmap<-scale(t(tuss_dna_tII_heatmap))
tuss_dna_tII_heatmap<-t(tuss_dna_tII_heatmap)
# Specify RColorBrewer custom color palette
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
# Pheatmap
pheatmap(tuss_dna_tII_heatmap, treeheight_row = 0, treeheight_col = 0,cluster_rows = FALSE, cluster_cols = FALSE)
Summed values by replicate for each KEGG tier II category.
| Tier_II | S107107 | S107108 | S107116 | S107117 | S107119 | S107121 | |
|---|---|---|---|---|---|---|---|
| 4 | Metabolism | 696,177 | 697,050 | 691,198 | 696,295 | 690,674 | 673,053 |
| 3 | Genetic Information Processing | 154,049 | 153,683 | 156,607 | 153,035 | 156,642 | 155,088 |
| 2 | Environmental Information Processing | 105,480 | 103,922 | 106,705 | 106,132 | 107,735 | 121,853 |
| 1 | Cellular Processes | 44,293 | 45,345 | 45,489 | 44,538 | 44,949 | 50,006 |
Calculate mean values from the replicate sums
# Calculate mean values from the replicate sums
tuss_dna_tII_sum$meantuss_T0<-apply(tuss_dna_tII_sum[,2:3], 1, mean)
tuss_dna_tII_sum$meantuss_T4<-apply(tuss_dna_tII_sum[,4:5], 1, mean)
tuss_dna_tII_sum$meantuss_T24<-apply(tuss_dna_tII_sum[,6:7], 1, mean)
tuss_dna_tII_sum<-tuss_dna_tII_sum %>% mutate_at(vars(meantuss_T0,meantuss_T4,meantuss_T24), funs(round(., 0)))
# Calculate sd values from the replicate sums
tuss_dna_tII_sum$sdtuss_T0<-apply(tuss_dna_tII_sum[,2:3], 1, sd)
tuss_dna_tII_sum$sdtuss_T4<-apply(tuss_dna_tII_sum[,4:5], 1, sd)
tuss_dna_tII_sum$sdtuss_T24<-apply(tuss_dna_tII_sum[,6:7], 1, sd)
tuss_dna_tII_sum<-tuss_dna_tII_sum %>% mutate_at(vars(sdtuss_T0,sdtuss_T4,sdtuss_T24), funs(round(., 0)))
# Subset Mean values
tuss_dna_tII_mean<-subset(tuss_dna_tII_sum, select=c(Tier_II,meantuss_T0,meantuss_T4,meantuss_T24))
names(tuss_dna_tII_mean)<-c("Tier II", "Tuss-T0", "Tuss-T4", "Tuss-T24")
# Subset SD values
tuss_dna_tII_sd<-subset(tuss_dna_tII_sum, select=c(Tier_II,sdtuss_T0,sdtuss_T4,sdtuss_T24))
names(tuss_dna_tII_sd)<-c("Tier II", "Tuss-T0", "Tuss-T4", "Tuss-T24")
# Remember "Tier II" as non-numeric values
tuss_dna_tII_mean_TierII<-tuss_dna_tII_mean$`Tier II`
tuss_dna_tII_sd_TierII<-tuss_dna_tII_sd$`Tier II`
# Transpose all but first column (Tier II)
tuss_dna_tII_mean<-as.data.frame(t(tuss_dna_tII_mean[,-1]))
colnames(tuss_dna_tII_mean)<-tuss_dna_tII_mean_TierII
tuss_dna_tII_sd<-as.data.frame(t(tuss_dna_tII_sd[,-1]))
colnames(tuss_dna_tII_sd)<-tuss_dna_tII_sd_TierII
# Combine mean and sd into single column with ± divider
tuss_dna_tII_table<-as.data.frame(do.call(cbind, lapply(1:ncol(tuss_dna_tII_mean), function(i) paste0(tuss_dna_tII_mean[ , i], " ± ", tuss_dna_tII_sd[ , i]))))
# Transpose the table back
tuss_dna_tII_table<-t(tuss_dna_tII_table)
# Rename columns to Sites
colnames(tuss_dna_tII_table)<-c("Tussock (T0)", "Tussock (T4)", "Tussock (T24)")
rownames(tuss_dna_tII_table)<-tuss_dna_tII_mean_TierII
kable(tuss_dna_tII_table, caption = "Tussock Tier II KEGG category averages (± standard deviation) by sample for Metagenomes.", format.args = list(big.mark=","), align = "l") %>% kable_styling(bootstrap_options = c("striped","hover","condensed"))
| Tussock (T0) | Tussock (T4) | Tussock (T24) | |
|---|---|---|---|
| Metabolism | 696614 ± 617 | 693746 ± 3604 | 681864 ± 12460 |
| Genetic Information Processing | 153866 ± 259 | 154821 ± 2526 | 155865 ± 1099 |
| Environmental Information Processing | 104701 ± 1102 | 106418 ± 405 | 114794 ± 9983 |
| Cellular Processes | 44819 ± 744 | 45014 ± 672 | 47478 ± 3576 |
Run statistical tests to determine if significant differences exist between sampling timepoints for each Tier KEGG category in the metagenome. Click on the Show/Hide button to see the statistics.
# Subset the "sum" values for each sample
tuss_dna_tII_stats<-subset(tuss_dna_tII_sum, select=c(Tier_II,S107107,S107108,S107116,S107117,S107119,S107121))
rownames(tuss_dna_tII_stats) <- tuss_dna_tII_stats$Tier_II
tuss_dna_tII_stats<-as.data.frame(t(tuss_dna_tII_stats[-1]))
# Create a column with timepoint variable names
Timepoint<-c("T0","T0","T4","T4","T24","T24")
Timepoint<-data.frame(Timepoint)
# Add the timepoint column to the sum table
tuss_dna_tII_stats<-data.frame(Timepoint,tuss_dna_tII_stats)
# Subset response variables for MANOVA
tuss_dna_tII_stats$response <- as.matrix(tuss_dna_tII_stats[, 2:5])
# MANOVA test
tuss_dna_tII_manova <- manova(response ~ Timepoint, data=tuss_dna_tII_stats)
summary.aov(tuss_dna_tII_manova)
Response Metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 244658585 122329293 2.1764 0.2606
Residuals 3 168620589 56206863
Response Genetic.Information.Processing. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 3998641 1999321 0.7836 0.5323
Residuals 3 7654028 2551343
Response Environmental.Information.Processing. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 116644970 58322485 1.7317 0.3162
Residuals 3 101036809 33678936
Response Cellular.Processes. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 8784499 4392249 0.9554 0.4775
Residuals 3 13792177 4597392
Run ANOVA for each category of interest (significant in MANOVA)
# Metabolism
tuss_dna_tII_aov1<-aov(Metabolism.~Timepoint,data=tuss_dna_tII_stats)
summary.aov(tuss_dna_tII_aov1)
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 244658585 122329293 2.176 0.261
Residuals 3 168620589 56206863
TukeyHSD(tuss_dna_tII_aov1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Metabolism. ~ Timepoint, data = tuss_dna_tII_stats)
$Timepoint
diff lwr upr p adj
T24-T0 -14750 -46078.7 16578.7 0.2664497
T4-T0 -2867 -34195.7 28461.7 0.9245050
T4-T24 11883 -19445.7 43211.7 0.3760893
# Genetic Information Processing
tuss_dna_tII_aov2<-aov(Genetic.Information.Processing.~Timepoint,data=tuss_dna_tII_stats)
summary.aov(tuss_dna_tII_aov2)
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 3998641 1999321 0.784 0.532
Residuals 3 7654028 2551343
TukeyHSD(tuss_dna_tII_aov2)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Genetic.Information.Processing. ~ Timepoint, data = tuss_dna_tII_stats)
$Timepoint
diff lwr upr p adj
T24-T0 1999 -4675.706 8673.706 0.5065378
T4-T0 955 -5719.706 7629.706 0.8311240
T4-T24 -1044 -7718.706 5630.706 0.8037170
# Environmental Information Processing
tuss_dna_tII_aov3<-aov(Environmental.Information.Processing.~Timepoint,data=tuss_dna_tII_stats)
summary.aov(tuss_dna_tII_aov3)
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 116644970 58322485 1.732 0.316
Residuals 3 101036809 33678936
TukeyHSD(tuss_dna_tII_aov3)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Environmental.Information.Processing. ~ Timepoint, data = tuss_dna_tII_stats)
$Timepoint
diff lwr upr p adj
T24-T0 10093.0 -14157.85 34343.85 0.3271200
T4-T0 1717.5 -22533.35 25968.35 0.9535973
T4-T24 -8375.5 -32626.35 15875.35 0.4273774
# Cellular Processing
tuss_dna_tII_aov4<-aov(Cellular.Processes.~Timepoint,data=tuss_dna_tII_stats)
summary.aov(tuss_dna_tII_aov4)
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 8784499 4392249 0.955 0.477
Residuals 3 13792177 4597392
TukeyHSD(tuss_dna_tII_aov4)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Cellular.Processes. ~ Timepoint, data = tuss_dna_tII_stats)
$Timepoint
diff lwr upr p adj
T24-T0 2658.5 -6301.407 11618.407 0.5116856
T4-T0 194.5 -8765.407 9154.407 0.9954805
T4-T24 -2464.0 -11423.907 6495.907 0.5532109
Plot the KEGG tier II categories as a barchart.
# Place the KEGG tier II categories in the preferred order for plotting
tuss_dna_tII_bardata$Tier.II <- factor(tuss_dna_tII_bardata$Tier.II,levels = c("Metabolism", "Genetic Information Processing", "Environmental Information Processing", "Cellular Processes"))
tuss_dna_tII_bardata$Sample <- factor(tuss_dna_tII_bardata$Sample,levels=c("Tuss-T0","Tuss-T4","Tuss-T24"))
tuss_dna_tII_barplot<-ggplot(tuss_dna_tII_bardata, aes(x = Tier.II, y = Mean, fill = Sample)) + geom_bar(stat = "identity", position = "dodge", color="black") + geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.2, position = position_dodge(0.9)) + ylab(expression(atop("KEGG Tier II Categories", paste("Gene Counts (GPM)")))) + theme_minimal() + theme(axis.text=element_text(size=10), axis.title=element_text(size=12), axis.title.x=element_blank()) + theme(legend.position = "bottom", legend.title=element_blank(), legend.text=element_text(size=10), panel.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(colour = "gray"), panel.border = element_rect(colour = "gray", fill=NA, size=1)) + scale_size(guide=FALSE) + scale_fill_manual(values=c("palegreen", "green3", "darkgreen")) + scale_x_discrete(labels = function(Tier.II) str_wrap(Tier.II, width = 10)) + ylim(0,800000) + annotate(geom="text", x=1.0, y=777500, label="italic(N.S.)", parse=TRUE) + annotate(geom="text", x=2.0, y=250000, label="italic(N.S.)", parse=TRUE) + annotate(geom="text", x=3.0, y=200000, label="italic(N.S.)", parse=TRUE) + annotate(geom="text", x=4.0, y=150000, label="italic(N.S.)", parse=TRUE)
tuss_dna_tII_barplot
Calculate GPM-normalized gene count sums for each Tuss sample in each Tier III category.
# Calculate GPM-normalized gene count sums for each Tuss sample in each Tier III category
tuss_dna_tIII_sum<-gpm_all_pot_ann %>% group_by(Tier_III) %>% summarise_at(vars("S107107","S107108","S107116","S107117","S107119","S107121"), sum)
tuss_dna_tIII_sum<-tuss_dna_tIII_sum %>% mutate_at(vars(S107107,S107108,S107116,S107117,S107119,S107121), funs(round(., 0)))
tuss_dna_tIII_sum<-as.data.frame(tuss_dna_tIII_sum)
tuss_dna_tIII_sum<-tuss_dna_tIII_sum[order(-tuss_dna_tIII_sum$S107107),]
# Calculate mean values from the replicate sums
tuss_dna_tIII_sum$meantuss_T0<-apply(tuss_dna_tIII_sum[,2:3], 1, mean)
tuss_dna_tIII_sum$meantuss_T4<-apply(tuss_dna_tIII_sum[,4:5], 1, mean)
tuss_dna_tIII_sum$meantuss_T24<-apply(tuss_dna_tIII_sum[,6:7], 1, mean)
tuss_dna_tIII_sum<-tuss_dna_tIII_sum %>% mutate_at(vars(meantuss_T0,meantuss_T4,meantuss_T24), funs(round(., 0)))
# Calculate sd values from the replicate sums
tuss_dna_tIII_sum$sdtuss_T0<-apply(tuss_dna_tIII_sum[,2:3], 1, sd)
tuss_dna_tIII_sum$sdtuss_T4<-apply(tuss_dna_tIII_sum[,4:5], 1, sd)
tuss_dna_tIII_sum$sdtuss_T24<-apply(tuss_dna_tIII_sum[,6:7], 1, sd)
tuss_dna_tIII_sum<-tuss_dna_tIII_sum %>% mutate_at(vars(sdtuss_T0,sdtuss_T4,sdtuss_T24), funs(round(., 0)))
# Write data to .csv file
write.csv(tuss_dna_tIII_sum, 'Norm.Results/tuss.dna.tIII.mean.sd.csv')
# Subset Mean values
tuss_dna_tIII_mean<-subset(tuss_dna_tIII_sum, select=c(Tier_III,meantuss_T0,meantuss_T4,meantuss_T24))
names(tuss_dna_tIII_mean)<-c("Tier III", "Tuss-T0", "Tuss-T4", "Tuss-T24")
# Subset SD values
tuss_dna_tIII_sd<-subset(tuss_dna_tIII_sum, select=c(Tier_III,sdtuss_T0,sdtuss_T4,sdtuss_T24))
names(tuss_dna_tIII_sd)<-c("Tier III", "Tuss-T0", "Tuss-T4", "Tuss-T24")
# Remember "Tier III" as non-numeric values
tuss_dna_tIII_mean_TierIII<-tuss_dna_tIII_mean$`Tier III`
tuss_dna_tIII_sd_TierIII<-tuss_dna_tIII_sd$`Tier III`
# Transpose all but first column (Tier II)
tuss_dna_tIII_mean<-as.data.frame(t(tuss_dna_tIII_mean[,-1]))
colnames(tuss_dna_tIII_mean)<-tuss_dna_tIII_mean_TierIII
tuss_dna_tIII_sd<-as.data.frame(t(tuss_dna_tIII_sd[,-1]))
colnames(tuss_dna_tIII_sd)<-tuss_dna_tIII_sd_TierIII
# Combine mean and sd into single column with ± divider
tuss_dna_tIII_table<-as.data.frame(do.call(cbind, lapply(1:ncol(tuss_dna_tIII_mean), function(i) paste0(tuss_dna_tIII_mean[ , i], " ± ", tuss_dna_tIII_sd[ , i]))))
# Transpose the table back
tuss_dna_tIII_table<-t(tuss_dna_tIII_table)
# Rename columns to Sites
colnames(tuss_dna_tIII_table)<-c("Tussock (T0)", "Tussock (T4)", "Tussock (T24)")
rownames(tuss_dna_tIII_table)<-tuss_dna_tIII_mean_TierIII
#tuss_dna_tIII_table
Run statistical tests to determine if significant differences exist between sampling timepoints for each Tier KEGG category. Click on the Show/Hide button to see the statistics.
# Subset the "sum" values for each sample
tuss_dna_tIII_stats<-subset(tuss_dna_tIII_sum, select=c(Tier_III,S107107,S107108,S107116,S107117,S107119,S107121))
rownames(tuss_dna_tIII_stats) <- tuss_dna_tIII_stats$Tier_III
tuss_dna_tIII_stats<-as.data.frame(t(tuss_dna_tIII_stats[-1]))
# Add the timepoint column to the sum table
tuss_dna_tIII_stats<-data.frame(Timepoint,tuss_dna_tIII_stats)
# Subset response variables for MANOVA
tuss_dna_tIII_stats$response <- as.matrix(tuss_dna_tIII_stats[, 2:26])
# MANOVA test
tuss_dna_tIII_manova <- manova(response ~ Timepoint, data=tuss_dna_tIII_stats)
summary.aov(tuss_dna_tIII_manova)
Response Overview. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 42034360 21017180 2.0957 0.2694
Residuals 3 30085634 10028545
Response Carbohydrate.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 132577817 66288909 1.79 0.3079
Residuals 3 111100140 37033380
Response Nucleotide.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 959180 479590 1.5945 0.3375
Residuals 3 902320 300773
Response Translation. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 7360526 3680263 3.1084 0.1857
Residuals 3 3551926 1183975
Response Metabolism.of.cofactors.and.vitamins. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 17149824 8574912 3.2617 0.1768
Residuals 3 7886993 2628998
Response Energy.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 9711847 4855923 3.4007 0.1693
Residuals 3 4283807 1427936
Response Membrane.transport. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 69812216 34906108 1.2894 0.3943
Residuals 3 81213631 27071210
Response Amino.acid.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 8808313 4404157 1.2648 0.3996
Residuals 3 10445923 3481974
Response Signal.transduction. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 7370262 3685131 3.5366 0.1625
Residuals 3 3125969 1041990
Response Replication.and.repair. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 1372254 686127 0.8732 0.5025
Residuals 3 2357335 785778
Response Folding..sorting.and.degradation. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 773563 386782 0.7449 0.5462
Residuals 3 1557649 519216
Response Lipid.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 2805296 1402648 1.0812 0.443
Residuals 3 3891903 1297301
Response Cellular.community...prokaryotes. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 190170 95085 0.3855 0.7096
Residuals 3 740006 246669
Response Metabolism.of.other.amino.acids. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 240681 120341 0.7848 0.5319
Residuals 3 459990 153330
Response Glycan.biosynthesis.and.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 4081272 2040636 8.3617 0.05932 .
Residuals 3 732135 244045
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Response Metabolism.of.terpenoids.and.polyketides. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 1271419 635710 1.2821 0.3959
Residuals 3 1487561 495854
Response Cell.growth.and.death. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 1813585 906793 5.382 0.1018
Residuals 3 505460 168487
Response Xenobiotics.biodegradation.and.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 1582225 791113 2.958 0.1952
Residuals 3 802351 267450
Response Cell.motility. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 3494712 1747356 0.4893 0.6548
Residuals 3 10714089 3571363
Response Biosynthesis.of.other.secondary.metabolites. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 21589 10794 0.4887 0.6551
Residuals 3 66267 22089
Response Transport.and.catabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 332473 166236 0.2627 0.785
Residuals 3 1898500 632834
Response Transcription. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 25944 12972 0.0646 0.9387
Residuals 3 602335 200778
Response Cellular.community...eukaryotes. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 2670.3 1335.2 0.3781 0.7137
Residuals 3 10592.5 3530.8
Response Signaling.molecules.and.interaction. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 3502.3 1751.17 6.2803 0.08465 .
Residuals 3 836.5 278.83
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Response Enzyme.families. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint 2 0.33333 0.16667 1 0.4648
Residuals 3 0.50000 0.16667
Calculate GPM-normalized gene count sums for each Tuss sample in each Tier IV category.
# Calculate GPM-normalized gene count sums for each Tuss sample in each Tier III category
tuss_dna_tIV_sum<-gpm_all_pot_ann %>% group_by(Tier_IV) %>% summarise_at(vars("S107107","S107108","S107116","S107117","S107119","S107121"), sum)
tuss_dna_tIV_sum<-tuss_dna_tIV_sum %>% mutate_at(vars(S107107,S107108,S107116,S107117,S107119,S107121), funs(round(., 0)))
tuss_dna_tIV_sum<-as.data.frame(tuss_dna_tIV_sum)
tuss_dna_tIV_sum<-tuss_dna_tIV_sum[order(-tuss_dna_tIV_sum$S107107),]
# Calculate mean values from the replicate sums
tuss_dna_tIV_sum$meantuss_T0<-apply(tuss_dna_tIV_sum[,2:3], 1, mean)
tuss_dna_tIV_sum$meantuss_T4<-apply(tuss_dna_tIV_sum[,4:5], 1, mean)
tuss_dna_tIV_sum$meantuss_T24<-apply(tuss_dna_tIV_sum[,6:7], 1, mean)
tuss_dna_tIV_sum<-tuss_dna_tIV_sum %>% mutate_at(vars(meantuss_T0,meantuss_T4,meantuss_T24), funs(round(., 0)))
# Calculate sd values from the replicate sums
tuss_dna_tIV_sum$sdtuss_T0<-apply(tuss_dna_tIV_sum[,2:3], 1, sd)
tuss_dna_tIV_sum$sdtuss_T4<-apply(tuss_dna_tIV_sum[,4:5], 1, sd)
tuss_dna_tIV_sum$sdtuss_T24<-apply(tuss_dna_tIV_sum[,6:7], 1, sd)
tuss_dna_tIV_sum<-tuss_dna_tIV_sum %>% mutate_at(vars(sdtuss_T0,sdtuss_T4,sdtuss_T24), funs(round(., 0)))
# Write data to .csv file
write.csv(tuss_dna_tIV_sum, 'Norm.Results/tuss.dna.tIV.mean.sd.csv')
# Subset Mean values
tuss_dna_tIV_mean<-subset(tuss_dna_tIV_sum, select=c(Tier_IV,meantuss_T0,meantuss_T4,meantuss_T24))
names(tuss_dna_tIV_mean)<-c("Tier IV", "Tuss-T0", "Tuss-T4", "Tuss-T24")
# Subset SD values
tuss_dna_tIV_sd<-subset(tuss_dna_tIV_sum, select=c(Tier_IV,sdtuss_T0,sdtuss_T4,sdtuss_T24))
names(tuss_dna_tIV_sd)<-c("Tier IV", "Tuss-T0", "Tuss-T4", "Tuss-T24")
# Remember "Tier IV" as non-numeric values
tuss_dna_tIV_mean_TierIV<-tuss_dna_tIV_mean$`Tier IV`
tuss_dna_tIV_sd_TierIV<-tuss_dna_tIV_sd$`Tier IV`
# Transpose all but first column (Tier IV)
tuss_dna_tIV_mean<-as.data.frame(t(tuss_dna_tIV_mean[,-1]))
colnames(tuss_dna_tIV_mean)<-tuss_dna_tIV_mean_TierIV
tuss_dna_tIV_sd<-as.data.frame(t(tuss_dna_tIV_sd[,-1]))
colnames(tuss_dna_tIV_sd)<-tuss_dna_tIV_sd_TierIV
# Combine mean and sd into single column with ± divider
tuss_dna_tIV_table<-as.data.frame(do.call(cbind, lapply(1:ncol(tuss_dna_tIV_mean), function(i) paste0(tuss_dna_tIV_mean[ , i], " ± ", tuss_dna_tIV_sd[ , i]))))
# Transpose the table back
tuss_dna_tIV_table<-t(tuss_dna_tIV_table)
# Rename columns to Sites
colnames(tuss_dna_tIV_table)<-c("Tussock (T0)", "Tussock (T4)", "Tussock (T24)")
rownames(tuss_dna_tIV_table)<-tuss_dna_tIV_mean_TierIV
#tuss_dna_tIV_table
Look at differences in gene expression patterns for multiple Tier levels within the Wet Sedge tundra between sampling time points.
Calculate GPM-normalized gene count sums for each WS sample in each Tier II category.
# Calculate GPM-normalized gene count sums for each WS sample in each Tier II category
ws_dna_tII_sum<-gpm_all_pot_ann %>% group_by(Tier_II) %>% summarise_at(vars("S107104","S107105","S107106","S107110","S107111","S107112","S107113","S107114","S107115"), sum)
ws_dna_tII_sum<-ws_dna_tII_sum %>% mutate_at(vars(S107104,S107105,S107106,S107110,S107111,S107112,S107113,S107114,S107115), funs(round(., 0)))
ws_dna_tII_sum<-as.data.frame(ws_dna_tII_sum)
ws_dna_tII_sum<-ws_dna_tII_sum[order(-ws_dna_tII_sum$S107104),]
| Tier_II | S107104 | S107105 | S107106 | S107110 | S107111 | S107112 | S107113 | S107114 | S107115 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Metabolism | 677,998 | 686,740 | 677,310 | 681,200 | 680,806 | 673,076 | 677,516 | 684,991 | 672,620 |
| 3 | Genetic Information Processing | 155,404 | 160,237 | 154,265 | 154,519 | 163,874 | 152,048 | 158,868 | 156,958 | 150,296 |
| 2 | Environmental Information Processing | 129,666 | 120,135 | 129,311 | 128,094 | 122,601 | 134,332 | 127,419 | 124,235 | 135,459 |
| 1 | Cellular Processes | 36,931 | 32,888 | 39,114 | 36,187 | 32,719 | 40,544 | 36,197 | 33,816 | 41,626 |
Heatmap for Tier II categories by sampling timepoints
# Create copy of "ws_dna_tII_sum" object to prep for heatmap
ws_dna_tII_heatmap<-ws_dna_tII_sum
# Convert the first column (Tier categories) into rownames
rownames(ws_dna_tII_heatmap) <- ws_dna_tII_heatmap$Tier_II
ws_dna_tII_heatmap<-as.data.frame(ws_dna_tII_heatmap[-1])
# Rename columns to sample IDs
colnames(ws_dna_tII_heatmap)<-c("WS1-T0","WS2-T0","WS3-T0","WS1-T4","WS2-T4","WS3-T4","WS1-T24","WS2-T24","WS3-T24")
# Remove rows for "Organismal Systems" and "Human Diseases"
#ws_dna_tII_heatmap <- ws_dna_tII_heatmap[-c(5,6),]
# Convert dataframe into a matrix for heatmap
ws_dna_tII_heatmap<-as.matrix(ws_dna_tII_heatmap)
# Scale matrix values to generate Z-scores
ws_dna_tII_heatmap<-scale(t(ws_dna_tII_heatmap))
ws_dna_tII_heatmap<-t(ws_dna_tII_heatmap)
# Specify RColorBrewer custom color palette
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
# Pheatmap
pheatmap(ws_dna_tII_heatmap, treeheight_row = 0, treeheight_col = 0,cluster_rows = FALSE, cluster_cols = FALSE)
Calculate mean values from the replicate sums.
# Calculate mean values from the replicate sums
ws_dna_tII_sum$meanWS_T0<-apply(ws_dna_tII_sum[,2:4], 1, mean)
ws_dna_tII_sum$meanWS_T4<-apply(ws_dna_tII_sum[,5:7], 1, mean)
ws_dna_tII_sum$meanWS_T24<-apply(ws_dna_tII_sum[,8:10], 1, mean)
ws_dna_tII_sum<-ws_dna_tII_sum %>% mutate_at(vars(meanWS_T0,meanWS_T4,meanWS_T24), funs(round(., 0)))
# Calculate sd values from the replicate sums
ws_dna_tII_sum$sdWS_T0<-apply(ws_dna_tII_sum[,2:4], 1, sd)
ws_dna_tII_sum$sdWS_T4<-apply(ws_dna_tII_sum[,5:7], 1, sd)
ws_dna_tII_sum$sdWS_T24<-apply(ws_dna_tII_sum[,8:10], 1, sd)
ws_dna_tII_sum<-ws_dna_tII_sum %>% mutate_at(vars(sdWS_T0,sdWS_T4,sdWS_T24), funs(round(., 0)))
# Subset Mean values
ws_dna_tII_mean<-subset(ws_dna_tII_sum, select=c(Tier_II,meanWS_T0,meanWS_T4,meanWS_T24))
names(ws_dna_tII_mean)<-c("Tier II", "WS - T0", "WS - T4", "WS - T24")
# Subset SD values
ws_dna_tII_sd<-subset(ws_dna_tII_sum, select=c(Tier_II,sdWS_T0,sdWS_T4,sdWS_T24))
names(ws_dna_tII_sd)<-c("Tier II", "WS - T0", "WS - T4", "WS - T24")
# Remember "Tier II" as non-numeric values
ws_dna_tII_mean_TierII<-ws_dna_tII_mean$`Tier II`
ws_dna_tII_sd_TierII<-ws_dna_tII_sd$`Tier II`
# Transpose all but first column (Tier II)
ws_dna_tII_mean<-as.data.frame(t(ws_dna_tII_mean[,-1]))
colnames(ws_dna_tII_mean)<-ws_dna_tII_mean_TierII
ws_dna_tII_sd<-as.data.frame(t(ws_dna_tII_sd[,-1]))
colnames(ws_dna_tII_sd)<-ws_dna_tII_sd_TierII
# Combine mean and sd into single column with ± divider
ws_dna_tII_table<-as.data.frame(do.call(cbind, lapply(1:ncol(ws_dna_tII_mean), function(i) paste0(ws_dna_tII_mean[ , i], " ± ", ws_dna_tII_sd[ , i]))))
# Transpose the table back
ws_dna_tII_table<-t(ws_dna_tII_table)
# Rename columns to Sites
colnames(ws_dna_tII_table)<-c("Wet Sedge (T0)", "Wet Sedge (T4)", "Wet Sedge (T24)")
rownames(ws_dna_tII_table)<-ws_dna_tII_mean_TierII
kable(ws_dna_tII_table, caption = "Wet Sedge Tier II KEGG category averages (± standard deviation) by sample.", format.args = list(big.mark=","), align = "l") %>% kable_styling(bootstrap_options = c("striped","hover","condensed"))
| Wet Sedge (T0) | Wet Sedge (T4) | Wet Sedge (T24) | |
|---|---|---|---|
| Metabolism | 680683 ± 5257 | 678361 ± 4581 | 678376 ± 6230 |
| Genetic Information Processing | 156635 ± 3171 | 156814 ± 6238 | 155374 ± 4500 |
| Environmental Information Processing | 126371 ± 5403 | 128342 ± 5869 | 129038 ± 5784 |
| Cellular Processes | 36311 ± 3159 | 36483 ± 3921 | 37213 ± 4003 |
Run statistical tests to determine if significant differences exist between sampling timepoints for each Tier KEGG category. Click on the Show/Hide button to see the statistics.
# Subset the "sum" values for each sample
ws_dna_tII_stats<-subset(ws_dna_tII_sum, select=c(Tier_II,S107104,S107105,S107106,S107110,S107111,S107112,S107113,S107114,S107115))
rownames(ws_dna_tII_stats) <- ws_dna_tII_stats$Tier_II
ws_dna_tII_stats<-as.data.frame(t(ws_dna_tII_stats[-1]))
# Create Timepoint object for WS_DNA (9 samples)
Timepoint2<-c("T0","T0","T0","T4","T4","T4","T24","T24","T24")
Timepoint2<-data.frame(Timepoint2)
# Add the timepoint column to the sum table
ws_dna_tII_stats<-data.frame(Timepoint2,ws_dna_tII_stats)
# Subset response variables for MANOVA
ws_dna_tII_stats$response <- as.matrix(ws_dna_tII_stats[, 2:5])
# MANOVA test
ws_dna_tII_manova <- manova(response ~ Timepoint2, data=ws_dna_tII_stats)
summary.aov(ws_dna_tII_manova)
Response Metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 10714158 5357079 0.1838 0.8366
Residuals 6 174872134 29145356
Response Genetic.Information.Processing. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 3695405 1847702 0.0801 0.924
Residuals 6 138435221 23072537
Response Environmental.Information.Processing. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 11483847 5741923 0.1774 0.8417
Residuals 6 194208216 32368036
Response Cellular.Processes. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 1375716 687858 0.0499 0.9517
Residuals 6 82751605 13791934
Plot a barchart for Tier II categories by sampling timepoints.
# Place Tier II categories in the preferred order for plotting
ws_dna_tII_bardata$Tier.II <- factor(ws_dna_tII_bardata$Tier.II,levels = c("Metabolism", "Genetic Information Processing", "Environmental Information Processing", "Cellular Processes"))
ws_dna_tII_bardata$Sample <- factor(ws_dna_tII_bardata$Sample,levels=c("WS-T0","WS-T4","WS-T24"))
ws_dna_tII_barplot<-ggplot(ws_dna_tII_bardata, aes(x = Tier.II, y = Mean, fill = Sample)) + geom_bar(stat = "identity", position = "dodge", color="black") + geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.2, position = position_dodge(0.9)) + ylab(expression(atop("KEGG Tier II Categories", paste("Gene Counts (GPM)")))) + theme_minimal() + theme(axis.text=element_text(size=10), axis.title=element_text(size=12), axis.title.x=element_blank()) + theme(legend.position = "bottom", legend.title=element_blank(), legend.text=element_text(size=10), panel.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(colour = "gray"), panel.border = element_rect(colour = "gray", fill=NA, size=1)) + scale_size(guide=FALSE) + scale_fill_manual(values=c("lightskyblue1","dodgerblue1","midnightblue")) +
scale_x_discrete(labels = function(Tier.II) str_wrap(Tier.II, width = 10)) + ylim(0,800000) + annotate(geom="text", x=1.0, y=750000, label="italic(N.S.)", parse=TRUE) + annotate(geom="text", x=2.0, y=250000, label="italic(N.S.)", parse=TRUE) + annotate(geom="text", x=3.0, y=200000, label="italic(N.S.)", parse=TRUE) + annotate(geom="text", x=4.0, y=150000, label="italic(N.S.)", parse=TRUE)
ws_dna_tII_barplot
Calculate GPM-normalized gene count sums for each WS sample in each Tier III category.
# Calculate GPM-normalized gene count sums for each WS sample in each Tier III category
ws_dna_tIII_sum<-gpm_all_pot_ann %>% group_by(Tier_III) %>% summarise_at(vars("S107104","S107105","S107106","S107110","S107111","S107112","S107113","S107114","S107115"), sum)
ws_dna_tIII_sum<-ws_dna_tIII_sum %>% mutate_at(vars(S107104,S107105,S107106,S107110,S107111,S107112,S107113,S107114,S107115), funs(round(., 0)))
ws_dna_tIII_sum<-as.data.frame(ws_dna_tIII_sum)
ws_dna_tIII_sum<-ws_dna_tIII_sum[order(-ws_dna_tIII_sum$S107104),]
# Calculate mean values from the replicate sums
ws_dna_tIII_sum$meanWS_T0<-apply(ws_dna_tIII_sum[,2:4], 1, mean)
ws_dna_tIII_sum$meanWS_T4<-apply(ws_dna_tIII_sum[,5:7], 1, mean)
ws_dna_tIII_sum$meanWS_T24<-apply(ws_dna_tIII_sum[,8:10], 1, mean)
ws_dna_tIII_sum<-ws_dna_tIII_sum %>% mutate_at(vars(meanWS_T0,meanWS_T4,meanWS_T24), funs(round(., 0)))
# Calculate sd values from the replicate sums
ws_dna_tIII_sum$sdWS_T0<-apply(ws_dna_tIII_sum[,2:4], 1, sd)
ws_dna_tIII_sum$sdWS_T4<-apply(ws_dna_tIII_sum[,5:7], 1, sd)
ws_dna_tIII_sum$sdWS_T24<-apply(ws_dna_tIII_sum[,8:10], 1, sd)
ws_dna_tIII_sum<-ws_dna_tIII_sum %>% mutate_at(vars(sdWS_T0,sdWS_T4,sdWS_T24), funs(round(., 0)))
# Write data to .csv file
write.csv(ws_dna_tIII_sum, 'Norm.Results/ws.dna.tIII.mean.sd.csv')
# Subset Mean values
ws_dna_tIII_mean<-subset(ws_dna_tIII_sum, select=c(Tier_III,meanWS_T0,meanWS_T4,meanWS_T24))
names(ws_dna_tIII_mean)<-c("Tier III", "WS - T0", "WS - T4", "WS - T24")
# Subset SD values
ws_dna_tIII_sd<-subset(ws_dna_tIII_sum, select=c(Tier_III,sdWS_T0,sdWS_T4,sdWS_T24))
names(ws_dna_tIII_sd)<-c("Tier III", "WS - T0", "WS - T4", "WS - T24")
# Remember "Tier III" as non-numeric values
ws_dna_tIII_mean_TierIII<-ws_dna_tIII_mean$`Tier III`
ws_dna_tIII_sd_TierIII<-ws_dna_tIII_sd$`Tier III`
# Transpose all but first column (Tier II)
ws_dna_tIII_mean<-as.data.frame(t(ws_dna_tIII_mean[,-1]))
colnames(ws_dna_tIII_mean)<-ws_dna_tIII_mean_TierIII
ws_dna_tIII_sd<-as.data.frame(t(ws_dna_tIII_sd[,-1]))
colnames(ws_dna_tIII_sd)<-ws_dna_tIII_sd_TierIII
# Combine mean and sd into single column with ± divider
ws_dna_tIII_table<-as.data.frame(do.call(cbind, lapply(1:ncol(ws_dna_tIII_mean), function(i) paste0(ws_dna_tIII_mean[ , i], " ± ", ws_dna_tIII_sd[ , i]))))
# Transpose the table back
ws_dna_tIII_table<-t(ws_dna_tIII_table)
# Rename columns to Sites
colnames(ws_dna_tIII_table)<-c("Wet Sedge (T0)", "Wet Sedge (T4)", "Wet Sedge (T24)")
rownames(ws_dna_tIII_table)<-ws_dna_tIII_mean_TierIII
#ws_dna_tIII_table
Run statistical tests to determine if significant differences exist between sampling timepoints for each Tier KEGG category. Click on the Show/Hide button to see the statistics.
# Subset the "sum" values for each sample
ws_dna_tIII_stats<-subset(ws_dna_tIII_sum, select=c(Tier_III,S107104,S107105,S107106,S107110,S107111,S107112,S107113,S107114,S107115))
rownames(ws_dna_tIII_stats) <- ws_dna_tIII_stats$Tier_III
ws_dna_tIII_stats<-as.data.frame(t(ws_dna_tIII_stats[-1]))
# Add the timepoint column to the sum table
ws_dna_tIII_stats<-data.frame(Timepoint2,ws_dna_tIII_stats)
# Subset response variables for MANOVA
ws_dna_tIII_stats$response <- as.matrix(ws_dna_tIII_stats[, 2:26])
# MANOVA test
ws_dna_tIII_manova <- manova(response ~ Timepoint2, data=ws_dna_tIII_stats)
summary.aov(ws_dna_tIII_manova)
Response Overview. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 11592649 5796324 1.7561 0.251
Residuals 6 19803697 3300616
Response Carbohydrate.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 1502651 751325 0.067 0.9359
Residuals 6 67264565 11210761
Response Membrane.transport. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 4254406 2127203 0.1071 0.9001
Residuals 6 119152955 19858826
Response Nucleotide.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 1612242 806121 0.8835 0.461
Residuals 6 5474222 912370
Response Translation. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 574673 287336 0.0443 0.957
Residuals 6 38927135 6487856
Response Energy.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 4006349 2003174 0.17 0.8476
Residuals 6 70700867 11783478
Response Metabolism.of.cofactors.and.vitamins. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 1523417 761708 0.0481 0.9534
Residuals 6 94921883 15820314
Response Amino.acid.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 888100 444050 0.1359 0.8755
Residuals 6 19604963 3267494
Response Replication.and.repair. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 1456087 728043 0.1421 0.8703
Residuals 6 30732394 5122066
Response Signal.transduction. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 2916123 1458061 0.6633 0.5492
Residuals 6 13188454 2198076
Response Folding..sorting.and.degradation. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 268229 134114 0.5365 0.6104
Residuals 6 1499881 249980
Response Lipid.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 628289 314144 0.8499 0.4732
Residuals 6 2217703 369617
Response Glycan.biosynthesis.and.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 895363 447681 0.3709 0.7049
Residuals 6 7241176 1206863
Response Cellular.community...prokaryotes. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 1030745 515372 0.0731 0.9303
Residuals 6 42304933 7050822
Response Metabolism.of.other.amino.acids. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 193273 96636 0.7342 0.5185
Residuals 6 789719 131620
Response Metabolism.of.terpenoids.and.polyketides. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 120644 60322 0.0982 0.9079
Residuals 6 3686720 614453
Response Cell.growth.and.death. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 1454 727 0.0028 0.9972
Residuals 6 1541959 256993
Response Xenobiotics.biodegradation.and.metabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 246889 123444 0.1748 0.8437
Residuals 6 4236271 706045
Response Cell.motility. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 156680 78340 0.1733 0.8449
Residuals 6 2711909 451985
Response Biosynthesis.of.other.secondary.metabolites. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 8076 4038.1 0.2206 0.8083
Residuals 6 109827 18304.6
Response Transport.and.catabolism. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 3354 1677 0.593 0.5821
Residuals 6 16968 2828
Response Transcription. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 10122 5061 0.1253 0.8845
Residuals 6 242390 40398
Response Signaling.molecules.and.interaction. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 505.56 252.78 0.4969 0.6314
Residuals 6 3052.00 508.67
Response Cellular.community...eukaryotes. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 108.22 54.111 0.3796 0.6995
Residuals 6 855.33 142.556
Response Enzyme.families. :
Df Sum Sq Mean Sq F value Pr(>F)
Timepoint2 2 0.6667 0.3333 0.1034 0.9033
Residuals 6 19.3333 3.2222
Calculate GPM-normalized gene count sums for each WS sample in each Tier IV category.
# Calculate GPM-normalized gene count sums for each WS sample in each Tier IV category
ws_dna_tIV_sum<-gpm_all_pot_ann %>% group_by(Tier_IV) %>% summarise_at(vars("S107104","S107105","S107106","S107110","S107111","S107112","S107113","S107114","S107115"), sum)
ws_dna_tIV_sum<-ws_dna_tIV_sum %>% mutate_at(vars(S107104,S107105,S107106,S107110,S107111,S107112,S107113,S107114,S107115), funs(round(., 0)))
ws_dna_tIV_sum<-as.data.frame(ws_dna_tIV_sum)
ws_dna_tIV_sum<-ws_dna_tIV_sum[order(-ws_dna_tIV_sum$S107104),]
# Calculate mean values from the replicate sums
ws_dna_tIV_sum$meanWS_T0<-apply(ws_dna_tIV_sum[,2:4], 1, mean)
ws_dna_tIV_sum$meanWS_T4<-apply(ws_dna_tIV_sum[,5:7], 1, mean)
ws_dna_tIV_sum$meanWS_T24<-apply(ws_dna_tIV_sum[,8:10], 1, mean)
ws_dna_tIV_sum<-ws_dna_tIV_sum %>% mutate_at(vars(meanWS_T0,meanWS_T4,meanWS_T24), funs(round(., 0)))
# Calculate sd values from the replicate sums
ws_dna_tIV_sum$sdWS_T0<-apply(ws_dna_tIV_sum[,2:4], 1, sd)
ws_dna_tIV_sum$sdWS_T4<-apply(ws_dna_tIV_sum[,5:7], 1, sd)
ws_dna_tIV_sum$sdWS_T24<-apply(ws_dna_tIV_sum[,8:10], 1, sd)
ws_dna_tIV_sum<-ws_dna_tIV_sum %>% mutate_at(vars(sdWS_T0,sdWS_T4,sdWS_T24), funs(round(., 0)))
# Write data to .csv file
write.csv(ws_dna_tIV_sum, 'Norm.Results/ws.dna.tIV.mean.sd.csv')
# Subset Mean values
ws_dna_tIV_mean<-subset(ws_dna_tIV_sum, select=c(Tier_IV,meanWS_T0,meanWS_T4,meanWS_T24))
names(ws_dna_tIV_mean)<-c("Tier IV", "WS - T0", "WS - T4", "WS - T24")
# Subset SD values
ws_dna_tIV_sd<-subset(ws_dna_tIV_sum, select=c(Tier_IV,sdWS_T0,sdWS_T4,sdWS_T24))
names(ws_dna_tIV_sd)<-c("Tier IV", "WS - T0", "WS - T4", "WS - T24")
# Remember "Tier IV" as non-numeric values
ws_dna_tIV_mean_TierIV<-ws_dna_tIV_mean$`Tier IV`
ws_dna_tIV_sd_TierIV<-ws_dna_tIV_sd$`Tier IV`
# Transpose all but first column (Tier IV)
ws_dna_tIV_mean<-as.data.frame(t(ws_dna_tIV_mean[,-1]))
colnames(ws_dna_tIV_mean)<-ws_dna_tIV_mean_TierIV
ws_dna_tIV_sd<-as.data.frame(t(ws_dna_tIV_sd[,-1]))
colnames(ws_dna_tIV_sd)<-ws_dna_tIV_sd_TierIV
# Combine mean and sd into single column with ± divider
ws_dna_tIV_table<-as.data.frame(do.call(cbind, lapply(1:ncol(ws_dna_tIV_mean), function(i) paste0(ws_dna_tIV_mean[ , i], " ± ", ws_dna_tIV_sd[ , i]))))
# Transpose the table back
ws_dna_tIV_table<-t(ws_dna_tIV_table)
# Rename columns to Sites
colnames(ws_dna_tIV_table)<-c("Wet Sedge (T0)", "Wet Sedge (T4)", "Wet Sedge (T24)")
rownames(ws_dna_tIV_table)<-ws_dna_tIV_mean_TierIV
#ws_dna_tIV_table
Evaluate differences in metagenome abundance between tussock and wet sedge tundra ecosystems.
Plot the combined Tuss-MG and WS-MG KEGG tier II categories as a barchart.
# Place the KEGG tier II categories in the preferred order for plotting
tuss_ws_dna_tII_bardata$Tier.II <- factor(tuss_ws_dna_tII_bardata$Tier.II,levels = c("Metabolism", "Genetic Information Processing", "Environmental Information Processing", "Cellular Processes"))
tuss_ws_dna_tII_bardata$Sample <- factor(tuss_ws_dna_tII_bardata$Sample,levels=c("Tuss-MG","WS-MG"))
tuss_ws_dna_tII_barplot<-ggplot(tuss_ws_dna_tII_bardata, aes(x = Tier.II, y = Mean, fill = Sample)) + geom_bar(stat = "identity", position = "dodge", color="black") + geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.2, position = position_dodge(0.9)) + ylab(expression(atop("KEGG Tier II Categories", paste("Gene Counts")))) + theme_classic() + theme(axis.text=element_text(size=10), axis.title=element_text(size=12), axis.title.x=element_blank()) + theme(legend.position = "bottom", legend.title=element_blank(), legend.text=element_text(size=10)) + scale_size(guide=FALSE) + scale_fill_manual(values=c("gray83","black")) + scale_x_discrete(labels = function(Tier.II) str_wrap(Tier.II, width = 10)) + scale_y_continuous(expand = c(0, 0), limits = c(0, 800000)) + annotate(geom="text", x=1, y=750000, size=3, label="italic(p)==0.007", parse=TRUE) + annotate(geom="text", x=2, y=225000, size=3, label="italic(N.S.)", parse=TRUE) + annotate(geom="text", x=3, y=200000, size=3, label="italic(p)==0.001", parse=TRUE) + annotate(geom="text", x=4, y=110000, size=3, label="italic(p)==0.001", parse=TRUE) + annotate(geom="text", x=2.5, y=780000, label="Paired t-test")
tuss_ws_dna_tII_barplot
Run paired t-tests to determine significance.
# Metabolism
t.test(tuss.ws.dna.tII.t.test$Tuss.Meta,tuss.ws.dna.tII.t.test$WS.Meta,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tII.t.test$Tuss.Meta and tuss.ws.dna.tII.t.test$WS.Meta
t = 4.3494, df = 5, p-value = 0.007363
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4588.552 17850.448
sample estimates:
mean of the differences
11219.5
# Genetic Information Processing
t.test(tuss.ws.dna.tII.t.test$Tuss.Gene,tuss.ws.dna.tII.t.test$WS.Gene,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tII.t.test$Tuss.Gene and tuss.ws.dna.tII.t.test$WS.Gene
t = -1.0645, df = 5, p-value = 0.3358
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6398.692 2651.025
sample estimates:
mean of the differences
-1873.833
# Environmental Information Processing
t.test(tuss.ws.dna.tII.t.test$Tuss.Env,tuss.ws.dna.tII.t.test$WS.Env,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tII.t.test$Tuss.Env and tuss.ws.dna.tII.t.test$WS.Env
t = -9.5367, df = 5, p-value = 0.0002145
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-23764.20 -13673.14
sample estimates:
mean of the differences
-18718.67
# Cellular Processes
t.test(tuss.ws.dna.tII.t.test$Tuss.Cell,tuss.ws.dna.tII.t.test$WS.Cell,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tII.t.test$Tuss.Cell and tuss.ws.dna.tII.t.test$WS.Cell
t = 9.1097, df = 5, p-value = 0.0002669
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6728.004 12017.662
sample estimates:
mean of the differences
9372.833
Plot the combined Tuss-MG and WS-MG KEGG tier III categories as a single barchart.
tuss_ws_dna_tIII_all_barplot
Run statistical tests to determine if significant differences exist between tundra communities for each KEGG Tier III category. Click on the Show/Hide button to see the statistics.
Run paired t-tests to determine significance.
# Amino acid metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.AA,tuss.ws.dna.tIII.t.test$WS.AA,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.AA and tuss.ws.dna.tIII.t.test$WS.AA
t = 8.4482, df = 2, p-value = 0.01372
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4414.838 13579.162
sample estimates:
mean of the differences
8997
# Other amino acid metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.OAA,tuss.ws.dna.tIII.t.test$WS.OAA,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.OAA and tuss.ws.dna.tIII.t.test$WS.OAA
t = 16.813, df = 2, p-value = 0.003519
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1905.621 3216.379
sample estimates:
mean of the differences
2561
# Carbohydrate metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.Carb,tuss.ws.dna.tIII.t.test$WS.Carb,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Carb and tuss.ws.dna.tIII.t.test$WS.Carb
t = 4.3053, df = 2, p-value = 0.04994
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
7.982796 26376.683871
sample estimates:
mean of the differences
13192.33
# Energy metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.Ener,tuss.ws.dna.tIII.t.test$WS.Ener,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Ener and tuss.ws.dna.tIII.t.test$WS.Ener
t = -6.629, df = 2, p-value = 0.02201
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6218.057 -1323.276
sample estimates:
mean of the differences
-3770.667
# Glycan metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.Gly,tuss.ws.dna.tIII.t.test$WS.Gly,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Gly and tuss.ws.dna.tIII.t.test$WS.Gly
t = 1.2961, df = 2, p-value = 0.3244
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1153.687 2148.353
sample estimates:
mean of the differences
497.3333
# Lipid metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.Lip,tuss.ws.dna.tIII.t.test$WS.Lip,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Lip and tuss.ws.dna.tIII.t.test$WS.Lip
t = 1.5228, df = 2, p-value = 0.2673
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1860.803 3899.469
sample estimates:
mean of the differences
1019.333
# Cofactor metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.Cofac,tuss.ws.dna.tIII.t.test$WS.Cofac,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Cofac and tuss.ws.dna.tIII.t.test$WS.Cofac
t = -0.87232, df = 2, p-value = 0.475
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4769.66 3161.66
sample estimates:
mean of the differences
-804
# Terpenoids metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.Terp,tuss.ws.dna.tIII.t.test$WS.Terp,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Terp and tuss.ws.dna.tIII.t.test$WS.Terp
t = 3.5014, df = 2, p-value = 0.07278
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-246.9969 2405.6636
sample estimates:
mean of the differences
1079.333
# Nucleotide metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.Nucleo,tuss.ws.dna.tIII.t.test$WS.Nucleo,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Nucleo and tuss.ws.dna.tIII.t.test$WS.Nucleo
t = 8.7308, df = 2, p-value = 0.01287
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1189.181 3500.152
sample estimates:
mean of the differences
2344.667
# Secondary metabolites metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.SecMet,tuss.ws.dna.tIII.t.test$WS.SecMet,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.SecMet and tuss.ws.dna.tIII.t.test$WS.SecMet
t = 7.88, df = 2, p-value = 0.01573
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
180.6841 615.3159
sample estimates:
mean of the differences
398
# Xenobiotics metabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.Xeno,tuss.ws.dna.tIII.t.test$WS.Xeno,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Xeno and tuss.ws.dna.tIII.t.test$WS.Xeno
t = 11.21, df = 2, p-value = 0.007864
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3004.829 6748.504
sample estimates:
mean of the differences
4876.667
# Folding, sorting, and degradation
t.test(tuss.ws.dna.tIII.t.test$Tuss.Fold,tuss.ws.dna.tIII.t.test$WS.Fold,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Fold and tuss.ws.dna.tIII.t.test$WS.Fold
t = 4.7816, df = 2, p-value = 0.04106
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
173.5249 3291.1418
sample estimates:
mean of the differences
1732.333
# Replication
t.test(tuss.ws.dna.tIII.t.test$Tuss.Replic,tuss.ws.dna.tIII.t.test$WS.Replic,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Replic and tuss.ws.dna.tIII.t.test$WS.Replic
t = -1.5683, df = 2, p-value = 0.2574
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1915.4712 892.1378
sample estimates:
mean of the differences
-511.6667
# Transcription
t.test(tuss.ws.dna.tIII.t.test$Tuss.Transc,tuss.ws.dna.tIII.t.test$WS.Transc,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Transc and tuss.ws.dna.tIII.t.test$WS.Transc
t = 5.9916, df = 2, p-value = 0.02674
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
65.0223 396.3110
sample estimates:
mean of the differences
230.6667
# Translation
t.test(tuss.ws.dna.tIII.t.test$Tuss.Transl,tuss.ws.dna.tIII.t.test$WS.Transl,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Transl and tuss.ws.dna.tIII.t.test$WS.Transl
t = -3.5065, df = 2, p-value = 0.07259
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6401.9956 652.6622
sample estimates:
mean of the differences
-2874.667
# Membrane Transport
t.test(tuss.ws.dna.tIII.t.test$Tuss.MembTrans,tuss.ws.dna.tIII.t.test$WS.MembTrans,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.MembTrans and tuss.ws.dna.tIII.t.test$WS.MembTrans
t = -11.1, df = 2, p-value = 0.008018
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-30140.37 -13301.63
sample estimates:
mean of the differences
-21721
# Signal Transduction
t.test(tuss.ws.dna.tIII.t.test$Tuss.SignTrans,tuss.ws.dna.tIII.t.test$WS.SignTrans,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.SignTrans and tuss.ws.dna.tIII.t.test$WS.SignTrans
t = 4.367, df = 2, p-value = 0.04864
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
36.98839 4981.01161
sample estimates:
mean of the differences
2509
# Signaling Molecules
t.test(tuss.ws.dna.tIII.t.test$Tuss.SignMol,tuss.ws.dna.tIII.t.test$WS.SignMol,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.SignMol and tuss.ws.dna.tIII.t.test$WS.SignMol
t = -3.8253, df = 2, p-value = 0.06205
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-144.485628 8.485628
sample estimates:
mean of the differences
-68
# Cellular Community
t.test(tuss.ws.dna.tIII.t.test$Tuss.CellComm,tuss.ws.dna.tIII.t.test$WS.CellComm,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.CellComm and tuss.ws.dna.tIII.t.test$WS.CellComm
t = 17.098, df = 2, p-value = 0.003403
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1849.176 3092.824
sample estimates:
mean of the differences
2471
# Cellular Growth
t.test(tuss.ws.dna.tIII.t.test$Tuss.CellGrow,tuss.ws.dna.tIII.t.test$WS.CellGrow,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.CellGrow and tuss.ws.dna.tIII.t.test$WS.CellGrow
t = 3.5329, df = 2, p-value = 0.07162
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-302.0716 3074.7382
sample estimates:
mean of the differences
1386.333
# Cellular Motility
t.test(tuss.ws.dna.tIII.t.test$Tuss.CellMot,tuss.ws.dna.tIII.t.test$WS.CellMot,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.CellMot and tuss.ws.dna.tIII.t.test$WS.CellMot
t = 8.0641, df = 2, p-value = 0.01503
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1984.396 6524.270
sample estimates:
mean of the differences
4254.333
# Transport and Catabolism
t.test(tuss.ws.dna.tIII.t.test$Tuss.Transp,tuss.ws.dna.tIII.t.test$WS.Transp,paired=TRUE)
Paired t-test
data: tuss.ws.dna.tIII.t.test$Tuss.Transp and tuss.ws.dna.tIII.t.test$WS.Transp
t = 5.3064, df = 2, p-value = 0.03373
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
176.7351 1691.9315
sample estimates:
mean of the differences
934.3333
Heatmap for MG Tier III categories by sampling timepoints
# Create a vector with categories in the desired order
tuss_ws_dna_tIII_x <- c("Amino acid metabolism ",
"Metabolism of other amino acids ",
"Carbohydrate metabolism ",
"Energy metabolism ",
"Glycan biosynthesis and metabolism ",
"Lipid metabolism ",
"Metabolism of cofactors and vitamins ",
"Metabolism of terpenoids and polyketides ",
"Nucleotide metabolism ",
"Biosynthesis of other secondary metabolites ",
"Xenobiotics biodegradation and metabolism ",
"Folding, sorting and degradation ",
"Replication and repair ",
"Transcription ",
"Translation ",
"Membrane transport ",
"Signal transduction ",
"Signaling molecules and interaction ",
"Cellular community - prokaryotes ",
"Cell growth and death ",
"Cell motility ",
"Transport and catabolism ")
# Re-sort the data in the desired order
tuss_ws_dna_tIII_heatmap<-tuss_ws_dna_tIII_heatmap %>%
slice(match(tuss_ws_dna_tIII_x, Tier_III))
# Convert the first column (Tier categories) into rownames
rownames(tuss_ws_dna_tIII_heatmap) <- tuss_ws_dna_tIII_heatmap$Tier_III
tuss_ws_dna_tIII_heatmap<-as.data.frame(tuss_ws_dna_tIII_heatmap[-1])
# Rename columns to sample IDs (This includes T24 replicates)
colnames(tuss_ws_dna_tIII_heatmap)<-c("Tuss-MG-T0","Tuss-MG-T4","Tuss-MG-T24","WS-MG-T0","WS-MG-T4","WS-MG-T24")
# Convert dataframe into a matrix for heatmap
tuss_ws_dna_tIII_heatmap-as.matrix(tuss_ws_dna_tIII_heatmap)
# Scale matrix values to generate Z-scores
tuss_ws_dna_tIII_heatmap<-scale(t(tuss_ws_dna_tIII_heatmap))
tuss_ws_dna_tIII_heatmap<-t(tuss_ws_dna_tIII_heatmap)
# Specify RColorBrewer custom color palette
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
### Saving as .eps file with R Viewer
# Re-size the viewer to the following dimensions before saving as .eps file
# In Console, type "tuss_ws_dna_tIII_pheatmap"
# In Plot Viewer, click "Export" --> "Save as Image"
# Width: 550 Height: 650
# Click "Update Preview" --> save as "EPS" file type --> rename as Fig.2.Heat
# Pheatmap
tuss_ws_dna_tIII_pheatmap<-pheatmap(tuss_ws_dna_tIII_heatmap, treeheight_row = 0, treeheight_col = 0, cluster_rows = FALSE, cluster_cols = FALSE)
Potential gene expression related to iron acquisition, iron redox cycling, and iron storage in the metagenomes were determined using FeGenie (Garber et al. 2020). FeGenie provides a comprehensive database of hidden Markov models (HMMs) based on genes related to iron acquisition, storage, and oxidation/reduction in Bacteria and Archaea, which are generally not annotated as such by established gene annotation pipelines, such as GhostKOALA, which was used to annotate the metagenomes (and metatranscriptomes) presented in this study.
Here, the quality-controlled sequencing reads for each metagenome sample were assembled by tundra ecosystem using MEGAHIT to generate a tundra-specific contigs file. The metagenome contigs files were used as input for FeGenie, which first predicted open-reading frames (ORFs) using Prodigal and then queried them against a custom library of HMMs using hmmsearch, with custom bit score cutoffs for each HMM. Results from FeGenie included all identified putative iron-related genes, their functional category, bit score, number of canonical heme-binding motifs, amino acid sequence, and closest homolog. Counts within each iron gene category were summarized as their relative abundance against all identified iron genes.
Assemble QC’d metagenome samples by tundra ecosystem for use in FeGenie
{Terminal}
module load Bioinformatics megahit/1.2.8
cat *_adtrim_clean_qtrim_fwd.derep.fastq > Tuss_DNA_data_fwd.qc.fastq
cat *_adtrim_clean_qtrim_rev.derep.fastq > Tuss_DNA_data_rev.qc.fastq
megahit --k-min 21 --k-max 141 --k-step 12 -1 Tuss_DNA_data_fwd.qc.fastq -2 Tuss_DNA_data_rev.qc.fastq -o ./megahit_tuss_dna_out
cd /gpfs/accounts/gwk_root/gwk1/kjromano/DNA_assembly/WS_DNA
cat *_adtrim_clean_qtrim_fwd.derep.fastq > WS_DNA_data_fwd.qc.fastq
cat *_adtrim_clean_qtrim_rev.derep.fastq > WS_DNA_data_rev.qc.fastq
megahit --k-min 21 --k-max 141 --k-step 12 -1 WS_DNA_data_fwd.qc.fastq -2 WS_DNA_data_rev.qc.fastq -o ./megahit_ws_dna_out
FeGenie is a python script with dependencies that can be downloaded from Arkadiy-Garber GitHub.
Start by loading prerequisite modules
{Terminal}
module load python3.6-anaconda Bioinformatics hmmer prodigal ncbi-blast R
Run the FeGenie python script
{Terminal}
tuss_dna_run.sh
# This is what's inside the "tuss_dna_run.sh" script
FeGenie.py -bin_dir /gpfs/accounts/gwk_root/gwk1/kjromano/DNA_assembly/Tuss_DNA/megahit_tuss_dna_out/ -bin_ext fa -out tuss_dna_out
ws_dna_run.sh
# This is what's inside the "ws_dna_run.sh" script
FeGenie.py -bin_dir /DNA_assembly/WS_DNA/megahit_ws_dna_out/ -bin_ext fa -out WS_dna_out
Summary table of the relative abundance of iron gene categories from the METAGENOMES of Tuss and WS tundra ecosystems.
incomplete final line found by readTableHeader on 'Table.Data/FeGenie.DNA.csv'
The session information is provided for full reproducibility.
devtools::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.1.2 (2021-11-01)
os macOS Big Sur 10.16
system x86_64, darwin17.0
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2022-01-24
rstudio 1.4.1106 Tiger Daylily (desktop)
pandoc 2.11.4 @ /Applications/RStudio.app/Contents/MacOS/pandoc/ (via rmarkdown)
─ Packages ─────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.1.0)
ape * 5.6-1 2022-01-07 [1] CRAN (R 4.1.2)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.0)
brio 1.1.3 2021-11-30 [1] CRAN (R 4.1.0)
broom 0.7.11 2022-01-03 [1] CRAN (R 4.1.2)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.0)
callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0)
car 3.0-12 2021-11-06 [1] CRAN (R 4.1.0)
carData 3.0-5 2022-01-06 [1] CRAN (R 4.1.2)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
cli 3.1.1 2022-01-20 [1] CRAN (R 4.1.2)
cluster 2.1.2 2021-04-17 [1] CRAN (R 4.1.2)
colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
cowplot * 1.1.1 2020-12-30 [1] CRAN (R 4.1.0)
crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.0)
data.table * 1.14.2 2021-09-27 [1] CRAN (R 4.1.0)
DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.0)
dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
desc 1.4.0 2021-09-28 [1] CRAN (R 4.1.0)
devtools * 2.4.3 2021-11-30 [1] CRAN (R 4.1.0)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.0)
dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
DT * 0.20 2021-11-15 [1] CRAN (R 4.1.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.2)
farver 2.1.0 2021-02-28 [1] CRAN (R 4.1.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.0)
generics 0.1.1 2021-10-25 [1] CRAN (R 4.1.0)
ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
ggpubr * 0.4.0 2020-06-27 [1] CRAN (R 4.1.0)
ggsignif 0.6.3 2021-09-09 [1] CRAN (R 4.1.0)
glue 1.6.0 2021-12-17 [1] CRAN (R 4.1.0)
gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.1.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
haven 2.4.3 2021-08-04 [1] CRAN (R 4.1.0)
highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.0)
htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.0)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.1.0)
httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.0)
jsonlite 1.7.3 2022-01-17 [1] CRAN (R 4.1.2)
kableExtra * 1.3.4 2021-02-20 [1] CRAN (R 4.1.0)
knitr * 1.37 2021-12-16 [1] CRAN (R 4.1.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.1.0)
lattice * 0.20-45 2021-09-22 [1] CRAN (R 4.1.2)
lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.0)
lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.1.0)
magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
MASS 7.3-55 2022-01-13 [1] CRAN (R 4.1.2)
Matrix 1.4-0 2021-12-08 [1] CRAN (R 4.1.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.0)
mgcv 1.8-38 2021-10-06 [1] CRAN (R 4.1.2)
modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
nlme 3.1-155 2022-01-13 [1] CRAN (R 4.1.2)
permute * 0.9-5 2019-03-12 [1] CRAN (R 4.1.0)
pheatmap * 1.0.12 2019-01-04 [1] CRAN (R 4.1.0)
pillar 1.6.4 2021-10-18 [1] CRAN (R 4.1.0)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
pkgload 1.2.4 2021-11-30 [1] CRAN (R 4.1.0)
plyr 1.8.6 2020-03-03 [1] CRAN (R 4.1.0)
png * 0.1-7 2013-12-03 [1] CRAN (R 4.1.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0)
ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.0)
RColorBrewer * 1.1-2 2014-12-07 [1] CRAN (R 4.1.0)
Rcpp 1.0.8 2022-01-13 [1] CRAN (R 4.1.2)
readr * 2.1.1 2021-11-30 [1] CRAN (R 4.1.0)
readxl 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.0)
reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.0)
reshape * 0.8.8 2018-10-23 [1] CRAN (R 4.1.0)
rlang 0.4.12 2021-10-18 [1] CRAN (R 4.1.0)
rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.0)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0)
rsconnect 0.8.25 2021-11-19 [1] CRAN (R 4.1.0)
rstatix * 0.7.0 2021-02-13 [1] CRAN (R 4.1.0)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
rvest 1.0.2 2021-10-16 [1] CRAN (R 4.1.0)
scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.0)
statmod * 1.4.36 2021-05-10 [1] CRAN (R 4.1.0)
stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.0)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
svglite 2.0.0 2021-02-20 [1] CRAN (R 4.1.0)
systemfonts 1.0.3 2021-10-13 [1] CRAN (R 4.1.2)
testthat 3.1.2 2022-01-20 [1] CRAN (R 4.1.2)
tibble * 3.1.6 2021-11-07 [1] CRAN (R 4.1.0)
tidyr * 1.1.4 2021-09-27 [1] CRAN (R 4.1.0)
tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.0)
tinytex 0.36 2021-12-19 [1] CRAN (R 4.1.0)
tzdb 0.2.0 2021-10-27 [1] CRAN (R 4.1.0)
usethis * 2.1.5 2021-12-09 [1] CRAN (R 4.1.0)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
vegan * 2.5-7 2020-11-28 [1] CRAN (R 4.1.0)
viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.1.0)
webshot 0.5.2 2019-11-22 [1] CRAN (R 4.1.0)
withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.0)
xfun 0.29 2021-12-14 [1] CRAN (R 4.1.0)
xml2 1.3.3 2021-11-30 [1] CRAN (R 4.1.0)
yaml * 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
[1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
────────────────────────────────────────────────────────────────────────────────────────────────────