library(devtools)
library(BiocManager)
#BiocManager::install("ShortRead")
#BiocManager::install("dada2")
library(dada2)
library(dplyr)
library(vegan)
library(ade4)
library(ggplot2)

#https://mothur.s3.us-east-2.amazonaws.com/wiki/miseqsopdata.zip download the sample dataset from the mothur website

#https://zenodo.org/records/1172783/files/silva_nr_v132_train_set.fa.gz?download=1 you will also need to download the reference library to assign species to our ASVs

setwd("~/Desktop/Bioinformatics_Class/Metabarcoding_HW/")

Dataset

In this assignment, we are analyzing a subset of samples from an experiment that compared gut microbiome communities in two groups of lab mice, one tested directly after weaning (“Early” group) and one several weeks after weaning (“Late” group). The samples were tested using metabarcoding primers focused on the 16S V4 region of rRNA, one of the best-studied areas used for identification of prokaryotes.

Part 1: Reading in Sample Data & Quality Control

Read in the samples from the example data set. This dataset uses 250bp paired-end sequencing data, so there are two files for each of the 20 samples, one for the forward reads (R1_001) and one for the reverse reads (R2_001).

Loading Data & Data Exploration

path <- "./MiSeq_SOP/" #set the path to the samples, this will work so long as the MiSeq_SOP folder has been unzipped in your current working directory, otherwise write out the full path
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = T)) #read the forward read file names
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = T)) #read the reverse read file names
sample.names <- sapply(strsplit(basename(fnFs), "_"),`[`, 1) #clip the sample names from the first part of each of the forward read file names
sample.names #print the sample names to ensure the above steps worked
##  [1] "F3D0"   "F3D1"   "F3D141" "F3D142" "F3D143" "F3D144" "F3D145" "F3D146"
##  [9] "F3D147" "F3D148" "F3D149" "F3D150" "F3D2"   "F3D3"   "F3D5"   "F3D6"  
## [17] "F3D7"   "F3D8"   "F3D9"   "Mock"
plotQualityProfile(fnFs) #plot base calling quality plots for the forward reads

plotQualityProfile(fnRs) #plot base calling quality plots for the reverse reads

Question 1: Are there any problems you see with the data sequencing quality? The DADA2 model that sorts sequences into ARVs incorporates quality scores into its assessments, but it does help if we can remove large sources of potential errors. What filtering do you think would be helpful for this dataset?

Filtering

Below you will create filepaths where filtered versions of the sample data will be written. You will also run the filterandTrim function where you will need to set the following parameters:

  • fwd = forward read filenames
  • filt = filenames for the filtered forward read output files
  • rev = reverse read filenames
  • rev.filt = filenames for the filtered reverse read output files
  • truncQ = an integer representing the lowest quality base call where a read will be truncated
  • truncLen = an integer of the truncation length you want to set for the forward and reverse reads, respectively. Any reads with a shorter length than this will be discarded
  • trimLeft = a vector of two integers for the number of bases you would like to remove from the start of the reads (this occurs after truncation)
  • trimRight = a vector of two integers for the number of bases you would like to remove from the end of the reads (this occurs after truncation)
  • maxLen = maximum length of reads you want to include in the filtered dataset (filter applied before truncation & trimming)
  • minLen = minimum length of reads you want to include in the filtered dataset (filter applied after truncation & trimming)
  • maxN = the maximum number of ambiguous bases you want to allow in a read before filtering it
  • minQ = minimum quality score for any bases in a read after truncation and trimming
  • maxEE = maximum value for the expected errors in a read, based on the Q scores across all bases
  • rm.phix = removes reads that align to the PhiX bacteriophage genome, which is produced commercially as a standard to test the accuracy of Illumina sequencers
  • rm.lowcomplex = an integer representing the effective number of kmers; sequences with a lot of repetitive base pairs will have lower complexity scores

Note: You can set different values for the forward and reverse reads for the integer parameters with a two-integer vector (e.g. c(1,2)). Note2: These are the default parameters from the DADA2 tutorial I used. Feel free to experiment around with them to see how they affect (and might be able to improve) the quality of the data.

filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names
out <- filterAndTrim(fwd=fnFs, filt=filtFs, rev=fnRs, filt.rev=filtRs, truncQ = 2, truncLen = c(240,240), trimLeft=0, maxLen=Inf, minLen=0, maxN=0, minQ=0, maxEE = c(2,2), rm.phix = T, rm.lowcomplex = 0, multithread = T, compress = T)

Question 2: Plot the quality plots of the samples after filtering. What differences do you see in the filtered samples versus the original dataset?

plotQualityProfile(filtFs)

plotQualityProfile(filtRs)

Error Models

DADA2 implements a machine learning algorithm to attempt to estimate the number of errors due to sequencing, PCR errors, and other sources that have been artificially introduced into the data. This information is necessary for building the ASVs. When creating ASVs, DADA2 needs to be able to assess whether base pair differences between two sequences are real or likely due to some mistake in the sequencing process before deciding whether to split them into separate ASVs or lump them together.

errF <- learnErrors(filtFs, multithread=T)
## 27263040 total bases in 113596 reads from 20 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread=T)
## 27263040 total bases in 113596 reads from 20 samples will be used for learning the error rates.

Here, we can plot the predicted error rate for each combination of base pairs. You should generally see that the predicted error rate falls as quality (Phred) scores get higher. Along the diagonal are bases correctly called and they should be near 1 across all Phred scores (but may dip lower for lower quality bases).

plotErrors(errF, nominalQ=T)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.

plotErrors(errR, nominalQ=T)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.

Identify Unique Sequences in Each Sample (ASVs)

Here, supply DADA with the error models built in the last step and it will identify sequences it believes are true variants in each of the samples. You can process each sample independently, or you can pool all the samples together, which will improve your ability to identify rare variants.

dadaFs <- dada(filtFs, errF, multithread = T)
## Sample 1 - 6038 reads in 1654 unique sequences.
## Sample 2 - 4672 reads in 1407 unique sequences.
## Sample 3 - 4477 reads in 1200 unique sequences.
## Sample 4 - 2297 reads in 684 unique sequences.
## Sample 5 - 2361 reads in 753 unique sequences.
## Sample 6 - 3287 reads in 963 unique sequences.
## Sample 7 - 4952 reads in 1272 unique sequences.
## Sample 8 - 3807 reads in 1207 unique sequences.
## Sample 9 - 11439 reads in 2608 unique sequences.
## Sample 10 - 9055 reads in 2127 unique sequences.
## Sample 11 - 9724 reads in 2410 unique sequences.
## Sample 12 - 4036 reads in 1247 unique sequences.
## Sample 13 - 15017 reads in 3028 unique sequences.
## Sample 14 - 5003 reads in 1164 unique sequences.
## Sample 15 - 3289 reads in 982 unique sequences.
## Sample 16 - 6206 reads in 1530 unique sequences.
## Sample 17 - 3917 reads in 966 unique sequences.
## Sample 18 - 4285 reads in 1195 unique sequences.
## Sample 19 - 5651 reads in 1477 unique sequences.
## Sample 20 - 4083 reads in 792 unique sequences.
dadaRs <- dada(filtRs, errR, multithread = T)
## Sample 1 - 6038 reads in 2989 unique sequences.
## Sample 2 - 4672 reads in 2347 unique sequences.
## Sample 3 - 4477 reads in 2395 unique sequences.
## Sample 4 - 2297 reads in 1342 unique sequences.
## Sample 5 - 2361 reads in 1375 unique sequences.
## Sample 6 - 3287 reads in 1983 unique sequences.
## Sample 7 - 4952 reads in 2844 unique sequences.
## Sample 8 - 3807 reads in 2097 unique sequences.
## Sample 9 - 11439 reads in 5855 unique sequences.
## Sample 10 - 9055 reads in 4423 unique sequences.
## Sample 11 - 9724 reads in 4894 unique sequences.
## Sample 12 - 4036 reads in 2274 unique sequences.
## Sample 13 - 15017 reads in 6975 unique sequences.
## Sample 14 - 5003 reads in 2552 unique sequences.
## Sample 15 - 3289 reads in 1868 unique sequences.
## Sample 16 - 6206 reads in 3179 unique sequences.
## Sample 17 - 3917 reads in 2073 unique sequences.
## Sample 18 - 4285 reads in 2142 unique sequences.
## Sample 19 - 5651 reads in 2797 unique sequences.
## Sample 20 - 4083 reads in 1367 unique sequences.

Merge the paired-end reads

Here we can merge the paired-end reads into one sequence. DADA expects there to be a region that overlaps between both reads and will remove any pairs that don’t have a perfectly overlapping region.

mergers <- mergePairs(dadaF = dadaFs, filtFs, dadaR = dadaRs, filtRs)
head(mergers[[1]])
##                                                                                                                                                                                                                                                        sequence
## 1  TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGCCAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTACAGCCGTTGAAACTGCCGGGCTCGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAGG
## 2  TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGAGAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGGGTATCGAACAGG
## 3  TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTTTTAAGTCAGCGGTAAAAATTCGGGGCTCAACCCCGTCCGGCCGTTGAAACTGGGGGCCTTGAGTGGGCGAGAAGAAGGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCCTTCCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGGGGATCGAACAGG
## 5  TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGACTCTCAAGTCAGCGGTCAAATCGCGGGGCTCAACCCCGTTCCGCCGTTGAAACTGGGAGCCTTGAGTGCGCGAGAAGTAGGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCCTACCGGCGCGCAACTGACGCTCATGCACGAAAGCGTGGGTATCGAACAGG
## 6 TACGTAGGGGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGCGTAGACGGTAATGCAAGTCTGGAGTGAAAGGCGGGGGCCCAACCCCCGGACTGCTCTGGAAACTGTGTAACTGGAGTGCAGGAGAGGCAGGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCCTGCTGGACTGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
## 7  TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACAGG
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1       442       2       3    228         0      0      1   TRUE
## 2       426       1      15    228         0      0      1   TRUE
## 3       419       3       1    228         0      0      1   TRUE
## 5       283       5       7    228         0      0      1   TRUE
## 6       276       9       2    227         0      0      2   TRUE
## 7       262       6       6    228         0      0      1   TRUE

You should have fairly high success when merging reads. If you have lost many reads (>20%), you may want to go back and make sure you are not truncating/trimming the overlapping parts of reads and/or apply more stringent base quality filtering.

merge_successF <- NULL
for(i in 1:length(mergers)){
  merge_successF[i] <- sum(mergers[[i]]$abundance)/sum(dadaFs[[i]]@.Data[[2]]$abundance)
}
merge_successR <- NULL
for(i in 1:length(mergers)){
  merge_successR[i] <- sum(mergers[[i]]$abundance)/sum(dadaRs[[i]]@.Data[[2]]$abundance)
}
merge_successF
##  [1] 0.9080149 0.9383011 0.8620137 0.9212454 0.8660436 0.7674125 0.9261801
##  [8] 0.8563729 0.8655589 0.8213525 0.8727863 0.8197060 0.9502855 0.8515195
## [15] 0.9134131 0.8702177 0.9160986 0.9028176 0.9546362 0.9635468
merge_successR
##  [1] 0.9086294 0.9486053 0.8639908 0.9345100 0.8660436 0.7769623 0.9238747
##  [8] 0.8723579 0.8676405 0.8396194 0.8726949 0.8184394 0.9634902 0.8546571
## [15] 0.9177475 0.8677849 0.9182654 0.9026021 0.9510123 0.9630724

Make Sequence Table

Now we can combine sequence abundance data from across all of the samples into a single table.

seqtab <- makeSequenceTable(mergers)
dim(seqtab) #print the dimensions of the ASV count x sample table
## [1]  20 253
rowSums(seqtab > 0) #calculate the number of ASVs in a sample with at least one sequence found
##   F3D0   F3D1 F3D141 F3D142 F3D143 F3D144 F3D145 F3D146 F3D147 F3D148 F3D149 
##     99     90     68     45     52     41     67     77    111    101    114 
## F3D150   F3D2   F3D3   F3D5   F3D6   F3D7   F3D8   F3D9   Mock 
##     74    136     63     71     85     59     92     97     20
table(nchar(getSequences(seqtab))) #print the lengths of the representative sequences used to define each ASV
## 
## 251 252 253 254 255 
##   1  52 192   6   2

Question 3: How many ASVs did you find across all samples? Which sample had the highest ASV richness? The lowest? What is the range in lengths of the ASV representative sequences?

Remove Chimeras

Now that we have our de-noised ASVs, we can look for chimeras - merged paired sequences that were incorrectly paired with the wrong mate, often due to an error in the PCR phase. Most of the time, these chimeras will be relatively rare sequences that have partial alignments to two different more common sequences in the dataset.

seqtab_nochimera <- removeBimeraDenovo(seqtab, method="consensus", multithread=T)
ncol(seqtab_nochimera)/ncol(seqtab) #see the proportion of ASVs remaining after chimera filtering
## [1] 0.9209486
sum(seqtab_nochimera/sum(seqtab)) #see the proportion of the total sequence count after chimera filtering
## [1] 0.9807675

Question 4: How many ASVs did you identify as chimeric? What proportion of the total count was made up of chimeras?

Summary of the Filtering Steps

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab_nochimera), (rowSums(seqtab_nochimera)/out[,1])*100)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "non-chimeric", "final-kept(%)")
row.names(track) <- sample.names
track
##        input filtered denoisedF denoisedR merged non-chimeric final-kept(%)
## F3D0    7793     6038      5914      5910   5370         5370      68.90799
## F3D1    5869     4672      4603      4553   4319         4319      73.59005
## F3D141  5958     4477      4370      4360   3767         3711      62.28600
## F3D142  3183     2297      2184      2153   2012         1960      61.57713
## F3D143  3178     2361      2247      2247   1946         1931      60.76149
## F3D144  4827     3287      3173      3134   2435         2395      49.61674
## F3D145  7377     4952      4809      4821   4454         4358      59.07550
## F3D146  5021     3807      3711      3643   3178         3137      62.47759
## F3D147 17070    11439     11254     11227   9741         9276      54.34095
## F3D148 12405     9055      8917      8723   7324         7109      57.30754
## F3D149 13083     9724      9543      9544   8329         8163      62.39395
## F3D150  5509     4036      3877      3883   3178         3178      57.68742
## F3D2   19620    15017     14885     14681  14145        13823      70.45362
## F3D3    6758     5003      4903      4885   4175         3981      58.90796
## F3D5    4448     3289      3176      3161   2901         2901      65.22032
## F3D6    7989     6206      6064      6081   5277         5223      65.37739
## F3D7    5129     3917      3814      3805   3494         3372      65.74381
## F3D8    5294     4285      4188      4189   3781         3768      71.17491
## F3D9    7070     5651      5511      5532   5261         5208      73.66337
## Mock    4779     4083      4060      4062   3912         3912      81.85813

There should not be any major drops in the number of sequences in the samples for any of the steps, and the final counts should include most of the original reads.

Question 5: Do you see any probelmatic samples and why would you consider removing them? Likewise, are there any steps that resulted in a large decline in reads that you might want to revisit?

Assign Taxonomy

taxa <- assignTaxonomy(seqtab_nochimera, "./silva_nr_v132_train_set.fa.gz", multithread = T)
taxa.print <- taxa
rownames(taxa.print) <- NULL #removing the rownames (which are full sequences) for display purposes
head(taxa.print)
##      Kingdom    Phylum          Class         Order           Family          
## [1,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [2,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [3,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
## [4,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Bacteroidaceae"
## [5,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Rikenellaceae" 
## [6,] "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Muribaculaceae"
##      Genus        
## [1,] NA           
## [2,] NA           
## [3,] NA           
## [4,] "Bacteroides"
## [5,] "Alistipes"  
## [6,] NA

We can check the accuracy of the taxonomy assignment by comparing the results to what is expected given a “mock” community constructed of known taxa that was sequenced alongside the other samples.

unqs.mock <- seqtab_nochimera["Mock",]
unqs.mock <- sort(unqs.mock[unqs.mock > 0], decreasing = T) #drop ASVs missing from the Mock community
cat("DADA2 inferred", length(unqs.mock), "sample sequences present in the mock community.\n")
## DADA2 inferred 20 sample sequences present in the mock community.
mock.ref <- getSequences(file.path(path, "HMP_MOCK.v35.fasta"))
match.ref <- sum(sapply(names(unqs.mock), function(x) any(grepl(x, mock.ref))))
cat("of those,", sum(match.ref), "were exact matches to expected reference sequences.\n")
## of those, 20 were exact matches to expected reference sequences.

Question 6: What is the taxonomy assignment error rate you found based on the correct assignment of mock community members to the reference sequences?

Community Data Exploration

mouse_data.df <- read.table("./MiSeq_SOP/mouse.time.design", header=T)
mouse_community.df <- merge(mouse_data.df, as.data.frame(seqtab_nochimera), by.x="group", by.y="row.names")

Rarefaction and Species Accumulation Curves

Rarefaction and species accumulation curves are used to assess whether or not our sampling was effective enough to capture most of the variation present in the communities we are interested in.

Rarefaction curves are used to make sure we recovered enough reads in our samples to be confident we are representing close to the actual species richness present. With species accumulation curves, we are assessing wheter we sampled enough sites to capture most of the diversity present in these communities. In both cases, we should see arcs that level off near an asymptote (i.e. the inferred “true” number of species/ASVs present).

rarecurve(seqtab_nochimera[-20,], sample=min(rowSums(seqtab_nochimera[-20,])), col="red")

mouse_specaccum <- specaccum(seqtab_nochimera[-20,])
mouse_specaccum.df <- data.frame(sites=mouse_specaccum$sites, richness=mouse_specaccum$richness, sd=mouse_specaccum$sd)
ggplot(data=mouse_specaccum.df)+
  geom_line(aes(x=sites, y=richness))+
  geom_ribbon(aes(x=sites, ymin=richness-sd, ymax=richness+sd), alpha=0.5)

Question 7: Have we sampled enough to capture most of the diversity in the communities we are analyzing?

Principal Components Analysis

mouse.pca <- dudi.pca(mouse_community.df[,-c(1:2)], scannf=F, nf=2)
ggplot(data=mouse.pca$li, aes(x=Axis1, y=Axis2, color=mouse_community.df$time))+
  geom_point()+
  stat_ellipse()+
  labs(color="Time Group")+
  theme_classic()

loadings_taxonomy <- merge(mouse.pca$co, taxa, by="row.names") #attach PC loadings to the taxonomy assignment to see what taxa are driving variation
head(sort_by(loadings_taxonomy[,-1], abs(loadings_taxonomy$Comp1), decreasing=T), n = 20)
##          Comp1        Comp2  Kingdom         Phylum               Class
## 136 -0.9707309  0.075193540 Bacteria     Firmicutes          Clostridia
## 50  -0.9698994 -0.023367828 Bacteria     Firmicutes          Clostridia
## 138 -0.9643137  0.014661595 Bacteria     Firmicutes          Clostridia
## 132 -0.9411651  0.060223986 Bacteria     Firmicutes          Clostridia
## 198 -0.9361116 -0.128905332 Bacteria     Firmicutes          Clostridia
## 81  -0.9338237  0.083838976 Bacteria     Firmicutes          Clostridia
## 143 -0.9332400  0.160635075 Bacteria     Firmicutes          Clostridia
## 196 -0.9327119  0.108762840 Bacteria     Firmicutes          Clostridia
## 133 -0.9317057  0.008309817 Bacteria     Firmicutes          Clostridia
## 83  -0.9300247  0.027069724 Bacteria     Firmicutes          Clostridia
## 222 -0.9047067  0.110638957 Bacteria     Firmicutes          Clostridia
## 90  -0.9031631 -0.143833727 Bacteria     Firmicutes          Clostridia
## 162 -0.9009005  0.090411192 Bacteria     Firmicutes          Clostridia
## 78  -0.8992363 -0.157644005 Bacteria     Firmicutes          Clostridia
## 47  -0.8888190 -0.076713750 Bacteria     Firmicutes          Clostridia
## 5   -0.8855748  0.053158401 Bacteria Proteobacteria Gammaproteobacteria
## 192 -0.8843518 -0.022641515 Bacteria     Firmicutes          Clostridia
## 145 -0.8720650 -0.010608597 Bacteria     Firmicutes          Clostridia
## 98  -0.8683911 -0.012190885 Bacteria     Firmicutes          Clostridia
## 86  -0.8664072  0.236371871 Bacteria     Firmicutes          Clostridia
##               Order           Family                         Genus
## 136   Clostridiales  Lachnospiraceae                 Acetatifactor
## 50    Clostridiales  Ruminococcaceae           Ruminiclostridium_5
## 138   Clostridiales  Lachnospiraceae Lachnospiraceae_NK4A136_group
## 132   Clostridiales  Lachnospiraceae Lachnospiraceae_NK4A136_group
## 198   Clostridiales  Ruminococcaceae                 Oscillibacter
## 81    Clostridiales  Lachnospiraceae                 Acetatifactor
## 143   Clostridiales  Lachnospiraceae                          <NA>
## 196   Clostridiales  Ruminococcaceae                 Oscillibacter
## 133   Clostridiales  Lachnospiraceae Lachnospiraceae_NK4A136_group
## 83    Clostridiales  Lachnospiraceae                 Acetatifactor
## 222   Clostridiales  Lachnospiraceae                          <NA>
## 90    Clostridiales  Lachnospiraceae       Lachnospiraceae_UCG-006
## 162   Clostridiales  Lachnospiraceae       Lachnospiraceae_UCG-001
## 78    Clostridiales  Lachnospiraceae                     Roseburia
## 47    Clostridiales  Ruminococcaceae      Hydrogenoanaerobacterium
## 5   Pseudomonadales Pseudomonadaceae                   Pseudomonas
## 192   Clostridiales  Ruminococcaceae                Intestinimonas
## 145   Clostridiales  Lachnospiraceae                          <NA>
## 98    Clostridiales  Lachnospiraceae Lachnospiraceae_NK4A136_group
## 86    Clostridiales  Lachnospiraceae Lachnospiraceae_NK4A136_group
head(sort_by(loadings_taxonomy[,-1], abs(loadings_taxonomy$Comp2), decreasing=T), n = 20)
##            Comp1      Comp2  Kingdom        Phylum       Class           Order
## 104  0.176725227 -0.9040260 Bacteria    Firmicutes  Clostridia   Clostridiales
## 116 -0.202811449 -0.8758021 Bacteria    Firmicutes  Clostridia   Clostridiales
## 25   0.078206792 -0.8747599 Bacteria Bacteroidetes Bacteroidia   Bacteroidales
## 166  0.018459346 -0.8733230 Bacteria    Firmicutes  Clostridia   Clostridiales
## 135 -0.115286615 -0.8585803 Bacteria    Firmicutes  Clostridia   Clostridiales
## 152  0.006966416 -0.8577845 Bacteria    Firmicutes  Clostridia   Clostridiales
## 231  0.166537291 -0.8508686 Bacteria    Firmicutes  Clostridia   Clostridiales
## 129  0.245881802 -0.8470250 Bacteria    Firmicutes  Clostridia   Clostridiales
## 62   0.207283994 -0.8376596 Bacteria    Firmicutes  Clostridia   Clostridiales
## 20   0.033176735 -0.8283804 Bacteria Bacteroidetes Bacteroidia   Bacteroidales
## 108  0.151222392 -0.8206836 Bacteria    Firmicutes  Clostridia   Clostridiales
## 123  0.279416133 -0.8090054 Bacteria    Firmicutes  Clostridia   Clostridiales
## 84   0.190463409 -0.7957913 Bacteria    Firmicutes  Clostridia   Clostridiales
## 205  0.334205622 -0.7845644 Bacteria    Firmicutes     Bacilli Lactobacillales
## 112  0.141555548 -0.7828789 Bacteria    Firmicutes  Clostridia   Clostridiales
## 165 -0.551218636 -0.7779333 Bacteria    Firmicutes  Clostridia   Clostridiales
## 19   0.262117213 -0.7670300 Bacteria Bacteroidetes Bacteroidia   Bacteroidales
## 97   0.324920160 -0.7668876 Bacteria    Firmicutes  Clostridia   Clostridiales
## 229  0.207286198 -0.7608486 Bacteria    Firmicutes  Clostridia   Clostridiales
## 102  0.089002664 -0.7550471 Bacteria    Firmicutes  Clostridia   Clostridiales
##               Family                         Genus
## 104  Lachnospiraceae Lachnospiraceae_NK4A136_group
## 116  Lachnospiraceae                          <NA>
## 25    Muribaculaceae                          <NA>
## 166  Lachnospiraceae                          <NA>
## 135  Lachnospiraceae                 Acetatifactor
## 152  Lachnospiraceae                          <NA>
## 231  Lachnospiraceae                     Roseburia
## 129  Lachnospiraceae                          <NA>
## 62   Ruminococcaceae                 Anaerotruncus
## 20    Muribaculaceae                          <NA>
## 108  Lachnospiraceae                          <NA>
## 123  Lachnospiraceae             Lachnoclostridium
## 84   Lachnospiraceae Lachnospiraceae_NK4A136_group
## 205 Lactobacillaceae                 Lactobacillus
## 112  Lachnospiraceae                          <NA>
## 165  Lachnospiraceae                          <NA>
## 19    Muribaculaceae                          <NA>
## 97   Lachnospiraceae Lachnospiraceae_NK4A136_group
## 229  Lachnospiraceae                     Roseburia
## 102  Lachnospiraceae                 Acetatifactor

Question 8: What groups of taxa seem to be driving variation in PC1? PC2?

Permuted MANOVA

PERMANOVAs are a method where we can test if treatment groups have significantly different community compositions. They are based on distance matrices of our community data (here we’ll use Bray-Curtis distances, a common metric used for both traditional species counts and eDNA community data), and because they use a permuted bootstrapping testing structure, do not rely on assumptions of empirical distributions like a typical ANOVA or MANOVA.

adonis2(mouse_community.df[,-c(1:2)] ~ time, data=mouse_community.df, method="bray")
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = mouse_community.df[, -c(1:2)] ~ time, data = mouse_community.df, method = "bray")
##          Df SumOfSqs      R2      F Pr(>F)    
## Model     1  0.64399 0.27135 6.3309  0.001 ***
## Residual 17  1.72928 0.72865                  
## Total    18  2.37327 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question 9: Based on the PERMANOVA results and the principal components, what evidence supports (or does not support) that time had an effect on the gut microbiota of these mice?