R amplicon analysis part 1

load packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(phyloseq)
library (readr)
library(seqinr)
## 
## Attaching package: 'seqinr'
## 
## The following object is masked from 'package:dplyr':
## 
##     count
library(decontam)
library(ape)
## 
## Attaching package: 'ape'
## 
## The following objects are masked from 'package:seqinr':
## 
##     as.alignment, consensus
library(vegan)
## Loading required package: permute
## 
## Attaching package: 'permute'
## 
## The following object is masked from 'package:seqinr':
## 
##     getType
## 
## Loading required package: lattice
## This is vegan 2.6-4

Import data files

# Import Count table. Skip first row of tsv file, which is just some text
count_table <- read_tsv(file="~/qiime2_export/table.tsv", skip = 1)
## Rows: 245 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (1): #OTU ID
## dbl (24): RUN01_POL002_KIEL, RUN07_KAP001_VESTEC1B_50C, RUN07_KAP002_VESTEC1...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# And specify that the first column of data are rownames
count_table <- column_to_rownames(count_table, var = colnames(count_table)[1])
# Import taxonomy of ASVs
taxonomy <- read_tsv(file="~/qiime2_export/taxonomy.tsv")
## Rows: 1645 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): Feature ID, Taxon
## dbl (1): Confidence
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import tree file 
tree = read_tree("~/qiime2_export/tree.nwk")
# Import fasta
asv_fasta <- read.FASTA(file = "~/qiime2_export/dna-sequences.fasta")

Import metadata file

sample_info_tab <- read_tsv(file="~/qiime2_export/metadata-final.tsv")
## Rows: 24 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): sample-id, absolute-filepath
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Also delete the row with the QIIME2 category codes
sample_info_tab<- sample_info_tab[-c(1),]

Decontam

we can do further decontamination with Decontam package but I need blanks or controls and DNA concentrations to do that see more here: https://www.bioconductor.org/packages/devel/bioc/vignettes/decontam/inst/doc/decontam_intro.html example code can be found at BVCN lessons website

Check sequencing depth with rarefaction curves

There are many caveats to using rarefaction curves for interpreting diversity, so use them with caution. Still, they give a sense of your sampling depth and are useful in comparing across samples from the same dataset.

# Use rarecurve, from the Vegan package. Rarecurve expects the dataset as a dataframe so we need to use as.data.frame again:
count_table_df <- as.data.frame(count_table)
# Plot the rarefaction curves, color-coding by the colors listed in sample_info_tab, which indicate sample type, and transforming using t() again
#define your colors
col <- c("black", "darkred", "forestgreen", "orange", "blue", "yellow", "hotpink")
#plot rarefraction curve
rarecurve(t(count_table_df), step=100, cex=0.5, col=col, lwd=2, ylab="nr. of ASVs", xlab="nr. of reads", label=T)
## Warning in rarecurve(t(count_table_df), step = 100, cex = 0.5, col = col, : most
## observed count data have counts 1, but smallest count is 2
# And add a vertical line to the plot indicating the fewest # of sequences from any sample
abline(v=(min(rowSums(t(count_table_df)))))

Remove or check singletons

You may or may not want to remove singletons based on whether or not you have ASVs or OTUs and what your scientific questions are. DADA2 also tries to reduce the number of singletons, so you may not have any. But still some can be introduced. I am removing singletons and ASVs that have an abundance of zero (<=1).

There are multiple ways to do this. here an example using tidyverse:

count_table_no_singletons <- filter(count_table,rowSums(count_table)>1)
# retains all ASVs, no sigletons found in my data

Transformation

In this tutorial / set of examples, we are considering tag-sequencing or amplicon data. This type of sequence data is compositional, because it does not represent true absolute abundances. The total number of sequences in your dataset is arbitrary (set by sequence depth), thus it is inappropriate to make conclusions about the abundance of a given species or taxa (what was targeted in the sequencing effort). It is important to acknowledge this in both your an analysis and interpretation of the data.

Recommended resources to learn more about dealing with compositional data:

  • Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V. & Egozcue, J. J. Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 8, 57–6 (2017).
  • Weiss, S. et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 5, 1–18 (2017).
  • McMurdie, P. J. & Holmes, S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput Biol 10, e1003531 (2014).
  • Coenen AR, Hu SK, Luo E, Muratore D, Weitz JS. A Primer for Microbiome Time-Series Analysis. Front Genet. 2020 Apr 21;11:310.

Normalization of abundance data is important in order to equalize for sampling effort across your samples before you continue with any downstream analyses. There are a variety of ways to do this, and depending on your downstream purposes, some are better than others. As described in Happy Belly, the most simple type of normalization is to standardize to 1 (so you are dealing with fractional abundances) or to re-sample at the depth of your least deeply sampled sample. However, these are not generally accepted practices any more, particularly if you are going to do multivariate tests downstream. This is because these basic normalization approaches do not adjust the distribution of the abundance data, and many downstream statistical tests assume a normal distribution.

Happy Belly uses a great approach to transformation using the DeSeq2 package called a variance-stabilizing transformation that is described in this paper.

Another common approach that is demonstrated here is a Hellinger transformation (described in this paper), which also reduces the influence of zeroes. In sequencing data, an abundance score of zero does not actually mean that an ASV is not present. It could mean that we just did not sequence it. Furthermore, an ASV with an abundance of zero will be strongly negatively correlated with an ASV that has a high abundance from the same sample. But in microbial ecology, the absence of one ASV is not necessarily ecologically meaningful to describe the presence of another, and so we want to reduce the influence of these zeroes (and in general, low abundance ASVs) in our dataset while also fitting our distribution to a normal distribution.

The Vegan package has a variety of transformation options in the decostand function. Here I will call the Hellinger method:

count_table_hellinger <- decostand(t(count_table), "hellinger")
# Note that Vegan expected my rows as samples, so I had to use t() again
# Convert back to a tibble and switch rows for columns again (as_tibble has a weird thing where it will delete the rownames unless you specify rownames = NA)
count_table_hellinger <- as_tibble(t(count_table_hellinger), rownames = NA)
# Compare the untransformed to transformed data
head(count_table_no_singletons) #untransformed
##                                  RUN01_POL002_KIEL RUN07_KAP001_VESTEC1B_50C
## 01526cf494f2fcf738be01ef080d408f                 0                         0
## 01b3ea661423ef0cfa99c5b775825e6a                 0                         0
## 02000fddd9b670d543849a9aa35b43ea                 0                         0
## 02fa8a6a9bb5be33f972390f2c57ed7a                 0                         0
## 04d0173bf28e95d50fc8e270d7e76362                 0                         0
## 0936b8d068baaa522d9de3a0b629e8b9                 0                         0
##                                  RUN07_KAP002_VESTEC1B_MEDIA
## 01526cf494f2fcf738be01ef080d408f                           0
## 01b3ea661423ef0cfa99c5b775825e6a                           0
## 02000fddd9b670d543849a9aa35b43ea                           0
## 02fa8a6a9bb5be33f972390f2c57ed7a                           0
## 04d0173bf28e95d50fc8e270d7e76362                           0
## 0936b8d068baaa522d9de3a0b629e8b9                           0
##                                  RUN07_KAP003_TRIANGL_20C
## 01526cf494f2fcf738be01ef080d408f                        0
## 01b3ea661423ef0cfa99c5b775825e6a                        0
## 02000fddd9b670d543849a9aa35b43ea                        0
## 02fa8a6a9bb5be33f972390f2c57ed7a                        0
## 04d0173bf28e95d50fc8e270d7e76362                        0
## 0936b8d068baaa522d9de3a0b629e8b9                        0
##                                  RUN07_KAP004_TRIANGL_MEDIA
## 01526cf494f2fcf738be01ef080d408f                          0
## 01b3ea661423ef0cfa99c5b775825e6a                          0
## 02000fddd9b670d543849a9aa35b43ea                          0
## 02fa8a6a9bb5be33f972390f2c57ed7a                          0
## 04d0173bf28e95d50fc8e270d7e76362                          0
## 0936b8d068baaa522d9de3a0b629e8b9                          0
##                                  RUN07_KAP007_TINTI2H_5C RUN07_KAP008_SLOV3A_7C
## 01526cf494f2fcf738be01ef080d408f                       0                      0
## 01b3ea661423ef0cfa99c5b775825e6a                       0                      0
## 02000fddd9b670d543849a9aa35b43ea                       0                      0
## 02fa8a6a9bb5be33f972390f2c57ed7a                       0                      0
## 04d0173bf28e95d50fc8e270d7e76362                       0                      0
## 0936b8d068baaa522d9de3a0b629e8b9                       0                      0
##                                  RUN07_KAP010_TINTI2H_MEDIA2
## 01526cf494f2fcf738be01ef080d408f                           0
## 01b3ea661423ef0cfa99c5b775825e6a                           0
## 02000fddd9b670d543849a9aa35b43ea                           0
## 02fa8a6a9bb5be33f972390f2c57ed7a                           0
## 04d0173bf28e95d50fc8e270d7e76362                           0
## 0936b8d068baaa522d9de3a0b629e8b9                           0
##                                  RUN03_KAP028_NA1_10C_001
## 01526cf494f2fcf738be01ef080d408f                        0
## 01b3ea661423ef0cfa99c5b775825e6a                        0
## 02000fddd9b670d543849a9aa35b43ea                        0
## 02fa8a6a9bb5be33f972390f2c57ed7a                        0
## 04d0173bf28e95d50fc8e270d7e76362                        0
## 0936b8d068baaa522d9de3a0b629e8b9                      534
##                                  RUN03_KAP029_NA1_MEDIA_001
## 01526cf494f2fcf738be01ef080d408f                          0
## 01b3ea661423ef0cfa99c5b775825e6a                          0
## 02000fddd9b670d543849a9aa35b43ea                          0
## 02fa8a6a9bb5be33f972390f2c57ed7a                          0
## 04d0173bf28e95d50fc8e270d7e76362                          0
## 0936b8d068baaa522d9de3a0b629e8b9                       2288
##                                  RUN03_KAP030_RACEKX3_50C_001
## 01526cf494f2fcf738be01ef080d408f                            0
## 01b3ea661423ef0cfa99c5b775825e6a                            0
## 02000fddd9b670d543849a9aa35b43ea                            0
## 02fa8a6a9bb5be33f972390f2c57ed7a                            0
## 04d0173bf28e95d50fc8e270d7e76362                            0
## 0936b8d068baaa522d9de3a0b629e8b9                            0
##                                  RUN03_KAP031_STOD4_2C_001
## 01526cf494f2fcf738be01ef080d408f                         0
## 01b3ea661423ef0cfa99c5b775825e6a                         0
## 02000fddd9b670d543849a9aa35b43ea                         0
## 02fa8a6a9bb5be33f972390f2c57ed7a                         0
## 04d0173bf28e95d50fc8e270d7e76362                         0
## 0936b8d068baaa522d9de3a0b629e8b9                         0
##                                  RUN03_KAP032_STOD4_20C_001
## 01526cf494f2fcf738be01ef080d408f                          0
## 01b3ea661423ef0cfa99c5b775825e6a                          0
## 02000fddd9b670d543849a9aa35b43ea                          0
## 02fa8a6a9bb5be33f972390f2c57ed7a                          0
## 04d0173bf28e95d50fc8e270d7e76362                          0
## 0936b8d068baaa522d9de3a0b629e8b9                          0
##                                  RUN03_KAP033_SANPEDRO_10C_001
## 01526cf494f2fcf738be01ef080d408f                             0
## 01b3ea661423ef0cfa99c5b775825e6a                             0
## 02000fddd9b670d543849a9aa35b43ea                             0
## 02fa8a6a9bb5be33f972390f2c57ed7a                             0
## 04d0173bf28e95d50fc8e270d7e76362                             0
## 0936b8d068baaa522d9de3a0b629e8b9                             0
##                                  RUN06_KAP001_KIEL9_10C.fastq
## 01526cf494f2fcf738be01ef080d408f                            0
## 01b3ea661423ef0cfa99c5b775825e6a                            0
## 02000fddd9b670d543849a9aa35b43ea                            0
## 02fa8a6a9bb5be33f972390f2c57ed7a                            0
## 04d0173bf28e95d50fc8e270d7e76362                            0
## 0936b8d068baaa522d9de3a0b629e8b9                            0
##                                  RUN06_KAP002_NA1_10C.fastq
## 01526cf494f2fcf738be01ef080d408f                          0
## 01b3ea661423ef0cfa99c5b775825e6a                          0
## 02000fddd9b670d543849a9aa35b43ea                          0
## 02fa8a6a9bb5be33f972390f2c57ed7a                          0
## 04d0173bf28e95d50fc8e270d7e76362                          0
## 0936b8d068baaa522d9de3a0b629e8b9                          0
##                                  RUN06_KAP003_IZR_10C.fastq
## 01526cf494f2fcf738be01ef080d408f                          0
## 01b3ea661423ef0cfa99c5b775825e6a                          0
## 02000fddd9b670d543849a9aa35b43ea                          0
## 02fa8a6a9bb5be33f972390f2c57ed7a                          0
## 04d0173bf28e95d50fc8e270d7e76362                          0
## 0936b8d068baaa522d9de3a0b629e8b9                          0
##                                  RUN06_KAP004_BALK10C.fastq
## 01526cf494f2fcf738be01ef080d408f                         10
## 01b3ea661423ef0cfa99c5b775825e6a                          0
## 02000fddd9b670d543849a9aa35b43ea                          0
## 02fa8a6a9bb5be33f972390f2c57ed7a                         29
## 04d0173bf28e95d50fc8e270d7e76362                          0
## 0936b8d068baaa522d9de3a0b629e8b9                          0
##                                  RUN06_KAP005_ITA01A_3C.fastq
## 01526cf494f2fcf738be01ef080d408f                            0
## 01b3ea661423ef0cfa99c5b775825e6a                          302
## 02000fddd9b670d543849a9aa35b43ea                            0
## 02fa8a6a9bb5be33f972390f2c57ed7a                            0
## 04d0173bf28e95d50fc8e270d7e76362                            0
## 0936b8d068baaa522d9de3a0b629e8b9                            0
##                                  RUN06_KAP006_SLOV3A_7C.fastq
## 01526cf494f2fcf738be01ef080d408f                            0
## 01b3ea661423ef0cfa99c5b775825e6a                            0
## 02000fddd9b670d543849a9aa35b43ea                            0
## 02fa8a6a9bb5be33f972390f2c57ed7a                            0
## 04d0173bf28e95d50fc8e270d7e76362                            0
## 0936b8d068baaa522d9de3a0b629e8b9                            0
##                                  RUN06_KAP008_NA1_5C.fastq
## 01526cf494f2fcf738be01ef080d408f                         0
## 01b3ea661423ef0cfa99c5b775825e6a                         0
## 02000fddd9b670d543849a9aa35b43ea                         0
## 02fa8a6a9bb5be33f972390f2c57ed7a                         0
## 04d0173bf28e95d50fc8e270d7e76362                         0
## 0936b8d068baaa522d9de3a0b629e8b9                         0
##                                  RUN06_KAP010_NA1_MEDIA.fastq
## 01526cf494f2fcf738be01ef080d408f                            0
## 01b3ea661423ef0cfa99c5b775825e6a                            0
## 02000fddd9b670d543849a9aa35b43ea                            2
## 02fa8a6a9bb5be33f972390f2c57ed7a                            0
## 04d0173bf28e95d50fc8e270d7e76362                            0
## 0936b8d068baaa522d9de3a0b629e8b9                            0
##                                  RUN06_KAP011_IZR_MEDIA.fastq
## 01526cf494f2fcf738be01ef080d408f                            0
## 01b3ea661423ef0cfa99c5b775825e6a                            0
## 02000fddd9b670d543849a9aa35b43ea                            0
## 02fa8a6a9bb5be33f972390f2c57ed7a                            0
## 04d0173bf28e95d50fc8e270d7e76362                           11
## 0936b8d068baaa522d9de3a0b629e8b9                            0
##                                  RUN06_KAP012_STOD4_3C.fastq
## 01526cf494f2fcf738be01ef080d408f                           0
## 01b3ea661423ef0cfa99c5b775825e6a                           0
## 02000fddd9b670d543849a9aa35b43ea                           0
## 02fa8a6a9bb5be33f972390f2c57ed7a                           0
## 04d0173bf28e95d50fc8e270d7e76362                           0
## 0936b8d068baaa522d9de3a0b629e8b9                           0
head(count_table_hellinger) #transformed
## # A tibble: 6 × 24
##   RUN01_POL002…¹ RUN07…² RUN07…³ RUN07…⁴ RUN07…⁵ RUN07…⁶ RUN07…⁷ RUN07…⁸ RUN03…⁹
##            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1              0       0       0       0       0       0       0       0   0    
## 2              0       0       0       0       0       0       0       0   0    
## 3              0       0       0       0       0       0       0       0   0    
## 4              0       0       0       0       0       0       0       0   0    
## 5              0       0       0       0       0       0       0       0   0    
## 6              0       0       0       0       0       0       0       0   0.163
## # … with 15 more variables: RUN03_KAP029_NA1_MEDIA_001 <dbl>,
## #   RUN03_KAP030_RACEKX3_50C_001 <dbl>, RUN03_KAP031_STOD4_2C_001 <dbl>,
## #   RUN03_KAP032_STOD4_20C_001 <dbl>, RUN03_KAP033_SANPEDRO_10C_001 <dbl>,
## #   RUN06_KAP001_KIEL9_10C.fastq <dbl>, RUN06_KAP002_NA1_10C.fastq <dbl>,
## #   RUN06_KAP003_IZR_10C.fastq <dbl>, RUN06_KAP004_BALK10C.fastq <dbl>,
## #   RUN06_KAP005_ITA01A_3C.fastq <dbl>, RUN06_KAP006_SLOV3A_7C.fastq <dbl>,
## #   RUN06_KAP008_NA1_5C.fastq <dbl>, RUN06_KAP010_NA1_MEDIA.fastq <dbl>, …

its somehow not correct, after transforming the rows and columns as tibble, some of the columns still have 0, not 0.00000

Export cleaned data to tsv or fasta or nwk files

At this point, you could continue working in R with the R objects in your Global Environment as is. But it’s also a good idea to export these cleaned-up files, in case you want to upload them into another project later:

# Make a results directory
dir.create("results")
## Warning in dir.create("results"): 'results' already exists
dir.create("results/cleaned_files")
## Warning in dir.create("results/cleaned_files"): 'results\cleaned_files' already
## exists
# And export all files:
# fasta
write.fasta(asv_fasta, taxonomy$`Feature ID`,"results/cleaned_files/HappyBelly_ASVs.fasta")
# transformed count table
write.table(count_table_hellinger, "~/qiime2_export/count_table_hellinger.tsv",sep="\t", quote=F, col.names=NA)
# taxonomy
write.table(taxonomy, "results/cleaned_files/taxonomy.tsv",sep="\t", quote=F, col.names=NA)
# tree File
ape::write.tree(tree, "results/cleaned_files/tree.nwk")

Another option is to save() your files in R as an R object

save(asv_fasta, count_table_hellinger, taxonomy, tree, file = "cleanedfiles.RData")

And later when you can load them with

#load("cleanedfiles.RData")