Detecting contamination in microbiome research with decontam

This dataset was obtained from the doctoral dissertation of Dr. LaShanda Williams. You will find the sequences in this dataset here.
Note: The samples were shotgun sequenced and only 16S profiles were used in the analysis below.

knitr::opts_chunk$set(echo = TRUE)

library(phyloseq)
library(decontam)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(DT)

load(file = "diss_phyloseq.RData")
classification <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
otu_taxonomy <- read_delim("otu_info/97_otu_taxonomy.txt", col_names = classification)

## Rows: 99322 Columns: 7

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (7): kingdom, phylum, class, order, family, genus, species

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

a <- prune_taxa(taxa_sums(a) > 1, a)
a <- prune_samples(sample_sums(a) > 0, a)
a

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 15635 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 15635 taxa by 7 taxonomic ranks ]

datatable(sample_data(a))

We’ll begin by getting an idea of how library size compares between samples (calculus) and blanks. This is an important first step especially in low biomass samples since negatives can have similar library sizes to samples in those cases.

df <- as.data.frame(sample_data(a))
df$LibrarySize <- sample_sums(a)
df <- df[order(df$LibrarySize),]
df$Index <- seq(nrow(df))
ggplot(data=df, aes(x=Index, y=LibrarySize, color=Env)) + geom_point()

We can see the library size is low in negatives (blanks) compared to samples. While there are low yield samples that may be considered for removal, we’ll keep them in this preliminary contamination analysis.

Next, let’s see how much of the variation within the dataset can be explained by metadata features.

a.ord <- ordinate(a, "PCoA", "bray")
plot_ordination(a, a.ord, type="samples", color="Env")

Axis 1 and 2 explain ~20% of the variation in the dataset. The samples appear to be more distinct from each other than the negatives which cluster very closely to one another.

Next, let’s begin our contamination analysis using the isContaminant function from the decontam package using the prevalence method. Ideally you’d do some exploratory analysis to see if which threshold is most appropriate for classification of contaminants vs. non-contaminants. We illustrate how to change the threshold below.

The default threshold is 0.1 and we illustrate the results from the more aggressive threshold set at 0.5. Sequences are classified as contaminants if their probability or p is under 0.1.

sample_data(a)$is.neg <- sample_data(a)$Env == "blank"
contamdf.prev.01 <- isContaminant(a, method="prevalence", neg="is.neg", threshold=0.1)
table(contamdf.prev.01$contaminant)

## 
## FALSE  TRUE 
## 15630     5

contamdf.prev.05 <- isContaminant(a, method="prevalence", neg="is.neg", threshold=0.5)
table(contamdf.prev.05$contaminant)

## 
## FALSE  TRUE 
## 15451   184

row_indices <- which(contamdf.prev.05$contaminant) #grab the row indices that correspond with identified contaminants to locate taxonomic information in the corresponding OTU file

It is recommended that you extract the taxonomic information of identified contaminants and review the literature in your study area to learn more about their expected presence/absence in your research.

taxonomy_table <- tibble()

for (i in row_indices){
  loc <-  contamdf.prev.05[i, 0]
  tax_key <- row.names(loc)
  tax_value <- otu_taxonomy[tax_key, ]
  taxonomy_table <- rbind(taxonomy_table, tax_value)
}

## Warning: One or more parsing issues, see `problems()` for details

names(taxonomy_table) <- classification
datatable(taxonomy_table)

Finally, you can quickly prune contaminant taxa using phyloseq’s prune_taxa function.

final_biom <- prune_taxa(!contamdf.prev.05$contaminant, a)
final_biom

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 15451 taxa and 54 samples ]
## sample_data() Sample Data:       [ 54 samples by 9 sample variables ]
## tax_table()   Taxonomy Table:    [ 15451 taxa by 7 taxonomic ranks ]

sessionInfo()

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.19         forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
##  [5] purrr_0.3.4     readr_2.0.2     tidyr_1.1.4     tibble_3.1.5   
##  [9] ggplot2_3.3.5   tidyverse_1.3.1 decontam_1.13.0 phyloseq_1.30.0
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-153        fs_1.5.0            bit64_4.0.5        
##  [4] lubridate_1.8.0     httr_1.4.2          tools_3.6.3        
##  [7] backports_1.3.0     bslib_0.3.1         utf8_1.2.2         
## [10] R6_2.5.1            vegan_2.5-7         DBI_1.1.1          
## [13] BiocGenerics_0.32.0 mgcv_1.8-38         colorspace_2.0-2   
## [16] permute_0.9-5       ade4_1.7-18         withr_2.4.2        
## [19] tidyselect_1.1.1    bit_4.0.4           compiler_3.6.3     
## [22] cli_3.1.0           rvest_1.0.2         Biobase_2.46.0     
## [25] xml2_1.3.2          labeling_0.4.2      sass_0.4.0         
## [28] scales_1.1.1        digest_0.6.28       rmarkdown_2.11     
## [31] XVector_0.26.0      pkgconfig_2.0.3     htmltools_0.5.2    
## [34] highr_0.9           dbplyr_2.1.1        fastmap_1.1.0      
## [37] htmlwidgets_1.5.4   rlang_0.4.12        readxl_1.3.1       
## [40] rstudioapi_0.13     farver_2.1.0        jquerylib_0.1.4    
## [43] generics_0.1.1      jsonlite_1.7.2      crosstalk_1.1.1    
## [46] vroom_1.5.5         magrittr_2.0.1      biomformat_1.14.0  
## [49] Matrix_1.3-4        Rcpp_1.0.7          munsell_0.5.0      
## [52] S4Vectors_0.24.4    Rhdf5lib_1.8.0      fansi_0.5.0        
## [55] ape_5.5             lifecycle_1.0.1     stringi_1.7.5      
## [58] yaml_2.2.1          MASS_7.3-54         zlibbioc_1.32.0    
## [61] rhdf5_2.30.1        plyr_1.8.6          grid_3.6.3         
## [64] parallel_3.6.3      crayon_1.4.2        lattice_0.20-45    
## [67] Biostrings_2.54.0   haven_2.4.3         splines_3.6.3      
## [70] multtest_2.42.0     hms_1.1.1           knitr_1.36         
## [73] pillar_1.6.4        igraph_1.2.7        reshape2_1.4.4     
## [76] codetools_0.2-18    stats4_3.6.3        reprex_2.0.1       
## [79] glue_1.4.2          evaluate_0.14       data.table_1.14.2  
## [82] modelr_0.1.8        vctrs_0.3.8         tzdb_0.2.0         
## [85] foreach_1.5.1       cellranger_1.1.0    gtable_0.3.0       
## [88] assertthat_0.2.1    xfun_0.27           broom_0.7.10       
## [91] survival_3.2-13     iterators_1.0.13    IRanges_2.20.2     
## [94] cluster_2.1.2       ellipsis_0.3.2

Need help in with your microbiome data analysis? Learn more about our Microbiome Data Analysis Suite offered by the BioData Lab.

Or schedule a consultation with us today!

Detecting contamination in microbiome research with decontam

LaShanda Rena Williams, PhD (BioData Lab)

11/15/2021