COD_20230927_MGK_IBS7048_week5_introduction_to_DADA2

Minsik Kim

2023-09-27

DADA2

You can find the original DADA pipeline at

https://benjjneb.github.io/dada2/

Meanwhile, I will add

Installing DADA2

Binaries for the current release version of DADA2 (1.26) are available from Bioconductor. Note that you must have R 4.2.0 or newer, and Bioconductor version 3.16, to install the most current release from Bioconductor.

If you are using R4.3 or higher, you should install dada2 version 3.17.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("dada2", version = "3.17")

Load DADA2 to your environment

Now, your DADA2 package was installed to your computer. That does not mean it is loaded to your working environment.

To double-check the loaded packages, you can run sessionInfo().

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Seoul
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] yaml_2.3.7      lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0  
##  [5] dplyr_1.1.3     purrr_1.0.2     readr_2.1.4     tidyr_1.3.0    
##  [9] tibble_3.2.1    ggplot2_3.4.3   tidyverse_2.0.0 pacman_0.5.1   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4        jsonlite_1.8.7      compiler_4.3.1     
##  [4] BiocManager_1.30.22 tidyselect_1.2.0    jquerylib_0.1.4    
##  [7] scales_1.2.1        fastmap_1.1.1       R6_2.5.1           
## [10] generics_0.1.3      knitr_1.44          bookdown_0.35      
## [13] munsell_0.5.0       tzdb_0.4.0          bslib_0.5.1        
## [16] pillar_1.9.0        rlang_1.1.1         utf8_1.2.3         
## [19] stringi_1.7.12      cachem_1.0.8        xfun_0.40          
## [22] sass_0.4.7          timechange_0.2.0    cli_3.6.1          
## [25] withr_2.5.1         magrittr_2.0.3      rmdformats_1.0.4   
## [28] digest_0.6.33       grid_4.3.1          rstudioapi_0.15.0  
## [31] hms_1.1.3           lifecycle_1.0.3     vctrs_0.6.3        
## [34] evaluate_0.21       glue_1.6.2          fansi_1.0.4        
## [37] colorspace_2.1-0    rmarkdown_2.25      tools_4.3.1        
## [40] pkgconfig_2.0.3     htmltools_0.5.6

As you can see, DADA2 is not loaded. To load package, use library() function.

library(dada2)


sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Seoul
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dada2_1.28.0    Rcpp_1.0.11     yaml_2.3.7      lubridate_1.9.2
##  [5] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3     purrr_1.0.2    
##  [9] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.3  
## [13] tidyverse_2.0.0 pacman_0.5.1   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0            Biostrings_2.68.1          
##  [3] bitops_1.0-7                fastmap_1.1.1              
##  [5] RCurl_1.98-1.12             GenomicAlignments_1.36.0   
##  [7] digest_0.6.33               timechange_0.2.0           
##  [9] lifecycle_1.0.3             magrittr_2.0.3             
## [11] compiler_4.3.1              rlang_1.1.1                
## [13] sass_0.4.7                  tools_4.3.1                
## [15] utf8_1.2.3                  knitr_1.44                 
## [17] S4Arrays_1.0.6              interp_1.1-4               
## [19] DelayedArray_0.26.7         plyr_1.8.8                 
## [21] RColorBrewer_1.1-3          ShortRead_1.58.0           
## [23] abind_1.4-5                 BiocParallel_1.34.2        
## [25] withr_2.5.1                 hwriter_1.3.2.1            
## [27] BiocGenerics_0.46.0         grid_4.3.1                 
## [29] stats4_4.3.1                fansi_1.0.4                
## [31] latticeExtra_0.6-30         colorspace_2.1-0           
## [33] scales_1.2.1                SummarizedExperiment_1.30.2
## [35] cli_3.6.1                   rmarkdown_2.25             
## [37] crayon_1.5.2                generics_0.1.3             
## [39] RcppParallel_5.1.7          rstudioapi_0.15.0          
## [41] reshape2_1.4.4              tzdb_0.4.0                 
## [43] cachem_1.0.8                zlibbioc_1.46.0            
## [45] parallel_4.3.1              BiocManager_1.30.22        
## [47] XVector_0.40.0              matrixStats_1.0.0          
## [49] rmdformats_1.0.4            vctrs_0.6.3                
## [51] Matrix_1.6-1.1              jsonlite_1.8.7             
## [53] bookdown_0.35               IRanges_2.34.1             
## [55] hms_1.1.3                   S4Vectors_0.38.2           
## [57] jpeg_0.1-10                 jquerylib_0.1.4            
## [59] glue_1.6.2                  codetools_0.2-19           
## [61] stringi_1.7.12              gtable_0.3.4               
## [63] deldir_1.0-9                GenomeInfoDb_1.36.3        
## [65] GenomicRanges_1.52.0        munsell_0.5.0              
## [67] pillar_1.9.0                htmltools_0.5.6            
## [69] GenomeInfoDbData_1.2.10     R6_2.5.1                   
## [71] evaluate_0.21               Biobase_2.60.0             
## [73] lattice_0.21-8              png_0.1-8                  
## [75] Rsamtools_2.16.0            bslib_0.5.1                
## [77] xfun_0.40                   MatrixGenerics_1.12.3      
## [79] pkgconfig_2.0.3

You can see the dada2 package was loaded.

Opening multiple files for dada2 in R

Without pointing the exact path of a file that you want to play with, R will employ thing at your current working directory.

getwd() will show you the current directory that you are working on.

getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics"

list.files() function shows all the files in the current list.

list.files()
##  [1] "BTE3207_Advanced_Biostatistics.Rproj"
##  [2] "COD_week1_2_MGK_BTE3207.html"        
##  [3] "COD_week1_2_MGK_BTE3207.R"           
##  [4] "COD_week1_2_MGK_BTE3207.Rmd"         
##  [5] "COD_week2_1_MGK_BTE3207.html"        
##  [6] "COD_week2_1_MGK_BTE3207.Rmd"         
##  [7] "COD_week2_2_MGK_BTE3207.html"        
##  [8] "COD_week2_2_MGK_BTE3207.Rmd"         
##  [9] "COD_week4_1_MGK_BTE3207.html"        
## [10] "COD_week4_1_MGK_BTE3207.R"           
## [11] "COD_week4_1_MGK_BTE3207.Rmd"         
## [12] "COD_week4_2_MGK_BTE3207.html"        
## [13] "COD_week5_1_MGK_BTE3207.html"        
## [14] "COD_week5_1_MGK_BTE3207.Rmd"         
## [15] "COD_week5_2_MGK_BTE3207.Rmd"         
## [16] "confounder_plots.pdf"                
## [17] "dataset"                             
## [18] "interpretation.pdf"                  
## [19] "README.md"                           
## [20] "rsconnect"

You can also set the path of a folder of your interest for list.files().

list.files("/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset")
##  [1] "F3D0_S188_L001_R1_001.fastq"        "F3D0_S188_L001_R2_001.fastq"       
##  [3] "F3D1_S189_L001_R1_001.fastq"        "F3D1_S189_L001_R2_001.fastq"       
##  [5] "F3D141_S207_L001_R1_001.fastq"      "F3D141_S207_L001_R2_001.fastq"     
##  [7] "F3D142_S208_L001_R1_001.fastq"      "F3D142_S208_L001_R2_001.fastq"     
##  [9] "F3D143_S209_L001_R1_001.fastq"      "F3D143_S209_L001_R2_001.fastq"     
## [11] "F3D144_S210_L001_R1_001.fastq"      "F3D144_S210_L001_R2_001.fastq"     
## [13] "F3D145_S211_L001_R1_001.fastq"      "F3D145_S211_L001_R2_001.fastq"     
## [15] "F3D146_S212_L001_R1_001.fastq"      "F3D146_S212_L001_R2_001.fastq"     
## [17] "F3D147_S213_L001_R1_001.fastq"      "F3D147_S213_L001_R2_001.fastq"     
## [19] "F3D148_S214_L001_R1_001.fastq"      "F3D148_S214_L001_R2_001.fastq"     
## [21] "F3D149_S215_L001_R1_001.fastq"      "F3D149_S215_L001_R2_001.fastq"     
## [23] "F3D150_S216_L001_R1_001.fastq"      "F3D150_S216_L001_R2_001.fastq"     
## [25] "F3D2_S190_L001_R1_001.fastq"        "F3D2_S190_L001_R2_001.fastq"       
## [27] "F3D3_S191_L001_R1_001.fastq"        "F3D3_S191_L001_R2_001.fastq"       
## [29] "F3D5_S193_L001_R1_001.fastq"        "F3D5_S193_L001_R2_001.fastq"       
## [31] "F3D6_S194_L001_R1_001.fastq"        "F3D6_S194_L001_R2_001.fastq"       
## [33] "F3D7_S195_L001_R1_001.fastq"        "F3D7_S195_L001_R2_001.fastq"       
## [35] "F3D8_S196_L001_R1_001.fastq"        "F3D8_S196_L001_R2_001.fastq"       
## [37] "F3D9_S197_L001_R1_001.fastq"        "F3D9_S197_L001_R2_001.fastq"       
## [39] "filtered"                           "HMP_MOCK.v35.fasta"                
## [41] "Mock_S280_L001_R1_001.fastq"        "Mock_S280_L001_R2_001.fastq"       
## [43] "mouse.dpw.metadata"                 "mouse.time.design"                 
## [45] "silva_nr99_v138.1_train_set.fa"     "silva_nr99_v138.1_train_set.fa.zip"
## [47] "stability.batch"                    "stability.files"

This is my directory with the example files.

You can also navigate folders in R. Try typing list.files("/") and press tab.

Anyway, I can set name this path as path, as it is too long.

minsik_path = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset"

# you can also use `<-` instead of `=`.

This is my directory with the example files.

As now path is having the information of my location, I can substitute the long path name with path

list.files(minsik_path)
##  [1] "F3D0_S188_L001_R1_001.fastq"        "F3D0_S188_L001_R2_001.fastq"       
##  [3] "F3D1_S189_L001_R1_001.fastq"        "F3D1_S189_L001_R2_001.fastq"       
##  [5] "F3D141_S207_L001_R1_001.fastq"      "F3D141_S207_L001_R2_001.fastq"     
##  [7] "F3D142_S208_L001_R1_001.fastq"      "F3D142_S208_L001_R2_001.fastq"     
##  [9] "F3D143_S209_L001_R1_001.fastq"      "F3D143_S209_L001_R2_001.fastq"     
## [11] "F3D144_S210_L001_R1_001.fastq"      "F3D144_S210_L001_R2_001.fastq"     
## [13] "F3D145_S211_L001_R1_001.fastq"      "F3D145_S211_L001_R2_001.fastq"     
## [15] "F3D146_S212_L001_R1_001.fastq"      "F3D146_S212_L001_R2_001.fastq"     
## [17] "F3D147_S213_L001_R1_001.fastq"      "F3D147_S213_L001_R2_001.fastq"     
## [19] "F3D148_S214_L001_R1_001.fastq"      "F3D148_S214_L001_R2_001.fastq"     
## [21] "F3D149_S215_L001_R1_001.fastq"      "F3D149_S215_L001_R2_001.fastq"     
## [23] "F3D150_S216_L001_R1_001.fastq"      "F3D150_S216_L001_R2_001.fastq"     
## [25] "F3D2_S190_L001_R1_001.fastq"        "F3D2_S190_L001_R2_001.fastq"       
## [27] "F3D3_S191_L001_R1_001.fastq"        "F3D3_S191_L001_R2_001.fastq"       
## [29] "F3D5_S193_L001_R1_001.fastq"        "F3D5_S193_L001_R2_001.fastq"       
## [31] "F3D6_S194_L001_R1_001.fastq"        "F3D6_S194_L001_R2_001.fastq"       
## [33] "F3D7_S195_L001_R1_001.fastq"        "F3D7_S195_L001_R2_001.fastq"       
## [35] "F3D8_S196_L001_R1_001.fastq"        "F3D8_S196_L001_R2_001.fastq"       
## [37] "F3D9_S197_L001_R1_001.fastq"        "F3D9_S197_L001_R2_001.fastq"       
## [39] "filtered"                           "HMP_MOCK.v35.fasta"                
## [41] "Mock_S280_L001_R1_001.fastq"        "Mock_S280_L001_R2_001.fastq"       
## [43] "mouse.dpw.metadata"                 "mouse.time.design"                 
## [45] "silva_nr99_v138.1_train_set.fa"     "silva_nr99_v138.1_train_set.fa.zip"
## [47] "stability.batch"                    "stability.files"

Chooing files with pattern

From the list of files, we only want to select .fastq files. For that, I am going to make a list of files.

Since this file list is having forward and reverse reads per one sample set, we need to separate those files into two groups as well.

list.files(minsik_path, pattern = "_R1_001")
##  [1] "F3D0_S188_L001_R1_001.fastq"   "F3D1_S189_L001_R1_001.fastq"  
##  [3] "F3D141_S207_L001_R1_001.fastq" "F3D142_S208_L001_R1_001.fastq"
##  [5] "F3D143_S209_L001_R1_001.fastq" "F3D144_S210_L001_R1_001.fastq"
##  [7] "F3D145_S211_L001_R1_001.fastq" "F3D146_S212_L001_R1_001.fastq"
##  [9] "F3D147_S213_L001_R1_001.fastq" "F3D148_S214_L001_R1_001.fastq"
## [11] "F3D149_S215_L001_R1_001.fastq" "F3D150_S216_L001_R1_001.fastq"
## [13] "F3D2_S190_L001_R1_001.fastq"   "F3D3_S191_L001_R1_001.fastq"  
## [15] "F3D5_S193_L001_R1_001.fastq"   "F3D6_S194_L001_R1_001.fastq"  
## [17] "F3D7_S195_L001_R1_001.fastq"   "F3D8_S196_L001_R1_001.fastq"  
## [19] "F3D9_S197_L001_R1_001.fastq"   "Mock_S280_L001_R1_001.fastq"

Using pattern = argument will make a subset of list of files including the pattern noted in the quotation mark "".

With another argument, full.names = TRUE, it will give you the list of full-path-names of the files with the specified pattern.

list.files(minsik_path, pattern = "_R1_001", full.names = TRUE)
##  [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D0_S188_L001_R1_001.fastq"  
##  [2] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D1_S189_L001_R1_001.fastq"  
##  [3] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D141_S207_L001_R1_001.fastq"
##  [4] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D142_S208_L001_R1_001.fastq"
##  [5] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D143_S209_L001_R1_001.fastq"
##  [6] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D144_S210_L001_R1_001.fastq"
##  [7] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D145_S211_L001_R1_001.fastq"
##  [8] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D146_S212_L001_R1_001.fastq"
##  [9] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D147_S213_L001_R1_001.fastq"
## [10] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D148_S214_L001_R1_001.fastq"
## [11] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D149_S215_L001_R1_001.fastq"
## [12] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D150_S216_L001_R1_001.fastq"
## [13] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D2_S190_L001_R1_001.fastq"  
## [14] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D3_S191_L001_R1_001.fastq"  
## [15] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D5_S193_L001_R1_001.fastq"  
## [16] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D6_S194_L001_R1_001.fastq"  
## [17] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D7_S195_L001_R1_001.fastq"  
## [18] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D8_S196_L001_R1_001.fastq"  
## [19] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D9_S197_L001_R1_001.fastq"  
## [20] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/Mock_S280_L001_R1_001.fastq"

So, now what?

We are going to store their list of names of files, so that some other function can use that information for loading that data at once.

forward_read_files <- list.files(minsik_path, pattern="_R1_001.fastq", full.names = TRUE)
reverse_read_files <- list.files(minsik_path, pattern="_R2_001.fastq", full.names = TRUE)

Meanwhile, they can be having different orders. So it would be helpful if we can sort the data right after loading it.

library(tidyverse)
forward_read_files <- list.files(minsik_path, pattern="_R1_001.fastq", full.names = TRUE) %>% sort()
reverse_read_files <- list.files(minsik_path, pattern="_R2_001.fastq", full.names = TRUE) %>% sort()

Manipulating characters

Plus, after processing the data, we need to get 1 sample name per 1 sample. It can be extracted from either one of the forward and reverse reads.

basename() removes all the higher path of file path.

forward_read_files %>% 
        basename()
##  [1] "F3D0_S188_L001_R1_001.fastq"   "F3D1_S189_L001_R1_001.fastq"  
##  [3] "F3D141_S207_L001_R1_001.fastq" "F3D142_S208_L001_R1_001.fastq"
##  [5] "F3D143_S209_L001_R1_001.fastq" "F3D144_S210_L001_R1_001.fastq"
##  [7] "F3D145_S211_L001_R1_001.fastq" "F3D146_S212_L001_R1_001.fastq"
##  [9] "F3D147_S213_L001_R1_001.fastq" "F3D148_S214_L001_R1_001.fastq"
## [11] "F3D149_S215_L001_R1_001.fastq" "F3D150_S216_L001_R1_001.fastq"
## [13] "F3D2_S190_L001_R1_001.fastq"   "F3D3_S191_L001_R1_001.fastq"  
## [15] "F3D5_S193_L001_R1_001.fastq"   "F3D6_S194_L001_R1_001.fastq"  
## [17] "F3D7_S195_L001_R1_001.fastq"   "F3D8_S196_L001_R1_001.fastq"  
## [19] "F3D9_S197_L001_R1_001.fastq"   "Mock_S280_L001_R1_001.fastq"

str_split_fixed() will separate all the charaters separated by the pattern specified in quotation mark "" and generate a matrix of the separated character. The number of separated output will be set as you set. (you can think of the import wizard in Excel).

forward_read_files %>% 
        basename() %>% 
        str_split_fixed(pattern = "_", n = 5)
##       [,1]     [,2]   [,3]   [,4] [,5]       
##  [1,] "F3D0"   "S188" "L001" "R1" "001.fastq"
##  [2,] "F3D1"   "S189" "L001" "R1" "001.fastq"
##  [3,] "F3D141" "S207" "L001" "R1" "001.fastq"
##  [4,] "F3D142" "S208" "L001" "R1" "001.fastq"
##  [5,] "F3D143" "S209" "L001" "R1" "001.fastq"
##  [6,] "F3D144" "S210" "L001" "R1" "001.fastq"
##  [7,] "F3D145" "S211" "L001" "R1" "001.fastq"
##  [8,] "F3D146" "S212" "L001" "R1" "001.fastq"
##  [9,] "F3D147" "S213" "L001" "R1" "001.fastq"
## [10,] "F3D148" "S214" "L001" "R1" "001.fastq"
## [11,] "F3D149" "S215" "L001" "R1" "001.fastq"
## [12,] "F3D150" "S216" "L001" "R1" "001.fastq"
## [13,] "F3D2"   "S190" "L001" "R1" "001.fastq"
## [14,] "F3D3"   "S191" "L001" "R1" "001.fastq"
## [15,] "F3D5"   "S193" "L001" "R1" "001.fastq"
## [16,] "F3D6"   "S194" "L001" "R1" "001.fastq"
## [17,] "F3D7"   "S195" "L001" "R1" "001.fastq"
## [18,] "F3D8"   "S196" "L001" "R1" "001.fastq"
## [19,] "F3D9"   "S197" "L001" "R1" "001.fastq"
## [20,] "Mock"   "S280" "L001" "R1" "001.fastq"

if you choose 1st column from the data, using .[,1], it wlll show the name of files after removing all the characters after the first underscore _.

forward_read_files %>% 
        basename() %>% 
        str_split_fixed("_", 5) %>% 
        .[,1]
##  [1] "F3D0"   "F3D1"   "F3D141" "F3D142" "F3D143" "F3D144" "F3D145" "F3D146"
##  [9] "F3D147" "F3D148" "F3D149" "F3D150" "F3D2"   "F3D3"   "F3D5"   "F3D6"  
## [17] "F3D7"   "F3D8"   "F3D9"   "Mock"

I am going to store this list of file names for future use.

sample.names <- forward_read_files %>% 
        basename() %>% 
        str_split_fixed("_", 5) %>% 
        .[,1]

Plotting

DADA2 have a cool function called plotQualityProfile(), which will automatically plot QC scores by length of all the files that were listed. Such as,

If I want to plot the QC profile of the f

forward_read_files[1]
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D0_S188_L001_R1_001.fastq"
plotQualityProfile(forward_read_files[1])

It will show the total reads, QC score, and the length of sequencing reads at the same time.

We can do this for multiple samples, by using

plotQualityProfile(
        c(forward_read_files[1], forward_read_files[2])
)

or

plotQualityProfile(forward_read_files[1:2])

or

plotQualityProfile(forward_read_files[1:10])

Let’s check reverse reads as well.

plotQualityProfile(reverse_read_files[1:10])

As you can see, we have high quality forward reads but low quality reverse reads. These low quality reads need to be removed before mering these files into one complementary read file.

Filter and trim

Before making filtered reads, let’s set an location and names for them.

Using paste() function, we can manipulate charater variables like ties.

paste0(sample.names, "_sample")
##  [1] "F3D0_sample"   "F3D1_sample"   "F3D141_sample" "F3D142_sample"
##  [5] "F3D143_sample" "F3D144_sample" "F3D145_sample" "F3D146_sample"
##  [9] "F3D147_sample" "F3D148_sample" "F3D149_sample" "F3D150_sample"
## [13] "F3D2_sample"   "F3D3_sample"   "F3D5_sample"   "F3D6_sample"  
## [17] "F3D7_sample"   "F3D8_sample"   "F3D9_sample"   "Mock_sample"

Using paste0, we can create a list of new names at once.

# Place filtered files in filtered/ subdirectory
filtered_forward_reads <- file.path(minsik_path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtered_reverse_reads <- file.path(minsik_path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))

We can also assign names of each element in that list, using names() function.

names(filtered_forward_reads) <- sample.names
names(filtered_reverse_reads) <- sample.names

Now, we are using the dada2 filtering function.

filterAndTrim() have multiple options for filtering.

  1. the path of original file (forward)
  2. the path of new file (forward)
  3. the path of original file (reverse)
  4. the path of new file (reverse)
  5. truncLen: Reads after 160th bp showed lower QC scores. Lets remove them, from 160 to 250.
  6. maxEE: Maximum error after truncation
  7. compress=TRUE: The output files will be gzipped. 8 multithread=TRUE: Turn on this option if you are using Windows OS

The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores.

out <- filterAndTrim(forward_read_files, #name of forward raw reads
                     filtered_forward_reads, #name of filtered forward reads 
                     reverse_read_files,
                     filtered_reverse_reads,
                     truncLen=c(250,160),# Reads after 160th bp showed lower QC scores. Lets remove them!
                     compress=TRUE,
                     #multithread=TRUE, # On Windows set multithread=FALSE
                     maxEE=c(2,2))
head(out)
##                               reads.in reads.out
## F3D0_S188_L001_R1_001.fastq       7793      7029
## F3D1_S189_L001_R1_001.fastq       5869      5216
## F3D141_S207_L001_R1_001.fastq     5958      5400
## F3D142_S208_L001_R1_001.fastq     3183      2897
## F3D143_S209_L001_R1_001.fastq     3178      2912
## F3D144_S210_L001_R1_001.fastq     4827      4262

You can see some reads were removed from the fastq file!

and they are now stored in a new directory.

list.files(path = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/filtered/")
##  [1] "F3D0_F_filt.fastq.gz"   "F3D0_R_filt.fastq.gz"   "F3D1_F_filt.fastq.gz"  
##  [4] "F3D1_R_filt.fastq.gz"   "F3D141_F_filt.fastq.gz" "F3D141_R_filt.fastq.gz"
##  [7] "F3D142_F_filt.fastq.gz" "F3D142_R_filt.fastq.gz" "F3D143_F_filt.fastq.gz"
## [10] "F3D143_R_filt.fastq.gz" "F3D144_F_filt.fastq.gz" "F3D144_R_filt.fastq.gz"
## [13] "F3D145_F_filt.fastq.gz" "F3D145_R_filt.fastq.gz" "F3D146_F_filt.fastq.gz"
## [16] "F3D146_R_filt.fastq.gz" "F3D147_F_filt.fastq.gz" "F3D147_R_filt.fastq.gz"
## [19] "F3D148_F_filt.fastq.gz" "F3D148_R_filt.fastq.gz" "F3D149_F_filt.fastq.gz"
## [22] "F3D149_R_filt.fastq.gz" "F3D150_F_filt.fastq.gz" "F3D150_R_filt.fastq.gz"
## [25] "F3D2_F_filt.fastq.gz"   "F3D2_R_filt.fastq.gz"   "F3D3_F_filt.fastq.gz"  
## [28] "F3D3_R_filt.fastq.gz"   "F3D5_F_filt.fastq.gz"   "F3D5_R_filt.fastq.gz"  
## [31] "F3D6_F_filt.fastq.gz"   "F3D6_R_filt.fastq.gz"   "F3D7_F_filt.fastq.gz"  
## [34] "F3D7_R_filt.fastq.gz"   "F3D8_F_filt.fastq.gz"   "F3D8_R_filt.fastq.gz"  
## [37] "F3D9_F_filt.fastq.gz"   "F3D9_R_filt.fastq.gz"   "Mock_F_filt.fastq.gz"  
## [40] "Mock_R_filt.fastq.gz"

Error prediction

The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).

error_F <- learnErrors(filtered_forward_reads
                       #, multithread=TRUE #
                       )
## 34523750 total bases in 138095 reads from 20 samples will be used for learning the error rates.
error_R <- learnErrors(filtered_reverse_reads
                       #, multithread=TRUE
                       )
## 22095200 total bases in 138095 reads from 20 samples will be used for learning the error rates.

We can double-chekc when the error is ocurring in our sequencing file, using plotErrors() function.

plotErrors(error_F, nominalQ=TRUE)

Using this predicted error (which can be also called as expected error) in adjusting the actuall error happedn in our data set. In other words, errors that is higher than the expectation will be strongly filtered.

Use dada() function to remove errors and get unique sequences!

forward_dada <- dada(filtered_forward_reads, err=error_F)
## Sample 1 - 7029 reads in 2157 unique sequences.
## Sample 2 - 5216 reads in 1755 unique sequences.
## Sample 3 - 5400 reads in 1584 unique sequences.
## Sample 4 - 2897 reads in 976 unique sequences.
## Sample 5 - 2912 reads in 1004 unique sequences.
## Sample 6 - 4262 reads in 1362 unique sequences.
## Sample 7 - 6656 reads in 1872 unique sequences.
## Sample 8 - 4499 reads in 1547 unique sequences.
## Sample 9 - 15469 reads in 3893 unique sequences.
## Sample 10 - 11270 reads in 2989 unique sequences.
## Sample 11 - 11875 reads in 3276 unique sequences.
## Sample 12 - 4985 reads in 1684 unique sequences.
## Sample 13 - 17906 reads in 4052 unique sequences.
## Sample 14 - 6184 reads in 1595 unique sequences.
## Sample 15 - 4007 reads in 1305 unique sequences.
## Sample 16 - 7292 reads in 1996 unique sequences.
## Sample 17 - 4724 reads in 1263 unique sequences.
## Sample 18 - 4817 reads in 1495 unique sequences.
## Sample 19 - 6452 reads in 1871 unique sequences.
## Sample 20 - 4243 reads in 937 unique sequences.
#multithread=TRUE for windows
reverse_dada <- dada(filtered_reverse_reads, err=error_R)
## Sample 1 - 7029 reads in 1622 unique sequences.
## Sample 2 - 5216 reads in 1311 unique sequences.
## Sample 3 - 5400 reads in 1311 unique sequences.
## Sample 4 - 2897 reads in 843 unique sequences.
## Sample 5 - 2912 reads in 866 unique sequences.
## Sample 6 - 4262 reads in 1265 unique sequences.
## Sample 7 - 6656 reads in 1765 unique sequences.
## Sample 8 - 4499 reads in 1234 unique sequences.
## Sample 9 - 15469 reads in 3355 unique sequences.
## Sample 10 - 11270 reads in 2462 unique sequences.
## Sample 11 - 11875 reads in 2704 unique sequences.
## Sample 12 - 4985 reads in 1389 unique sequences.
## Sample 13 - 17906 reads in 3215 unique sequences.
## Sample 14 - 6184 reads in 1365 unique sequences.
## Sample 15 - 4007 reads in 1108 unique sequences.
## Sample 16 - 7292 reads in 1598 unique sequences.
## Sample 17 - 4724 reads in 1062 unique sequences.
## Sample 18 - 4817 reads in 1135 unique sequences.
## Sample 19 - 6452 reads in 1472 unique sequences.
## Sample 20 - 4243 reads in 707 unique sequences.
#multithread=TRUE for windows
forward_dada[[1]]
## dada-class: object describing DADA2 denoising results
## 124 sequence variants were inferred from 2157 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

Now, we have sequencing variants after considering errors.

Mering paired reads

Unique sequences can be merged into single, complementary reads (forward + reverse) using mergePairs() function.

mergers <- mergePairs(forward_dada, 
                      filtered_forward_reads,
                      reverse_dada,
                      filtered_reverse_reads,
                      verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
##                                                                                                                                                                                                                                                       sequence
## 1 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGAGAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGGGTATCGAACAGG
## 2 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGCCAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTACAGCCGTTGAAACTGCCGGGCTCGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAGG
## 3 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTGTTAAGTCAGCGGTCAAATGTCGGGGCTCAACCCCGGCCTGCCGTTGAAACTGGCGGCCTCGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCGACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACAGG
## 4 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTTTTAAGTCAGCGGTAAAAATTCGGGGCTCAACCCCGTCCGGCCGTTGAAACTGGGGGCCTTGAGTGGGCGAGAAGAAGGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCCTTCCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGGGGATCGAACAGG
## 5 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGACTCTCAAGTCAGCGGTCAAATCGCGGGGCTCAACCCCGTTCCGCCGTTGAAACTGGGAGCCTTGAGTGCGCGAGAAGTAGGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCCTACCGGCGCGCAACTGACGCTCATGCACGAAAGCGTGGGTATCGAACAGG
## 6 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACAGG
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1       577       1       1    158         0      0      1   TRUE
## 2       467       2       2    158         0      0      2   TRUE
## 3       444       3       4    158         0      0      1   TRUE
## 4       427       4       3    158         0      0      2   TRUE
## 5       345       5       6    158         0      0      1   TRUE
## 6       282       6       5    158         0      0      2   TRUE

Sequence table

Table of sequences

seqtab <- makeSequenceTable(mergers)
dim(seqtab)
## [1]  20 291

20 fiels, 291 unique sequencing reads

Length of merged sequences

table(nchar(getSequences(seqtab)))
## 
## 251 252 253 254 255 
##   1  93 189   6   2

Remove chimeras

The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences.

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab.nochim)
## [1]  20 225

291-225 = 66 reads were chimeras.

sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9612192

4% of merged readsa were chimeras.

Track reads through the pipeline

As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:

getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               sapply(forward_dada, getN),
               sapply(reverse_dada, getN), 
               sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
##        input filtered denoisedF denoisedR merged nonchim
## F3D0    7793     7029      6867      6891   6446    6424
## F3D1    5869     5216      5079      5166   4862    4851
## F3D141  5958     5400      5252      5295   4926    4788
## F3D142  3183     2897      2768      2814   2569    2495
## F3D143  3178     2912      2776      2838   2487    2453
## F3D144  4827     4262      4122      4180   3640    3440

Assign taxonomy

It is common at this point, especially in 16S/18S/ITS amplicon sequencing, to assign taxonomy to the sequence variants. The DADA2 package provides a native implementation of the naive Bayesian classifier method for this purpose. The assignTaxonomy function takes as input a set of sequences to be classified and a training set of reference sequences with known taxonomy, and outputs taxonomic assignments with at least minBoot bootstrap confidence.

We maintain formatted training fastas for the RDP training set, GreenGenes clustered at 97% identity, and the Silva reference database, and additional trainings fastas suitable for protists and certain specific environments have been contributed. For fungal taxonomy, the General Fasta release files from the UNITE ITS database can be used as is. To follow along, download the silva_nr_v132_train_set.fa.gz file, and place it in the directory with the fastq files.

taxa <- assignTaxonomy(seqtab.nochim,
                       paste0(minsik_path, "/silva_nr99_v138.1_train_set.fa"),
                       multithread=TRUE)
view(taxa)

Read count information (to compare their abundances between taxa within sample)

view(seqtab.nochim)

Bibliography

## Computing. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>. We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also 'citation("pkgname")' for citing R packages.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
## J, reikoch, Beasley W, O'Connor B, Warnes GR, Quinn M, Kamvar ZN (2023). yaml: Methods to Convert R Data to YAML and Back_. R package version 2.3.7, <https://CRAN.R-project.org/package=yaml>. ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'.