DADA2
You can find the original DADA pipeline at
https://benjjneb.github.io/dada2/
Meanwhile, I will add
Installing DADA2
Binaries for the current release version of DADA2 (1.26) are available from Bioconductor. Note that you must have R 4.2.0 or newer, and Bioconductor version 3.16, to install the most current release from Bioconductor.
If you are using R4.3 or higher, you should install dada2 version 3.17.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("dada2", version = "3.17")
Load DADA2 to your environment
Now, your DADA2 package was installed to your computer. That does not mean it is loaded to your working environment.
To double-check the loaded packages, you can run
sessionInfo()
.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Seoul
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] yaml_2.3.7 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
## [5] dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
## [9] tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.4 jsonlite_1.8.7 compiler_4.3.1
## [4] BiocManager_1.30.22 tidyselect_1.2.0 jquerylib_0.1.4
## [7] scales_1.2.1 fastmap_1.1.1 R6_2.5.1
## [10] generics_0.1.3 knitr_1.44 bookdown_0.35
## [13] munsell_0.5.0 tzdb_0.4.0 bslib_0.5.1
## [16] pillar_1.9.0 rlang_1.1.1 utf8_1.2.3
## [19] stringi_1.7.12 cachem_1.0.8 xfun_0.40
## [22] sass_0.4.7 timechange_0.2.0 cli_3.6.1
## [25] withr_2.5.1 magrittr_2.0.3 rmdformats_1.0.4
## [28] digest_0.6.33 grid_4.3.1 rstudioapi_0.15.0
## [31] hms_1.1.3 lifecycle_1.0.3 vctrs_0.6.3
## [34] evaluate_0.21 glue_1.6.2 fansi_1.0.4
## [37] colorspace_2.1-0 rmarkdown_2.25 tools_4.3.1
## [40] pkgconfig_2.0.3 htmltools_0.5.6
As you can see, DADA2 is not loaded. To load package, use
library()
function.
library(dada2)
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Seoul
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dada2_1.28.0 Rcpp_1.0.11 yaml_2.3.7 lubridate_1.9.2
## [5] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2
## [9] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.3
## [13] tidyverse_2.0.0 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 Biostrings_2.68.1
## [3] bitops_1.0-7 fastmap_1.1.1
## [5] RCurl_1.98-1.12 GenomicAlignments_1.36.0
## [7] digest_0.6.33 timechange_0.2.0
## [9] lifecycle_1.0.3 magrittr_2.0.3
## [11] compiler_4.3.1 rlang_1.1.1
## [13] sass_0.4.7 tools_4.3.1
## [15] utf8_1.2.3 knitr_1.44
## [17] S4Arrays_1.0.6 interp_1.1-4
## [19] DelayedArray_0.26.7 plyr_1.8.8
## [21] RColorBrewer_1.1-3 ShortRead_1.58.0
## [23] abind_1.4-5 BiocParallel_1.34.2
## [25] withr_2.5.1 hwriter_1.3.2.1
## [27] BiocGenerics_0.46.0 grid_4.3.1
## [29] stats4_4.3.1 fansi_1.0.4
## [31] latticeExtra_0.6-30 colorspace_2.1-0
## [33] scales_1.2.1 SummarizedExperiment_1.30.2
## [35] cli_3.6.1 rmarkdown_2.25
## [37] crayon_1.5.2 generics_0.1.3
## [39] RcppParallel_5.1.7 rstudioapi_0.15.0
## [41] reshape2_1.4.4 tzdb_0.4.0
## [43] cachem_1.0.8 zlibbioc_1.46.0
## [45] parallel_4.3.1 BiocManager_1.30.22
## [47] XVector_0.40.0 matrixStats_1.0.0
## [49] rmdformats_1.0.4 vctrs_0.6.3
## [51] Matrix_1.6-1.1 jsonlite_1.8.7
## [53] bookdown_0.35 IRanges_2.34.1
## [55] hms_1.1.3 S4Vectors_0.38.2
## [57] jpeg_0.1-10 jquerylib_0.1.4
## [59] glue_1.6.2 codetools_0.2-19
## [61] stringi_1.7.12 gtable_0.3.4
## [63] deldir_1.0-9 GenomeInfoDb_1.36.3
## [65] GenomicRanges_1.52.0 munsell_0.5.0
## [67] pillar_1.9.0 htmltools_0.5.6
## [69] GenomeInfoDbData_1.2.10 R6_2.5.1
## [71] evaluate_0.21 Biobase_2.60.0
## [73] lattice_0.21-8 png_0.1-8
## [75] Rsamtools_2.16.0 bslib_0.5.1
## [77] xfun_0.40 MatrixGenerics_1.12.3
## [79] pkgconfig_2.0.3
You can see the dada2 package was loaded.
Opening multiple files for dada2 in R
Without pointing the exact path of a file that you want to play with, R will employ thing at your current working directory.
getwd()
will show you the current directory that you are
working on.
getwd()
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced biostatistics/scripts/BTE3207_Advanced_Biostatistics"
list.files()
function shows all the files in the current
list.
list.files()
## [1] "BTE3207_Advanced_Biostatistics.Rproj"
## [2] "COD_week1_2_MGK_BTE3207.html"
## [3] "COD_week1_2_MGK_BTE3207.R"
## [4] "COD_week1_2_MGK_BTE3207.Rmd"
## [5] "COD_week2_1_MGK_BTE3207.html"
## [6] "COD_week2_1_MGK_BTE3207.Rmd"
## [7] "COD_week2_2_MGK_BTE3207.html"
## [8] "COD_week2_2_MGK_BTE3207.Rmd"
## [9] "COD_week4_1_MGK_BTE3207.html"
## [10] "COD_week4_1_MGK_BTE3207.R"
## [11] "COD_week4_1_MGK_BTE3207.Rmd"
## [12] "COD_week4_2_MGK_BTE3207.html"
## [13] "COD_week5_1_MGK_BTE3207.html"
## [14] "COD_week5_1_MGK_BTE3207.Rmd"
## [15] "COD_week5_2_MGK_BTE3207.Rmd"
## [16] "confounder_plots.pdf"
## [17] "dataset"
## [18] "interpretation.pdf"
## [19] "README.md"
## [20] "rsconnect"
You can also set the path of a folder of your interest for
list.files()
.
list.files("/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset")
## [1] "F3D0_S188_L001_R1_001.fastq" "F3D0_S188_L001_R2_001.fastq"
## [3] "F3D1_S189_L001_R1_001.fastq" "F3D1_S189_L001_R2_001.fastq"
## [5] "F3D141_S207_L001_R1_001.fastq" "F3D141_S207_L001_R2_001.fastq"
## [7] "F3D142_S208_L001_R1_001.fastq" "F3D142_S208_L001_R2_001.fastq"
## [9] "F3D143_S209_L001_R1_001.fastq" "F3D143_S209_L001_R2_001.fastq"
## [11] "F3D144_S210_L001_R1_001.fastq" "F3D144_S210_L001_R2_001.fastq"
## [13] "F3D145_S211_L001_R1_001.fastq" "F3D145_S211_L001_R2_001.fastq"
## [15] "F3D146_S212_L001_R1_001.fastq" "F3D146_S212_L001_R2_001.fastq"
## [17] "F3D147_S213_L001_R1_001.fastq" "F3D147_S213_L001_R2_001.fastq"
## [19] "F3D148_S214_L001_R1_001.fastq" "F3D148_S214_L001_R2_001.fastq"
## [21] "F3D149_S215_L001_R1_001.fastq" "F3D149_S215_L001_R2_001.fastq"
## [23] "F3D150_S216_L001_R1_001.fastq" "F3D150_S216_L001_R2_001.fastq"
## [25] "F3D2_S190_L001_R1_001.fastq" "F3D2_S190_L001_R2_001.fastq"
## [27] "F3D3_S191_L001_R1_001.fastq" "F3D3_S191_L001_R2_001.fastq"
## [29] "F3D5_S193_L001_R1_001.fastq" "F3D5_S193_L001_R2_001.fastq"
## [31] "F3D6_S194_L001_R1_001.fastq" "F3D6_S194_L001_R2_001.fastq"
## [33] "F3D7_S195_L001_R1_001.fastq" "F3D7_S195_L001_R2_001.fastq"
## [35] "F3D8_S196_L001_R1_001.fastq" "F3D8_S196_L001_R2_001.fastq"
## [37] "F3D9_S197_L001_R1_001.fastq" "F3D9_S197_L001_R2_001.fastq"
## [39] "filtered" "HMP_MOCK.v35.fasta"
## [41] "Mock_S280_L001_R1_001.fastq" "Mock_S280_L001_R2_001.fastq"
## [43] "mouse.dpw.metadata" "mouse.time.design"
## [45] "silva_nr99_v138.1_train_set.fa" "silva_nr99_v138.1_train_set.fa.zip"
## [47] "stability.batch" "stability.files"
This is my directory with the example files.
You can also navigate folders in R. Try typing
list.files("/")
and press tab
.
Anyway, I can set name this path as path
, as it is too
long.
minsik_path = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset"
# you can also use `<-` instead of `=`.
This is my directory with the example files.
As now path
is having the information of my location, I
can substitute the long path name with path
list.files(minsik_path)
## [1] "F3D0_S188_L001_R1_001.fastq" "F3D0_S188_L001_R2_001.fastq"
## [3] "F3D1_S189_L001_R1_001.fastq" "F3D1_S189_L001_R2_001.fastq"
## [5] "F3D141_S207_L001_R1_001.fastq" "F3D141_S207_L001_R2_001.fastq"
## [7] "F3D142_S208_L001_R1_001.fastq" "F3D142_S208_L001_R2_001.fastq"
## [9] "F3D143_S209_L001_R1_001.fastq" "F3D143_S209_L001_R2_001.fastq"
## [11] "F3D144_S210_L001_R1_001.fastq" "F3D144_S210_L001_R2_001.fastq"
## [13] "F3D145_S211_L001_R1_001.fastq" "F3D145_S211_L001_R2_001.fastq"
## [15] "F3D146_S212_L001_R1_001.fastq" "F3D146_S212_L001_R2_001.fastq"
## [17] "F3D147_S213_L001_R1_001.fastq" "F3D147_S213_L001_R2_001.fastq"
## [19] "F3D148_S214_L001_R1_001.fastq" "F3D148_S214_L001_R2_001.fastq"
## [21] "F3D149_S215_L001_R1_001.fastq" "F3D149_S215_L001_R2_001.fastq"
## [23] "F3D150_S216_L001_R1_001.fastq" "F3D150_S216_L001_R2_001.fastq"
## [25] "F3D2_S190_L001_R1_001.fastq" "F3D2_S190_L001_R2_001.fastq"
## [27] "F3D3_S191_L001_R1_001.fastq" "F3D3_S191_L001_R2_001.fastq"
## [29] "F3D5_S193_L001_R1_001.fastq" "F3D5_S193_L001_R2_001.fastq"
## [31] "F3D6_S194_L001_R1_001.fastq" "F3D6_S194_L001_R2_001.fastq"
## [33] "F3D7_S195_L001_R1_001.fastq" "F3D7_S195_L001_R2_001.fastq"
## [35] "F3D8_S196_L001_R1_001.fastq" "F3D8_S196_L001_R2_001.fastq"
## [37] "F3D9_S197_L001_R1_001.fastq" "F3D9_S197_L001_R2_001.fastq"
## [39] "filtered" "HMP_MOCK.v35.fasta"
## [41] "Mock_S280_L001_R1_001.fastq" "Mock_S280_L001_R2_001.fastq"
## [43] "mouse.dpw.metadata" "mouse.time.design"
## [45] "silva_nr99_v138.1_train_set.fa" "silva_nr99_v138.1_train_set.fa.zip"
## [47] "stability.batch" "stability.files"
Chooing files with pattern
From the list of files, we only want to select .fastq
files. For that, I am going to make a list of files.
Since this file list is having forward and reverse reads per one sample set, we need to separate those files into two groups as well.
list.files(minsik_path, pattern = "_R1_001")
## [1] "F3D0_S188_L001_R1_001.fastq" "F3D1_S189_L001_R1_001.fastq"
## [3] "F3D141_S207_L001_R1_001.fastq" "F3D142_S208_L001_R1_001.fastq"
## [5] "F3D143_S209_L001_R1_001.fastq" "F3D144_S210_L001_R1_001.fastq"
## [7] "F3D145_S211_L001_R1_001.fastq" "F3D146_S212_L001_R1_001.fastq"
## [9] "F3D147_S213_L001_R1_001.fastq" "F3D148_S214_L001_R1_001.fastq"
## [11] "F3D149_S215_L001_R1_001.fastq" "F3D150_S216_L001_R1_001.fastq"
## [13] "F3D2_S190_L001_R1_001.fastq" "F3D3_S191_L001_R1_001.fastq"
## [15] "F3D5_S193_L001_R1_001.fastq" "F3D6_S194_L001_R1_001.fastq"
## [17] "F3D7_S195_L001_R1_001.fastq" "F3D8_S196_L001_R1_001.fastq"
## [19] "F3D9_S197_L001_R1_001.fastq" "Mock_S280_L001_R1_001.fastq"
Using pattern =
argument will make a subset of list of
files including the pattern noted in the quotation mark
""
.
With another argument, full.names = TRUE
, it will give
you the list of full-path-names of the files with the specified
pattern.
list.files(minsik_path, pattern = "_R1_001", full.names = TRUE)
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D0_S188_L001_R1_001.fastq"
## [2] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D1_S189_L001_R1_001.fastq"
## [3] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D141_S207_L001_R1_001.fastq"
## [4] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D142_S208_L001_R1_001.fastq"
## [5] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D143_S209_L001_R1_001.fastq"
## [6] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D144_S210_L001_R1_001.fastq"
## [7] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D145_S211_L001_R1_001.fastq"
## [8] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D146_S212_L001_R1_001.fastq"
## [9] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D147_S213_L001_R1_001.fastq"
## [10] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D148_S214_L001_R1_001.fastq"
## [11] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D149_S215_L001_R1_001.fastq"
## [12] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D150_S216_L001_R1_001.fastq"
## [13] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D2_S190_L001_R1_001.fastq"
## [14] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D3_S191_L001_R1_001.fastq"
## [15] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D5_S193_L001_R1_001.fastq"
## [16] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D6_S194_L001_R1_001.fastq"
## [17] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D7_S195_L001_R1_001.fastq"
## [18] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D8_S196_L001_R1_001.fastq"
## [19] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D9_S197_L001_R1_001.fastq"
## [20] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/Mock_S280_L001_R1_001.fastq"
So, now what?
We are going to store their list of names of files, so that some other function can use that information for loading that data at once.
forward_read_files <- list.files(minsik_path, pattern="_R1_001.fastq", full.names = TRUE)
reverse_read_files <- list.files(minsik_path, pattern="_R2_001.fastq", full.names = TRUE)
Meanwhile, they can be having different orders. So it would be helpful if we can sort the data right after loading it.
library(tidyverse)
forward_read_files <- list.files(minsik_path, pattern="_R1_001.fastq", full.names = TRUE) %>% sort()
reverse_read_files <- list.files(minsik_path, pattern="_R2_001.fastq", full.names = TRUE) %>% sort()
Manipulating characters
Plus, after processing the data, we need to get
1 sample name per 1 sample
. It can be extracted from either
one of the forward and reverse reads.
basename()
removes all the higher path of file path.
forward_read_files %>%
basename()
## [1] "F3D0_S188_L001_R1_001.fastq" "F3D1_S189_L001_R1_001.fastq"
## [3] "F3D141_S207_L001_R1_001.fastq" "F3D142_S208_L001_R1_001.fastq"
## [5] "F3D143_S209_L001_R1_001.fastq" "F3D144_S210_L001_R1_001.fastq"
## [7] "F3D145_S211_L001_R1_001.fastq" "F3D146_S212_L001_R1_001.fastq"
## [9] "F3D147_S213_L001_R1_001.fastq" "F3D148_S214_L001_R1_001.fastq"
## [11] "F3D149_S215_L001_R1_001.fastq" "F3D150_S216_L001_R1_001.fastq"
## [13] "F3D2_S190_L001_R1_001.fastq" "F3D3_S191_L001_R1_001.fastq"
## [15] "F3D5_S193_L001_R1_001.fastq" "F3D6_S194_L001_R1_001.fastq"
## [17] "F3D7_S195_L001_R1_001.fastq" "F3D8_S196_L001_R1_001.fastq"
## [19] "F3D9_S197_L001_R1_001.fastq" "Mock_S280_L001_R1_001.fastq"
str_split_fixed()
will separate all the charaters
separated by the pattern specified in quotation mark ""
and
generate a matrix of the separated character. The number of separated
output will be set as you set. (you can think of the import wizard in
Excel).
forward_read_files %>%
basename() %>%
str_split_fixed(pattern = "_", n = 5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "F3D0" "S188" "L001" "R1" "001.fastq"
## [2,] "F3D1" "S189" "L001" "R1" "001.fastq"
## [3,] "F3D141" "S207" "L001" "R1" "001.fastq"
## [4,] "F3D142" "S208" "L001" "R1" "001.fastq"
## [5,] "F3D143" "S209" "L001" "R1" "001.fastq"
## [6,] "F3D144" "S210" "L001" "R1" "001.fastq"
## [7,] "F3D145" "S211" "L001" "R1" "001.fastq"
## [8,] "F3D146" "S212" "L001" "R1" "001.fastq"
## [9,] "F3D147" "S213" "L001" "R1" "001.fastq"
## [10,] "F3D148" "S214" "L001" "R1" "001.fastq"
## [11,] "F3D149" "S215" "L001" "R1" "001.fastq"
## [12,] "F3D150" "S216" "L001" "R1" "001.fastq"
## [13,] "F3D2" "S190" "L001" "R1" "001.fastq"
## [14,] "F3D3" "S191" "L001" "R1" "001.fastq"
## [15,] "F3D5" "S193" "L001" "R1" "001.fastq"
## [16,] "F3D6" "S194" "L001" "R1" "001.fastq"
## [17,] "F3D7" "S195" "L001" "R1" "001.fastq"
## [18,] "F3D8" "S196" "L001" "R1" "001.fastq"
## [19,] "F3D9" "S197" "L001" "R1" "001.fastq"
## [20,] "Mock" "S280" "L001" "R1" "001.fastq"
if you choose 1st column from the data, using .[,1]
, it
wlll show the name of files after removing all the characters after the
first underscore _
.
forward_read_files %>%
basename() %>%
str_split_fixed("_", 5) %>%
.[,1]
## [1] "F3D0" "F3D1" "F3D141" "F3D142" "F3D143" "F3D144" "F3D145" "F3D146"
## [9] "F3D147" "F3D148" "F3D149" "F3D150" "F3D2" "F3D3" "F3D5" "F3D6"
## [17] "F3D7" "F3D8" "F3D9" "Mock"
I am going to store this list of file names for future use.
sample.names <- forward_read_files %>%
basename() %>%
str_split_fixed("_", 5) %>%
.[,1]
Plotting
DADA2 have a cool function called plotQualityProfile()
,
which will automatically plot QC scores by length of all the files that
were listed. Such as,
If I want to plot the QC profile of the f
forward_read_files[1]
## [1] "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/F3D0_S188_L001_R1_001.fastq"
plotQualityProfile(forward_read_files[1])
It will show the total reads, QC score, and the length of sequencing reads at the same time.
We can do this for multiple samples, by using
plotQualityProfile(
c(forward_read_files[1], forward_read_files[2])
)
or
plotQualityProfile(forward_read_files[1:2])
or
plotQualityProfile(forward_read_files[1:10])
Let’s check reverse reads as well.
plotQualityProfile(reverse_read_files[1:10])
As you can see, we have high quality forward reads but low quality reverse reads. These low quality reads need to be removed before mering these files into one complementary read file.
Filter and trim
Before making filtered reads, let’s set an location and names for them.
Using paste()
function, we can manipulate charater
variables like ties.
paste0(sample.names, "_sample")
## [1] "F3D0_sample" "F3D1_sample" "F3D141_sample" "F3D142_sample"
## [5] "F3D143_sample" "F3D144_sample" "F3D145_sample" "F3D146_sample"
## [9] "F3D147_sample" "F3D148_sample" "F3D149_sample" "F3D150_sample"
## [13] "F3D2_sample" "F3D3_sample" "F3D5_sample" "F3D6_sample"
## [17] "F3D7_sample" "F3D8_sample" "F3D9_sample" "Mock_sample"
Using paste0, we can create a list of new names at once.
# Place filtered files in filtered/ subdirectory
filtered_forward_reads <- file.path(minsik_path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtered_reverse_reads <- file.path(minsik_path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
We can also assign names of each element in that list, using
names()
function.
names(filtered_forward_reads) <- sample.names
names(filtered_reverse_reads) <- sample.names
Now, we are using the dada2 filtering function.
filterAndTrim()
have multiple options for filtering.
- the path of original file (forward)
- the path of new file (forward)
- the path of original file (reverse)
- the path of new file (reverse)
- truncLen: Reads after 160th bp showed lower QC scores. Lets remove them, from 160 to 250.
- maxEE: Maximum error after truncation
- compress=TRUE: The output files will be gzipped. 8 multithread=TRUE: Turn on this option if you are using Windows OS
The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores.
out <- filterAndTrim(forward_read_files, #name of forward raw reads
filtered_forward_reads, #name of filtered forward reads
reverse_read_files,
filtered_reverse_reads,
truncLen=c(250,160),# Reads after 160th bp showed lower QC scores. Lets remove them!
compress=TRUE,
#multithread=TRUE, # On Windows set multithread=FALSE
maxEE=c(2,2))
head(out)
## reads.in reads.out
## F3D0_S188_L001_R1_001.fastq 7793 7029
## F3D1_S189_L001_R1_001.fastq 5869 5216
## F3D141_S207_L001_R1_001.fastq 5958 5400
## F3D142_S208_L001_R1_001.fastq 3183 2897
## F3D143_S209_L001_R1_001.fastq 3178 2912
## F3D144_S210_L001_R1_001.fastq 4827 4262
You can see some reads were removed from the fastq file!
and they are now stored in a new directory.
list.files(path = "/Users/minsikkim/Dropbox (Personal)/Inha/5_Lectures/Advanced metagenomics/IBS7048_dataset/filtered/")
## [1] "F3D0_F_filt.fastq.gz" "F3D0_R_filt.fastq.gz" "F3D1_F_filt.fastq.gz"
## [4] "F3D1_R_filt.fastq.gz" "F3D141_F_filt.fastq.gz" "F3D141_R_filt.fastq.gz"
## [7] "F3D142_F_filt.fastq.gz" "F3D142_R_filt.fastq.gz" "F3D143_F_filt.fastq.gz"
## [10] "F3D143_R_filt.fastq.gz" "F3D144_F_filt.fastq.gz" "F3D144_R_filt.fastq.gz"
## [13] "F3D145_F_filt.fastq.gz" "F3D145_R_filt.fastq.gz" "F3D146_F_filt.fastq.gz"
## [16] "F3D146_R_filt.fastq.gz" "F3D147_F_filt.fastq.gz" "F3D147_R_filt.fastq.gz"
## [19] "F3D148_F_filt.fastq.gz" "F3D148_R_filt.fastq.gz" "F3D149_F_filt.fastq.gz"
## [22] "F3D149_R_filt.fastq.gz" "F3D150_F_filt.fastq.gz" "F3D150_R_filt.fastq.gz"
## [25] "F3D2_F_filt.fastq.gz" "F3D2_R_filt.fastq.gz" "F3D3_F_filt.fastq.gz"
## [28] "F3D3_R_filt.fastq.gz" "F3D5_F_filt.fastq.gz" "F3D5_R_filt.fastq.gz"
## [31] "F3D6_F_filt.fastq.gz" "F3D6_R_filt.fastq.gz" "F3D7_F_filt.fastq.gz"
## [34] "F3D7_R_filt.fastq.gz" "F3D8_F_filt.fastq.gz" "F3D8_R_filt.fastq.gz"
## [37] "F3D9_F_filt.fastq.gz" "F3D9_R_filt.fastq.gz" "Mock_F_filt.fastq.gz"
## [40] "Mock_R_filt.fastq.gz"
Error prediction
The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
error_F <- learnErrors(filtered_forward_reads
#, multithread=TRUE #
)
## 34523750 total bases in 138095 reads from 20 samples will be used for learning the error rates.
error_R <- learnErrors(filtered_reverse_reads
#, multithread=TRUE
)
## 22095200 total bases in 138095 reads from 20 samples will be used for learning the error rates.
We can double-chekc when the error is ocurring in our sequencing
file, using plotErrors()
function.
plotErrors(error_F, nominalQ=TRUE)
Using this predicted error
(which can be also called as
expected error
) in adjusting the actuall error happedn in
our data set. In other words, errors that is higher than the expectation
will be strongly filtered.
Use dada()
function to remove errors and get
unique sequences!
forward_dada <- dada(filtered_forward_reads, err=error_F)
## Sample 1 - 7029 reads in 2157 unique sequences.
## Sample 2 - 5216 reads in 1755 unique sequences.
## Sample 3 - 5400 reads in 1584 unique sequences.
## Sample 4 - 2897 reads in 976 unique sequences.
## Sample 5 - 2912 reads in 1004 unique sequences.
## Sample 6 - 4262 reads in 1362 unique sequences.
## Sample 7 - 6656 reads in 1872 unique sequences.
## Sample 8 - 4499 reads in 1547 unique sequences.
## Sample 9 - 15469 reads in 3893 unique sequences.
## Sample 10 - 11270 reads in 2989 unique sequences.
## Sample 11 - 11875 reads in 3276 unique sequences.
## Sample 12 - 4985 reads in 1684 unique sequences.
## Sample 13 - 17906 reads in 4052 unique sequences.
## Sample 14 - 6184 reads in 1595 unique sequences.
## Sample 15 - 4007 reads in 1305 unique sequences.
## Sample 16 - 7292 reads in 1996 unique sequences.
## Sample 17 - 4724 reads in 1263 unique sequences.
## Sample 18 - 4817 reads in 1495 unique sequences.
## Sample 19 - 6452 reads in 1871 unique sequences.
## Sample 20 - 4243 reads in 937 unique sequences.
#multithread=TRUE for windows
reverse_dada <- dada(filtered_reverse_reads, err=error_R)
## Sample 1 - 7029 reads in 1622 unique sequences.
## Sample 2 - 5216 reads in 1311 unique sequences.
## Sample 3 - 5400 reads in 1311 unique sequences.
## Sample 4 - 2897 reads in 843 unique sequences.
## Sample 5 - 2912 reads in 866 unique sequences.
## Sample 6 - 4262 reads in 1265 unique sequences.
## Sample 7 - 6656 reads in 1765 unique sequences.
## Sample 8 - 4499 reads in 1234 unique sequences.
## Sample 9 - 15469 reads in 3355 unique sequences.
## Sample 10 - 11270 reads in 2462 unique sequences.
## Sample 11 - 11875 reads in 2704 unique sequences.
## Sample 12 - 4985 reads in 1389 unique sequences.
## Sample 13 - 17906 reads in 3215 unique sequences.
## Sample 14 - 6184 reads in 1365 unique sequences.
## Sample 15 - 4007 reads in 1108 unique sequences.
## Sample 16 - 7292 reads in 1598 unique sequences.
## Sample 17 - 4724 reads in 1062 unique sequences.
## Sample 18 - 4817 reads in 1135 unique sequences.
## Sample 19 - 6452 reads in 1472 unique sequences.
## Sample 20 - 4243 reads in 707 unique sequences.
#multithread=TRUE for windows
forward_dada[[1]]
## dada-class: object describing DADA2 denoising results
## 124 sequence variants were inferred from 2157 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
Now, we have sequencing variants
after considering
errors.
Mering paired reads
Unique sequences can be merged into single, complementary reads
(forward + reverse) using mergePairs()
function.
mergers <- mergePairs(forward_dada,
filtered_forward_reads,
reverse_dada,
filtered_reverse_reads,
verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
## sequence
## 1 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGATCAAGTCAGCGGTAAAATTGAGAGGCTCAACCTCTTCGAGCCGTTGAAACTGGTTTTCTTGAGTGAGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCTCAACTGACGCTCATGCACGAAAGTGTGGGTATCGAACAGG
## 2 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGCCAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTACAGCCGTTGAAACTGCCGGGCTCGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAGG
## 3 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTGTTAAGTCAGCGGTCAAATGTCGGGGCTCAACCCCGGCCTGCCGTTGAAACTGGCGGCCTCGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCGACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACAGG
## 4 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTTTTAAGTCAGCGGTAAAAATTCGGGGCTCAACCCCGTCCGGCCGTTGAAACTGGGGGCCTTGAGTGGGCGAGAAGAAGGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCCTTCCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGGGGATCGAACAGG
## 5 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGACTCTCAAGTCAGCGGTCAAATCGCGGGGCTCAACCCCGTTCCGCCGTTGAAACTGGGAGCCTTGAGTGCGCGAGAAGTAGGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCCTACCGGCGCGCAACTGACGCTCATGCACGAAAGCGTGGGTATCGAACAGG
## 6 TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGATGCCAAGTCAGCGGTAAAAAAGCGGTGCTCAACGCCGTCGAGCCGTTGAAACTGGCGTTCTTGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACTCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGCGTGGGTATCGAACAGG
## abundance forward reverse nmatch nmismatch nindel prefer accept
## 1 577 1 1 158 0 0 1 TRUE
## 2 467 2 2 158 0 0 2 TRUE
## 3 444 3 4 158 0 0 1 TRUE
## 4 427 4 3 158 0 0 2 TRUE
## 5 345 5 6 158 0 0 1 TRUE
## 6 282 6 5 158 0 0 2 TRUE
Sequence table
Table of sequences
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
## [1] 20 291
20 fiels, 291 unique sequencing reads
Length of merged sequences
table(nchar(getSequences(seqtab)))
##
## 251 252 253 254 255
## 1 93 189 6 2
Remove chimeras
The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences.
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
dim(seqtab.nochim)
## [1] 20 225
291-225 = 66 reads were chimeras.
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9612192
4% of merged readsa were chimeras.
Track reads through the pipeline
As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:
getN <- function(x) sum(getUniques(x))
track <- cbind(out,
sapply(forward_dada, getN),
sapply(reverse_dada, getN),
sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
## input filtered denoisedF denoisedR merged nonchim
## F3D0 7793 7029 6867 6891 6446 6424
## F3D1 5869 5216 5079 5166 4862 4851
## F3D141 5958 5400 5252 5295 4926 4788
## F3D142 3183 2897 2768 2814 2569 2495
## F3D143 3178 2912 2776 2838 2487 2453
## F3D144 4827 4262 4122 4180 3640 3440
Assign taxonomy
It is common at this point, especially in 16S/18S/ITS amplicon sequencing, to assign taxonomy to the sequence variants. The DADA2 package provides a native implementation of the naive Bayesian classifier method for this purpose. The assignTaxonomy function takes as input a set of sequences to be classified and a training set of reference sequences with known taxonomy, and outputs taxonomic assignments with at least minBoot bootstrap confidence.
We maintain formatted training fastas for the RDP training set, GreenGenes clustered at 97% identity, and the Silva reference database, and additional trainings fastas suitable for protists and certain specific environments have been contributed. For fungal taxonomy, the General Fasta release files from the UNITE ITS database can be used as is. To follow along, download the silva_nr_v132_train_set.fa.gz file, and place it in the directory with the fastq files.
taxa <- assignTaxonomy(seqtab.nochim,
paste0(minsik_path, "/silva_nr99_v138.1_train_set.fa"),
multithread=TRUE)
view(taxa)
Read count information (to compare their abundances between taxa within sample)
view(seqtab.nochim)
Bibliography
## Computing. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>. We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also 'citation("pkgname")' for citing R packages.
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
## J, reikoch, Beasley W, O'Connor B, Warnes GR, Quinn M, Kamvar ZN (2023). yaml: Methods to Convert R Data to YAML and Back_. R package version 2.3.7, <https://CRAN.R-project.org/package=yaml>. ATTENTION: This citation information has been auto-generated from the package DESCRIPTION file and may need manual editing, see 'help("citation")'.