In this script we are going to run the dada2 pipeline on 10 example samples. These samples are taken from the following publication: https://www.frontiersin.org/articles/10.3389/frmbi.2022.1041188/full

We can also simply follow the official dada2 tutorial at https://benjjneb.github.io/dada2/tutorial.html

Load the packages

First of all, as before, we are going to load the libraries we need. For this lesson a single package is required:

  1. DADA2, to run the pipeline

The package should already be installed from the previous exercises; if not, please go back to the previous lessons and follow the installation instructions for this package.
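
For reference, dada2 is distributed through Bioconductor, so a minimal (re)installation, in case the package is missing, looks like this:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("dada2")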

library(dada2)
## Loading required package: Rcpp

Set a working directory

CAREFUL! Change the folder path to YOURS!!! Do not use MINE, it will most likely not work on your machine!!!

setwd("/Users/gabri/Desktop/lesson_microbiome/") ### insert the working directory where you want to save yor data

Now we should have the files we need in that folder, and we can follow the dada2 tutorial linked above to go on with the analyses.

DADA2 pipeline

From now on, we are going to run the code of the dada2 pipeline. First we need to define the path where the files are stored after unzipping.

path <- "/Users/gabri/Desktop/lesson_microbiome/example_files" # CHANGE ME to the directory containing the fastq files after unzipping.
list.files(path)
##  [1] "control_1_R1.fastq.gz"  "control_1_R2.fastq.gz"  "control_2_R1.fastq.gz" 
##  [4] "control_2_R2.fastq.gz"  "control_3_R1.fastq.gz"  "control_3_R2.fastq.gz" 
##  [7] "control_4_R1.fastq.gz"  "control_4_R2.fastq.gz"  "control_5_R1.fastq.gz" 
## [10] "control_5_R2.fastq.gz"  "high_fat_1_R1.fastq.gz" "high_fat_1_R2.fastq.gz"
## [13] "high_fat_2_R1.fastq.gz" "high_fat_2_R2.fastq.gz" "high_fat_3_R1.fastq.gz"
## [16] "high_fat_3_R2.fastq.gz" "high_fat_4_R1.fastq.gz" "high_fat_4_R2.fastq.gz"
## [19] "high_fat_5_R1.fastq.gz" "high_fat_5_R2.fastq.gz"

We can see that we have 20 files. That is because we are working with 10 samples sequenced with a paired-end approach, which means we have R1 (forward) and R2 (reverse) reads for each sample.

Continuing with the tutorial, we create sorted lists of the forward and reverse read files and extract the sample names.

# Forward and reverse fastq filenames 
fnFs <- sort(list.files(path, pattern="_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_R1.fastq.gz / SAMPLENAME_R2.fastq.gz
sample.names <- sapply(strsplit(basename(fnFs), "_R"), `[`, 1)
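
Before moving on, it is worth verifying that the sample names were extracted as expected and that forward and reverse files pair up:

# Quick sanity check on the file lists and the extracted names
length(fnFs) == length(fnRs) # should be TRUE: 10 forward/reverse pairs
sample.names # "control_1" ... "control_5", "high_fat_1" ... "high_fat_5"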

Inspect read quality profile

plotQualityProfile(fnFs[1:2]) #forward reads
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the dada2 package.
##   Please report the issue at <https://github.com/benjjneb/dada2/issues>.

plotQualityProfile(fnRs[1:2]) #reverse reads

The sequencing quality is overall good; the reverse reads have a slightly lower quality, which is normal in Illumina sequencing. We are going to use the standard settings to quality-filter our reads: truncate both forward and reverse reads at 145 bp, remove the first 10 bases on the left, and discard read pairs with more than 2 expected errors (maxEE).

BE CAREFUL: on Windows machines set multithread=FALSE.

# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names
# Trim forwards at 145 and reverses at 145 based on quality plots; also trim the first 10 bp as recommended
# After this completes, check "out" to see if filtering was too stringent
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(145,145),
                     maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE, trimLeft = 10) # On Windows set multithread=FALSE
## Creating output directory: /Users/gabri/Desktop/lesson_microbiome/example_files/filtered

We can now check, for each sample, how many reads were filtered out. Note that dada2 always filters reads in pairs: a read is kept only if both its forward and reverse mate pass the filter.

out
##                        reads.in reads.out
## control_1_R1.fastq.gz    259472    249588
## control_2_R1.fastq.gz    167484    161954
## control_3_R1.fastq.gz    199819    192921
## control_4_R1.fastq.gz    212683    203153
## control_5_R1.fastq.gz    190181    180723
## high_fat_1_R1.fastq.gz   107190    104261
## high_fat_2_R1.fastq.gz   324409    314283
## high_fat_3_R1.fastq.gz   158848    153177
## high_fat_4_R1.fastq.gz   244527    235295
## high_fat_5_R1.fastq.gz    96212     93607
(out[,1]-out[,2])/out[,1]*100 # percentage of reads removed: ca. 3-5% per sample, very good!
##  control_1_R1.fastq.gz  control_2_R1.fastq.gz  control_3_R1.fastq.gz 
##               3.809274               3.301808               3.452124 
##  control_4_R1.fastq.gz  control_5_R1.fastq.gz high_fat_1_R1.fastq.gz 
##               4.480847               4.973157               2.732531 
## high_fat_2_R1.fastq.gz high_fat_3_R1.fastq.gz high_fat_4_R1.fastq.gz 
##               3.121368               3.570080               3.775452 
## high_fat_5_R1.fastq.gz 
##               2.707562

We filtered out ca 4% of the reads, which is a very good result: it means the reads have high quality.
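
If filtering had removed a much larger fraction, we would revisit the truncation and maxEE parameters. As a sketch (the 20% threshold here is an arbitrary choice, adjust it to your data), problematic samples could be flagged like this:

# Flag samples that lost more than 20% of their reads during filtering
pct.removed <- (out[,1] - out[,2]) / out[,1] * 100
rownames(out)[pct.removed > 20] # empty here, since every sample loses < 5%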

Denoising

We now proceed with denoising. First we need to learn the error rates: dada2 learns its error model from the data itself, alternating between error-rate estimation and sample inference until the two converge.

errF <- learnErrors(filtFs, multithread=TRUE)
## 109028160 total bases in 807616 reads from 4 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread=TRUE)
## 109028160 total bases in 807616 reads from 4 samples will be used for learning the error rates.
plotErrors(errF, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis
## Transformation introduced infinite values in continuous y-axis

plotErrors(errR, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis

The error model is good: the estimated error rates (black line) fit the observed points well and drop with increasing quality score, broadly tracking the nominal expectation (red line). We can use this error model to denoise our samples.

dadaFs <- dada(filtFs, err=errF, multithread=TRUE)
## Sample 1 - 249588 reads in 45618 unique sequences.
## Sample 2 - 161954 reads in 32134 unique sequences.
## Sample 3 - 192921 reads in 36021 unique sequences.
## Sample 4 - 203153 reads in 42080 unique sequences.
## Sample 5 - 180723 reads in 39932 unique sequences.
## Sample 6 - 104261 reads in 21243 unique sequences.
## Sample 7 - 314283 reads in 50504 unique sequences.
## Sample 8 - 153177 reads in 32678 unique sequences.
## Sample 9 - 235295 reads in 46262 unique sequences.
## Sample 10 - 93607 reads in 19028 unique sequences.
dadaRs <- dada(filtRs, err=errR, multithread=TRUE)
## Sample 1 - 249588 reads in 74036 unique sequences.
## Sample 2 - 161954 reads in 49791 unique sequences.
## Sample 3 - 192921 reads in 46880 unique sequences.
## Sample 4 - 203153 reads in 58700 unique sequences.
## Sample 5 - 180723 reads in 52178 unique sequences.
## Sample 6 - 104261 reads in 26688 unique sequences.
## Sample 7 - 314283 reads in 73160 unique sequences.
## Sample 8 - 153177 reads in 49368 unique sequences.
## Sample 9 - 235295 reads in 66271 unique sequences.
## Sample 10 - 93607 reads in 29168 unique sequences.
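
It can be useful to inspect the returned dada-class objects before merging; printing one reports how many amplicon sequence variants (ASVs) were inferred for that sample:

# Summary of the denoising result for the first sample
dadaFs[[1]] # reports the number of sequence variants inferred from 45618 unique sequences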

Merge paired reads, remove chimeras

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
## 238764 paired-reads (in 551 unique pairings) successfully merged out of 246552 (in 4095 pairings) input.
## 153407 paired-reads (in 539 unique pairings) successfully merged out of 159538 (in 3510 pairings) input.
## 184918 paired-reads (in 535 unique pairings) successfully merged out of 190150 (in 2977 pairings) input.
## 190043 paired-reads (in 572 unique pairings) successfully merged out of 199107 (in 4535 pairings) input.
## 168996 paired-reads (in 527 unique pairings) successfully merged out of 176895 (in 4282 pairings) input.
## 98260 paired-reads (in 509 unique pairings) successfully merged out of 102446 (in 2664 pairings) input.
## 301568 paired-reads (in 568 unique pairings) successfully merged out of 310901 (in 4681 pairings) input.
## 141681 paired-reads (in 682 unique pairings) successfully merged out of 149711 (in 4624 pairings) input.
## 221755 paired-reads (in 611 unique pairings) successfully merged out of 231525 (in 5410 pairings) input.
## 87080 paired-reads (in 511 unique pairings) successfully merged out of 91679 (in 2905 pairings) input.
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
## Identified 1791 bimeras out of 2608 input sequences.
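
More than half of the unique sequences were flagged as chimeric, but these are mostly rare variants; the more meaningful number is the fraction of reads (not sequences) that survives, which should be close to 1:

dim(seqtab.nochim) # 10 samples x 817 ASVs (2608 - 1791 bimeras)
sum(seqtab.nochim)/sum(seqtab) # fraction of reads kept after chimera removal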

Assign taxonomy

We run two functions to assign taxonomy: the first uses the naive Bayesian (RDP) classifier to assign taxonomy down to the genus level; the second adds species-level assignments by exact matching of the ASVs against a reference database.

# The SILVA reference files must be present in the working directory (downloaded beforehand)
taxa <- assignTaxonomy(seqtab.nochim, "silva_genus.fa.gz", multithread=TRUE)
taxa <- addSpecies(taxa, "silva_species.fa.gz")
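
To inspect the assignments, it helps to make a display-only copy without the full ASV sequences as row names:

taxa.print <- taxa # copy for display only
rownames(taxa.print) <- NULL # drop the long ASV sequences used as row names
head(taxa.print)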

Pipeline statistics

Here we are going to see how many reads are retained after each step.

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
##             input filtered denoisedF denoisedR merged nonchim
## control_1  259472   249588    248215    247693 238764  236131
## control_2  167484   161954    160880    160446 153407  150009
## control_3  199819   192921    191428    191413 184918  180093
## control_4  212683   203153    200895    201108 190043  186988
## control_5  190181   180723    178652    178628 168996  165807
## high_fat_1 107190   104261    103320    103298  98260   93403

Save the results

We are going to save the results in two formats: tab-separated text files, which you can open with any program (even Excel), and RDS files, R's native format for storing objects.

### Save all tables as tab-separated text
write.table(taxa, "taxa_table_silva.txt", sep = "\t", quote = FALSE)
write.table(seqtab.nochim, "asv_table.txt", sep = "\t", quote = FALSE)
write.table(track, "sequence_pipeline_stats.txt", sep = "\t", quote = FALSE)
# Save tables in RDS format
saveRDS(taxa, "taxa_table_silva.RDS")
saveRDS(seqtab.nochim, "asv_table.RDS")
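
In a later session, for example when moving on to downstream analyses, the RDS files can be loaded back unchanged:

# Reload the saved objects in a new R session
taxa <- readRDS("taxa_table_silva.RDS")
seqtab.nochim <- readRDS("asv_table.RDS")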