Load libraries for anlyisis

try http:// if https:// URLs are not supported

Randomization housekeeping

set.seed(100)

Visually check the quality of the reads

plotQualityProfile(fnFs)

plotQualityProfile(fnFs[1])

Everything looks good for forward read quality. The first few bp will be trimmed off during primer removal and the reads will be truncated to remove the low quality reads at the end. I had to play with the truncation length to preserve reads while maintaining good quality and taxonomic fit.

Create new files for filtered reads and then trim and filter

filt_path <- file.path(miseq_path, "filtered")
if(!file_test("-d",filt_path)) dir.create(filt_path)
filtFs <- file.path(filt_path, paste0(sampleNames, "_filt.fastq.gz"))
out <- filterAndTrim(fnFs, filtFs, 
                     maxLen = 600, #Removes reads greater than 600bp, since the whole ITS region should be this long
                     maxN=0, #After truncation, sequences with more than 0 Ns will be discarded
                     maxEE=3, #reads with higher than 30% 'expected errors' will be discareded EE=sum(10^(-Q/10))
                     truncQ=2, #Truncate reads at the first instance of a quality score less than or equal to 2
                     compress=TRUE, #makes it into a .zip file
                     trimLeft=24, #Removes the primers from the forward reads
                     truncLen = 200) #Cuts the reads to the desired length, discards all smaller reads
head(out) #preview headers for filtFs
                 reads.in reads.out
Faye01_049.fastq    47558     13085
Faye02_050.fastq    61547     11467
Faye03_051.fastq   144848     48407
Faye04_052.fastq   111871     25153
Faye05_053.fastq    46491      6615
Faye06_054.fastq   144300     31242

View plot of post-filtered sequence quality

plotQualityProfile(filtFs)

plotQualityProfile(filtFs[1])

The quality and reads look great. We dont have that dip in quality at the end after truncation and it appears that the primers have been sucessfully removed. Onwards!

Dereplication

Combining identical reads into ‘unique sequences’ with a corresponding ‘abundance’

drepFs <- derepFastq(filtFs, verbose=TRUE)
Dereplicating sequence entries in Fastq file: C:/Users/faysmith/Desktop/seq/filtered/Faye01_filt.fastq.gz
Encountered 3573 unique sequences from 13085 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/faysmith/Desktop/seq/filtered/Faye02_filt.fastq.gz
Encountered 3672 unique sequences from 11467 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/faysmith/Desktop/seq/filtered/Faye03_filt.fastq.gz
Encountered 20463 unique sequences from 48407 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/faysmith/Desktop/seq/filtered/Faye04_filt.fastq.gz
Encountered 10300 unique sequences from 25153 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/faysmith/Desktop/seq/filtered/Faye05_filt.fastq.gz
Encountered 3535 unique sequences from 6615 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/faysmith/Desktop/seq/filtered/Faye06_filt.fastq.gz
Encountered 11500 unique sequences from 31242 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/faysmith/Desktop/seq/filtered/Faye07_filt.fastq.gz
Encountered 3360 unique sequences from 14306 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/faysmith/Desktop/seq/filtered/Faye08_filt.fastq.gz
Encountered 2059 unique sequences from 5515 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/faysmith/Desktop/seq/filtered/Faye09_filt.fastq.gz
Encountered 2933 unique sequences from 4894 total sequences read.
Dereplicating sequence entries in Fastq file: C:/Users/faysmith/Desktop/seq/filtered/Faye10_filt.fastq.gz
Encountered 4384 unique sequences from 8144 total sequences read.
names(drepFs) <- sampleNames

This is a long list where we see how many unique sequences are in each sample. It looks like we just about halved each of the total sequences by clustering (dont worry, the associated abundance of each cluster is still kept for future analyses.)

Learn the error rates

Uses a statistical test for the notion that a specific sequence was seen too many times to have been caused by amplicon errors. Overly-abundant sequences are used as the seeds of new clusters. It uses estimated error rates that are inferred from the data. Error rates are leaned by alternating between sample inference and error rate estimation until convergence. After applying the learned error rates and then plotting:

errF <- learnErrors(filtFs, nreads=1e8)
Initializing error rates to maximum possible estimate.
Sample 1 - 13085 reads in 3573 unique sequences.
Sample 2 - 11467 reads in 3672 unique sequences.
Sample 3 - 48407 reads in 20463 unique sequences.
Sample 4 - 25153 reads in 10300 unique sequences.
Sample 5 - 6615 reads in 3535 unique sequences.
Sample 6 - 31242 reads in 11500 unique sequences.
Sample 7 - 14306 reads in 3360 unique sequences.
Sample 8 - 5515 reads in 2059 unique sequences.
Sample 9 - 4894 reads in 2933 unique sequences.
Sample 10 - 8144 reads in 4384 unique sequences.
   selfConsist step 2 
NA

Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after the learnError function converges. The red line is the error rates expected under the nominal definition of the Q-value. The black line with estimated rates fits well.

Infer the sequence variaents in each sample

Apply the sequence-barient inference algorithm to the dereplicated data

The algorithm inferred 30 real variants from the 3573 unique sequences in the first sample. The dada function removes sequencing error to reveal the members of the sequenced community.

Construct sequence table, a OTU table

fraction of chimeric sequences detected

Only around 3% chimeras in the de-replicated and filtered dataset - removed for further analysis

Make a data frame

Assign taxonomy

Assign taxonomy using local BLAST against the UNITE general FASTA release for Fungi

Seems to have worked well. All sequences have been named to at least the kingdom. I was getting much stranger results before.

Assign a phylogenetic tree to the amplicon sequences

Final data manipulation to run diversity analyses with

Create a file with all the treatment data included

Create the phyloseq object for futher analysis

save the phyloseq object for further anlysis

---
title: "Process sequences data for metabarcoding"
output: html_notebook
---


---

#Load libraries for anlyisis 
## try http:// if https:// URLs are not supported

```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
# source("https://bioconductor.org/biocLite.R")
# biocLite()
# 
# biocLite("phangorn")
# biocLite("knitr")
# biocLite("BiocStyle")
# biocLite("DECIPHER")
# biocLite("dada2")

library(knitr)
library(BiocStyle)
library(DECIPHER)
library(survival)
library(phyloseq)
library(dada2)
library(phangorn)
library(msa)
```

Randomization housekeeping
```{r}
set.seed(100)
```

#Navigate to files and list file names

```{r}
miseq_path <- "C:/Users/faysmith/Desktop/seq"
fnFs <- list.files(miseq_path, pattern=".fastq")
sampleNames <- sapply(strsplit(fnFs, "_"), '[', 1)

fnFs <- file.path(miseq_path, fnFs)
```

#Visually check the quality of the reads

```{r}
plotQualityProfile(fnFs)
plotQualityProfile(fnFs[1])

```
Everything looks good for forward read quality. The first few bp will be trimmed off during primer removal and the reads will be truncated to remove the low quality reads at the end. I had to play with the truncation length to preserve reads while maintaining good quality and taxonomic fit. 

##Create new files for filtered reads and then trim and filter

```{r}
filt_path <- file.path(miseq_path, "filtered")
if(!file_test("-d",filt_path)) dir.create(filt_path)
filtFs <- file.path(filt_path, paste0(sampleNames, "_filt.fastq.gz"))

out <- filterAndTrim(fnFs, filtFs, 
                     maxLen = 600, #Removes reads greater than 600bp, since the whole ITS region should be this long
                     maxN=0, #After truncation, sequences with more than 0 Ns will be discarded
                     maxEE=3, #reads with higher than 30% 'expected errors' will be discareded EE=sum(10^(-Q/10))
                     truncQ=2, #Truncate reads at the first instance of a quality score less than or equal to 2
                     compress=TRUE, #makes it into a .zip file
                     trimLeft=24, #Removes the primers from the forward reads
                     truncLen = 200) #Cuts the reads to the desired length, discards all smaller reads

head(out) #preview headers for filtFs
```


#View plot of post-filtered sequence quality
```{r}
plotQualityProfile(filtFs)
plotQualityProfile(filtFs[1])
```

The quality and reads look great. We dont have that dip in quality at the end after truncation and it appears that the primers have been sucessfully removed. Onwards!

#Dereplication
Combining identical reads into 'unique sequences' with a corresponding 'abundance'

```{r}
drepFs <- derepFastq(filtFs, verbose=TRUE)
names(drepFs) <- sampleNames
```
This is a long list where we see how many unique sequences are in each sample. It looks like we just about halved each of the total sequences by clustering (dont worry, the associated abundance of each cluster is still kept for future analyses.)

#Learn the error rates
Uses a statistical test for the notion that a specific sequence was seen too many times to have been caused by amplicon errors. Overly-abundant sequences are used as the seeds of new clusters. It uses estimated error rates that are inferred from the data. Error rates are leaned by alternating between sample inference and error rate estimation until convergence. 
After applying the learned error rates and then plotting:


```{r message=FALSE, warning=FALSE}
errF <- learnErrors(filtFs, nreads=1e8)
plotErrors(errF, nominalQ=TRUE)

```

Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after the learnError function converges. The red line is the error rates expected under the nominal definition of the Q-value. The black line with estimated rates fits well. 

#Infer the sequence variaents in each sample
Apply the sequence-barient inference algorithm to the dereplicated data
```{r}
dadaFs <- dada(drepFs, err=errF)
dadaFs[[1]]
```
The algorithm inferred 30 real variants from the 3573 unique sequences in the first sample. 
The dada function removes sequencing error to reveal the members of the sequenced community. 


#Construct sequence table, a OTU table
```{r}
seqtab <- makeSequenceTable(dadaFs)
```

#fraction of chimeric sequences detected
```{r message=FALSE, warning=FALSE}
seqtab.nochimera <- removeBimeraDenovo(seqtab)
chimeras = 1-(sum(seqtab.nochimera)/sum(seqtab))
chimeras
```
Only around 3% chimeras in the de-replicated and filtered dataset - removed for further analysis

#Make a data frame
```{r}
df_seq = as.data.frame(seqtab.nochimera)
```

#Assign taxonomy 
Assign taxonomy using local BLAST against the UNITE general FASTA release for Fungi

```{r}
taxa <- assignTaxonomy(seqtab.nochimera, "C:/Users/faysmith/Desktop/seq/sh_general_release_dynamic_01.12.2017.fasta")
taxa.print <- taxa #removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)
```
Seems to have worked well. All sequences have been named to at least the kingdom. I was getting much stranger results before.

#Assign a phylogenetic tree to the amplicon sequences
```{r}
seqs <- getSequences(seqtab.nochimera)
names(seqs) <- seqs
mult <- msa(seqs, method="ClustalW", type="dna", order="input")

phang.align <- as.phyDat(mult, type="DNA", names=getSequence(seqtab))
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm) # Note, tip order != sequence order
fit = pml(treeNJ, data=phang.align)

## negative edges length changed to 0!

fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
                    rearrangement = "stochastic", control = pml.control(trace = 0))
detach("package:phangorn", unload=TRUE)

```


##Final data manipulation to run diversity analyses with
#Create a file with all the treatment data included
```{r message=FALSE, warning=FALSE}
samples.out <- rownames(seqtab.nochimera)
sample <- sapply(strsplit(samples.out, "D"), `[`, 1)
sam_rep <- c("1","1","2","2","1","1","2","2","1","1")
depth <- c("1","2","1","2","1","2","1","2","1","2")
area <- c(rep("Mound",4),rep("Depression",4),rep("Pasture",2))
samdf <- data.frame(SampleID=sample, Depth=depth, Area=area, Replication = sam_rep)
rownames(samdf) <- samples.out

```

##Create the phyloseq object for futher analysis
```{r message=FALSE, warning=FALSE}
ps <- phyloseq(otu_table(seqtab.nochimera, taxa_are_rows=FALSE),
               sample_data(samdf),
               tax_table(taxa)
               ,phy_tree(fitGTR$tree)
               )
ps

```

#save the phyloseq object for further anlysis
```{r message=FALSE, warning=FALSE}
saveRDS(ps, "C:/Users/faysmith/Desktop/seq/ps.rds")

```

