load packages

library(knitr)
library(BiocManager)
library(ShortRead)
library("dada2")
library(dplyr)
library(janitor)
library(vegan)
library(data.table)
library(ggplot2)
library(ggpubr)
library(dplyr)
library(tidyverse)
library(DT)
library(RgoogleMaps)
library(mapplots)

End Product

My primary end product is a table of species designations for ASVs and read numbers for each species in each sample. This table can be used as the basis for any statistical analyses or plots to be made in the future.

visualize read table

final_table <- read.csv("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/read_table.csv")
datatable(final_table, options = list(scrollX = T,scrollY=T,pageLength = 10,
  lengthMenu = c(10, 20,50,100)))

Project Summary

The report is located here: https://rpubs.com/HailaSchultz/full-pipeline-report


This project utilizes metabarcoding with the LrCOI marker to taxonomically identify zooplankton samples from Washington Ocean Acidification Center cruises. Samples were collected from seven stations spread throughout Puget Sound in April, July, and September of 2018-2020. Zooplankton were collected using a 200 μm net towed vertically from 10 m above the bottom of the water column to the surface. A 5 ml subsample (settled volume) of each sample was homogenized, dried, and subsampled before being sent to Ohio State University for library prep and sequencing.

Map of the seven sites visited

This workflow is adapted from the following pipeline: https://benjjneb.github.io/dada2/ITS_workflow.html

Download Data

This is step 1 of my project workflow. In this step, I am organizing my fastq files and checking their hash values. I downloaded my data from a shared Google Drive folder onto my computer and then uploaded it into RStudio. Originally, the fastq files were located in multiple layers of directories.

Install the ShortRead and dada2 packages using BiocManager

install.packages("BiocManager")
BiocManager::install("ShortRead")
BiocManager::install("dada2")

Move files into one folder

move multiple files from multiple layers of directories into the data directory

cd ../data
find . -name '*.gz' -exec mv {} ../data/fastq-files/ \;

create checksums file for all fastq files in data folder

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/
shasum ../data/fastq-files/*.gz > fastq_checksums.sha
head fastq_checksums.sha
## 086979b069d473f74906329a83328e67d19d5d77  ../data/fastq-files/2018APRP12Ve-COI-48samples_S145_L001_R1_001.fastq.gz
## 7df8c51ee4e5d8108057236af4932183f6829080  ../data/fastq-files/2018APRP12Ve-COI-48samples_S145_L001_R2_001.fastq.gz
## c3faf8dcb35ceb847f155886ada5e013032d3725  ../data/fastq-files/2018APRP12VeA-COI-P1_S33_L001_R1_001.fastq.gz
## 48466cba94926b3f54c6f51c0e133dd77f6529fd  ../data/fastq-files/2018APRP12VeA-COI-P1_S33_L001_R2_001.fastq.gz
## 6d369cbae9ce9959ca063ab7f82467fb3962b973  ../data/fastq-files/2018APRP22Ve-COI-48samples_S146_L001_R1_001.fastq.gz
## 7b171b3594a341a2ab6367be2281711b62ebadbc  ../data/fastq-files/2018APRP22Ve-COI-48samples_S146_L001_R2_001.fastq.gz
## 9304abcd1c332881d215efe8e0138a4dccbe14c5  ../data/fastq-files/2018APRP22VeA-COI-P1_S45_L001_R1_001.fastq.gz
## d4d782733bbf8596ea9f4258b7e5c4312022a372  ../data/fastq-files/2018APRP22VeA-COI-P1_S45_L001_R2_001.fastq.gz
## c17f693c662e0322a1ad0bbac58c94d46d710333  ../data/fastq-files/2018APRP28Ve-COI-48samples_S147_L001_R1_001.fastq.gz
## 65bff1db647d30419d3b30fb1707b0c4d82d5202  ../data/fastq-files/2018APRP28Ve-COI-48samples_S147_L001_R2_001.fastq.gz

check if files match original

cd ../data
shasum -c fastq_checksums.sha > keys.txt
head keys.txt
## ../data/fastq-files/2018APRP12Ve-COI-48samples_S145_L001_R1_001.fastq.gz: OK
## ../data/fastq-files/2018APRP12Ve-COI-48samples_S145_L001_R2_001.fastq.gz: OK
## ../data/fastq-files/2018APRP12VeA-COI-P1_S33_L001_R1_001.fastq.gz: OK
## ../data/fastq-files/2018APRP12VeA-COI-P1_S33_L001_R2_001.fastq.gz: OK
## ../data/fastq-files/2018APRP22Ve-COI-48samples_S146_L001_R1_001.fastq.gz: OK
## ../data/fastq-files/2018APRP22Ve-COI-48samples_S146_L001_R2_001.fastq.gz: OK
## ../data/fastq-files/2018APRP22VeA-COI-P1_S45_L001_R1_001.fastq.gz: OK
## ../data/fastq-files/2018APRP22VeA-COI-P1_S45_L001_R2_001.fastq.gz: OK
## ../data/fastq-files/2018APRP28Ve-COI-48samples_S147_L001_R1_001.fastq.gz: OK
## ../data/fastq-files/2018APRP28Ve-COI-48samples_S147_L001_R2_001.fastq.gz: OK

Remove Primers with Cutadapt

To get the cutadapt software, I downloaded the miniconda installer using curl

#download miniconda from url
cd ../software
curl -O https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-MacOSX-arm64.sh

I then moved to the directory the installer was in by typing cd software in the terminal, and typed bash Miniconda3-py310_23.3.1-0-MacOSX-arm64.sh to start the setup process. Conda is installed in /Users/hailaschultz/miniconda3.

I checked that the installation worked by typing conda list in the terminal. I then configured the channels by running the following in the terminal, one at a time:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

I then created the environment with conda create -n cutadaptenv cutadapt

On my personal computer, I had to use CONDA_SUBDIR=osx-64 conda create -n cutadaptenv cutadapt

I ran conda init bash

To activate the conda environment, in the terminal I typed conda activate cutadaptenv

Prepare data

First, I wanted to check the quality of the raw data to compare with future steps in the pipeline.

download fastqc

curl -O https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.7.zip

use fastqc

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/FastQC/fastqc *fastq.gz -o /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/raw-read-qc

make multiqc report

In the terminal, I ran conda install multiqc to install multiqc.

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/raw-read-qc

multiqc .

screenshot of multiqc mean quality scores

I then made lists of the file names in order to process multiple files at once. I also specified the primer sequences used in this project.

#set file path
path <- "../data/fastq-files"
#make matched list of forward and reverse reads
fnFs <- sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))

#specify primer sequences
FWD <- "GGWACWGGWTGAACWGTWTAYCCYCC"  
REV <- "TANACYTCNGGRTGNCCRAARAAYCA"  
#in the reverse primer, I replaced the inosine (I) bases with N

get orientations of primers

allOrients <- function(primer) {
    # Create all orientations of the input sequence
    require(Biostrings)
    dna <- DNAString(primer)  # Biostrings works with DNAString objects rather than character vectors
    orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna), 
        RevComp = reverseComplement(dna))
    return(sapply(orients, toString))  # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
##                      Forward                   Complement 
## "GGWACWGGWTGAACWGTWTAYCCYCC" "CCWTGWCCWACTTGWCAWATRGGRGG" 
##                      Reverse                      RevComp 
## "CCYCCYATWTGWCAAGTWGGWCAWGG" "GGRGGRTAWACWGTTCAWCCWGTWCC"

I then checked whether the primers were actually present in my files. The following code counts the number of times the primers occur in the first sample.

primerHits <- function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs[[1]]), 
    FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs[[1]]), 
    REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs[[1]]), 
    REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs[[1]]))
##                  Forward Complement Reverse RevComp
## FWD.ForwardReads   63428          0       0       0
## FWD.ReverseReads       0          0       0       0
## REV.ForwardReads       0          0       0       0
## REV.ReverseReads   60704          0       0       0

It looks like the forward and reverse primers are present in the forward and reverse reads.

Use Cutadapt

Tell R the path to the cutadapt software

cutadapt <- "/Users/hailaschultz/miniconda3/envs/cutadaptenv/bin/cutadapt" 
#verify that R knows where to find the software by asking for the version
system2(cutadapt, args = "--version") 

Tell cutadapt what to cut

path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))

FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)

Run Cutadapt

for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
                             fnFs[i], fnRs[i])) # input files
}

Check if the primers were actually trimmed

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]), 
    FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]), 
    REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]), 
    REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
##                  Forward Complement Reverse RevComp
## FWD.ForwardReads       0          0       0       0
## FWD.ReverseReads       0          0       0       0
## REV.ForwardReads       0          0       0       0
## REV.ReverseReads       5          0       0       0

There were a few reverse reads that were not trimmed, but overall it looks good. However, after trimming the primers, about half of the files ended up with zero reads. I think it is an issue with the read names. The error I get is: ERROR: Error in sequence file at unknown line: Reads are improperly paired. Read name ‘M02815:67:000000000-KDDDG:1:1101:17448:1214 1:N:0:CGGAGCCT+GTAAGGAG’ in file 1 does not match ‘M02815:67:000000000-KDDDG:1:1101:20493:1232 2:N:0:CGTACTAG+CGTCTAAT’ in file 2, etc.
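One way to track down which pairs are out of sync (a minimal sketch using ShortRead, not part of the original run) is to compare the read counts in each raw R1/R2 pair:

#compare read counts in each raw R1/R2 pair; pairs with unequal counts are
#candidates for the improper-pairing error (sketch only, not run here)
n_fwd <- sapply(fnFs, function(f) ShortRead::countFastq(f)$records)
n_rev <- sapply(fnRs, function(f) ShortRead::countFastq(f)$records)
data.frame(file = basename(fnFs), fwd = n_fwd, rev = n_rev)[n_fwd != n_rev, ]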

Delete the files that didn’t make it through cutadapt (smaller than 1 MB) from the cutadapt directory

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt
find . -type f -name "*.gz" -size -1M -delete

QC check on files after primers were trimmed

use fastqc

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/FastQC/fastqc *fastq.gz -o /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/after-cutadapt-qc

make multiqc report to check files after trimming primers

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/after-cutadapt-qc

multiqc .

screenshot of multiqc mean quality scores after cutadapt

Filter Reads

Inspect read quality

Get the sample names for the files that have been trimmed

# Forward and reverse fastq filenames have the format:
cutFs <- sort(list.files(path.cut, pattern = "_R1_001.fastq.gz", full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "_R2_001.fastq.gz", full.names = TRUE))

# Extract sample names, assuming filenames have format:
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][1]
sample.names <- unname(sapply(cutFs, get.sample.name))
head(sample.names)
## [1] "2018APRP12Ve-COI-48samples" "2018APRP12VeA-COI-P1"      
## [3] "2018APRP22Ve-COI-48samples" "2018APRP22VeA-COI-P1"      
## [5] "2018APRP28Ve-COI-48samples" "2018APRP28VeA-COI-P1"

plot forward read quality for first two files

plotQualityProfile(cutFs[1:2])

plot reverse read quality for first two files

plotQualityProfile(cutRs[1:2])

The forward reads are better quality, as we might expect.

Designate file names and locations for the filtered files

filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))

Before filtering, verify that all of your files have a matching pair. Both the forward and reverse reads for each sample are required for this step.

# get sample names to see which files match
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][1]
sample.namesF <- unname(sapply(cutFs, get.sample.name))
sample.namesR <- unname(sapply(cutRs, get.sample.name))

Check for differences between lists and remove files that don’t have a matching pair

difsR <- setdiff(sample.namesR,sample.namesF)
difsR
## character(0)
difsF <- setdiff(sample.namesF,sample.namesR)
difsF
## character(0)

It looks like there are no differences, so we can proceed
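If there had been unmatched files, a minimal sketch of how they could be dropped (keeping only samples present in both lists):

#keep only file pairs whose sample name appears in both lists
#(sketch; only needed if difsF or difsR is non-empty)
keep <- intersect(sample.namesF, sample.namesR)
cutFs <- cutFs[sample.namesF %in% keep]
cutRs <- cutRs[sample.namesR %in% keep]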

filter reads using quality thresholds

maxN: allow zero Ns
maxEE: set to 5 for both forward and reverse reads
truncQ: set to 2
minLen: minimum length of 100 bp
rm.phix: remove phiX reads

Note that this step takes a long time.

out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(5, 5), 
    truncQ = 2, minLen = 100, rm.phix = TRUE, compress = TRUE, multithread = TRUE) 

Check how many samples went through filtering

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt

ls | wc -l

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt/filtered

ls | wc -l
##       97
##       96

learn the error rates

forward read error rates

errF <- learnErrors(filtFs, multithread=TRUE)
## 118001220 total bases in 432610 reads from 8 samples will be used for learning the error rates.

reverse read error rates

errR <- learnErrors(filtRs, multithread=TRUE)
## 118551091 total bases in 432610 reads from 8 samples will be used for learning the error rates.

plot the errors

plotErrors(errF, nominalQ=TRUE)

#looks okay! good to proceed
plotErrors(errR, nominalQ=TRUE)

#looks okay! good to proceed

QC check on files after filtering

use fastqc

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt/filtered
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/FastQC/fastqc *fastq.gz -o /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/after-filtering-qc

make multiqc report to check files after filtering

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/after-filtering-qc

multiqc .

screenshot of multiqc mean quality scores after filtering

Pair Forward and Reverse Reads

Dereplication

This step takes a little while

derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names

Apply core sample inference algorithm

dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
## Sample 1 - 53380 reads in 17256 unique sequences.
## Sample 2 - 71371 reads in 23411 unique sequences.
## Sample 3 - 39755 reads in 17518 unique sequences.
## Sample 4 - 57973 reads in 24827 unique sequences.
## Sample 5 - 17803 reads in 9422 unique sequences.
## Sample 6 - 43005 reads in 17527 unique sequences.
## Sample 7 - 73572 reads in 24346 unique sequences.
## Sample 8 - 75751 reads in 25518 unique sequences.
## Sample 9 - 98056 reads in 27273 unique sequences.
## Sample 10 - 76191 reads in 26240 unique sequences.
## Sample 11 - 111350 reads in 38178 unique sequences.
## Sample 12 - 19702 reads in 9230 unique sequences.
## Sample 13 - 43954 reads in 16795 unique sequences.
## Sample 14 - 55062 reads in 20106 unique sequences.
## Sample 15 - 75738 reads in 27839 unique sequences.
## Sample 16 - 50585 reads in 20748 unique sequences.
## Sample 17 - 31666 reads in 15173 unique sequences.
## Sample 18 - 54877 reads in 19944 unique sequences.
## Sample 19 - 51989 reads in 19995 unique sequences.
## Sample 20 - 60471 reads in 20567 unique sequences.
## Sample 21 - 59368 reads in 15549 unique sequences.
## Sample 22 - 66452 reads in 20503 unique sequences.
## Sample 23 - 47925 reads in 18563 unique sequences.
## Sample 24 - 60167 reads in 21126 unique sequences.
## Sample 25 - 67683 reads in 24503 unique sequences.
## Sample 26 - 50051 reads in 17031 unique sequences.
## Sample 27 - 81573 reads in 30100 unique sequences.
## Sample 28 - 29981 reads in 10539 unique sequences.
## Sample 29 - 28029 reads in 11744 unique sequences.
## Sample 30 - 34155 reads in 13168 unique sequences.
## Sample 31 - 30155 reads in 13727 unique sequences.
## Sample 32 - 13794 reads in 5809 unique sequences.
## Sample 33 - 24459 reads in 9974 unique sequences.
## Sample 34 - 56790 reads in 15822 unique sequences.
## Sample 35 - 81513 reads in 36376 unique sequences.
## Sample 36 - 95936 reads in 33051 unique sequences.
## Sample 37 - 61090 reads in 24304 unique sequences.
## Sample 38 - 44362 reads in 13688 unique sequences.
## Sample 39 - 57544 reads in 18866 unique sequences.
## Sample 40 - 53499 reads in 16883 unique sequences.
## Sample 41 - 31676 reads in 10532 unique sequences.
## Sample 42 - 43393 reads in 13493 unique sequences.
## Sample 43 - 71928 reads in 22277 unique sequences.
## Sample 44 - 23397 reads in 7994 unique sequences.
## Sample 45 - 25502 reads in 8731 unique sequences.
## Sample 46 - 75714 reads in 21533 unique sequences.
## Sample 47 - 74916 reads in 21520 unique sequences.
## Sample 48 - 60854 reads in 24756 unique sequences.
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
## Sample 1 - 53380 reads in 32871 unique sequences.
## Sample 2 - 71371 reads in 45076 unique sequences.
## Sample 3 - 39755 reads in 24242 unique sequences.
## Sample 4 - 57973 reads in 36323 unique sequences.
## Sample 5 - 17803 reads in 13096 unique sequences.
## Sample 6 - 43005 reads in 26214 unique sequences.
## Sample 7 - 73572 reads in 36970 unique sequences.
## Sample 8 - 75751 reads in 34450 unique sequences.
## Sample 9 - 98056 reads in 53913 unique sequences.
## Sample 10 - 76191 reads in 46465 unique sequences.
## Sample 11 - 111350 reads in 53411 unique sequences.
## Sample 12 - 19702 reads in 11521 unique sequences.
## Sample 13 - 43954 reads in 24136 unique sequences.
## Sample 14 - 55062 reads in 39868 unique sequences.
## Sample 15 - 75738 reads in 49706 unique sequences.
## Sample 16 - 50585 reads in 36477 unique sequences.
## Sample 17 - 31666 reads in 23794 unique sequences.
## Sample 18 - 54877 reads in 38824 unique sequences.
## Sample 19 - 51989 reads in 34222 unique sequences.
## Sample 20 - 60471 reads in 42821 unique sequences.
## Sample 21 - 59368 reads in 24020 unique sequences.
## Sample 22 - 66452 reads in 35367 unique sequences.
## Sample 23 - 47925 reads in 30913 unique sequences.
## Sample 24 - 60167 reads in 41059 unique sequences.
## Sample 25 - 67683 reads in 46762 unique sequences.
## Sample 26 - 50051 reads in 34700 unique sequences.
## Sample 27 - 81573 reads in 51233 unique sequences.
## Sample 28 - 29981 reads in 16666 unique sequences.
## Sample 29 - 28029 reads in 19035 unique sequences.
## Sample 30 - 34155 reads in 23534 unique sequences.
## Sample 31 - 30155 reads in 19035 unique sequences.
## Sample 32 - 13794 reads in 7944 unique sequences.
## Sample 33 - 24459 reads in 12034 unique sequences.
## Sample 34 - 56790 reads in 30034 unique sequences.
## Sample 35 - 81513 reads in 43030 unique sequences.
## Sample 36 - 95936 reads in 52721 unique sequences.
## Sample 37 - 61090 reads in 34848 unique sequences.
## Sample 38 - 44362 reads in 23177 unique sequences.
## Sample 39 - 57544 reads in 29463 unique sequences.
## Sample 40 - 53499 reads in 24299 unique sequences.
## Sample 41 - 31676 reads in 15476 unique sequences.
## Sample 42 - 43393 reads in 25007 unique sequences.
## Sample 43 - 71928 reads in 37035 unique sequences.
## Sample 44 - 23397 reads in 11245 unique sequences.
## Sample 45 - 25502 reads in 11927 unique sequences.
## Sample 46 - 75714 reads in 35334 unique sequences.
## Sample 47 - 74916 reads in 35689 unique sequences.
## Sample 48 - 60854 reads in 45922 unique sequences.

Merge Paired Reads

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
##                                                                                                                                                                                                                                                                                                                    sequence
## 1 TCTCGCAAGCAATCTTGCCCACGCTGGCGCCAGCGTGGACCTCACTATTTTCTCACTTCACTTAGCAGGCGTTTCCTCAATTCTGGGAGCAATTAATTTTATTACTACTATTATCAATATGAAGCCCCCTGCAATCTCACAATACCAGACACCCCTCTTTGTTTGATCCGTACTTATTACAGCTGTTCTTCTCCTACTCTCCCTGCCCGTCTTAGCCGCCGGCATCACAATACTACTAACTGACCGAAACCTCAACACCTCCTTCTTTGACCCCGCCGGAGGGGGAGACCCCATCCTATACCAGCACTTATTC
## 2 TTTATCTGCAGGAATTGCACATGCTGGGGCTTCTGTAGATATAGGAATCTTTTCACTACATATTGCTGGAGCCTCTTCTATTTTAGGAGCAGTTAATTTCATTACAACTGTAATTAATATGCGATCAGCAGGAATAACTATAGACCGTATCCCCTTATTCGTATGGTCTGTGTTTATTACAGCAATTTTACTTTTATTGTCTCTACCAGTTTTAGCTGGAGCAATCACAATACTTTTAACAGACCGTAATTTAAATACATCTTTCTTCGACCCTGCCGGGGGTGGTGACCCTATTTTATATCAACATCTATTC
## 3 TCTATCAGGTGCTCAAACTCATTCTGGAGGTTCTGTAGATATGGCTATTTTTAGTTTACATTGTGCAGGTGCTTCATCAATTATGGGTGCTATAAATTTTATTACCACTATTTTTAATATGAGAGCTCCTGGATTAACAATGGATAAATTACCATTATTTGTATGATCAGTTTTAATTACTGCTTTTCTATTATTATTATCTCTACCTGTACTAGCAGGTGCAATAACAATGTTATTAACAGACAGAAATTTCAATACAACATTTTTTGATCCAGCAGGTGGTGGTGATCCTGTTTTATATCAACATTTATTC
## 4 CTTATCAAGAAATATCGCCCATGCTGGCGCGTCTGTAGATTTTGCAATTTTTTCTCTCCATTTAGCAGGAGTAAGCTCGATCTTAGGCGCCGTGAATTTTATTAGAACCTTAGGAAATCTGCGAGTATTTGGAATAATTCTAGATCGTATGCCCCTGTTCGCATGGGCAGTTCTAATTACGGCTGTACTATTACTCCTCTCCTTGCCAGTTTTGGCAGGGGCTATCACAATACTTCTCACTGACCGTAACTTAAATACCACATTTTACGATGTGGGGGGGGGTGGAGATCCTATTCTATATCAACATTTATTT
## 5 ACTAAGCTCGATTCAGGCTCATTCCGGAGGTTCGGTGGATATGGCTATATTTAGTTTACATTTGGCAGGGGCTTCTTCTATCATGGGTGCTATTAATTTTATTACTACCATTCTGAACATGAGAGCCCCTGGAATGACAATGGATAAGATACCCCTATTTGTTTGGTCTGTATTAGTTACTGCAATATTATTATTGTTATCCTTGCCTGTTTTGGCTGGAGCAATAACCATGTTACTAACCGACAGGAACTTTAATACATCTTTTTTCGATCCTGCTGGGGGAGGAGACCCTATTTTGTTTCAACATCTTTTT
## 6 CTTATCAAGAAATATCGCCCATGCTGGCGCGTCTGTAGATTTTGCAATTTTTTCTCTCCATTTAGCAGGAGTAAGCTCGATCTTAGGCGCCGTGAATTTTATTAGAACCTTAGGAAATCTGCGAGTATTTGGAATAATTCTAGATCGTATGCCCCTGTTCGCATGGGCAGTTCTAATTACGGCTGTACTATTACTCCTCTCCTTGCCAGTTTTGGCAGGGGCTATCACAATACTTCTCACTGACCGTAACTTAAATACCACATTTTACGATGTGGGGGGGGGTGGAGATCCTATCCTATACCAACATTTATTT
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1     17450       1       2    234         0      0      1   TRUE
## 2     11522       2       1    233         0      0      1   TRUE
## 3      3288       3       3    235         0      0      1   TRUE
## 4      2661       4       7    235         0      0      1   TRUE
## 5      1627       5       5    233         0      0      1   TRUE
## 6      1291       4      11    235         0      0      1   TRUE

Construct sequence table

seqtab <- makeSequenceTable(mergers)

Inspect the sequence length distribution (a histogram sketch follows the table below)

table(nchar(getSequences(seqtab)))
## 
##  151  206  215  234  253  255  265  268  269  283  286  288  291  294  295  296 
##    1    1    1    1    1    1    1    1    1    3    1    1    1    1    3    1 
##  297  298  300  301  302  303  304  306  307  308  309  310  311  312  313  314 
##    2    7    3    1    1    7    1    1    2    6    2  258   17   44 1526   17 
##  315  316  317  318  321  329  352 
##    7   29    6    1    1    1    1
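A minimal sketch of the histogram mentioned above, using base R graphics:

#histogram of merged sequence lengths (sketch of the plot mentioned above)
hist(nchar(getSequences(seqtab)), breaks = 50,
     xlab = "Sequence length (bp)", main = "ASV length distribution")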

Remove chimeras

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)

evaluate what percentage of reads were chimeric

sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9818287

Chimeras account for a very low percentage of reads.
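It can also be useful to see how many ASVs, as opposed to reads, were flagged as chimeric; a quick check using the objects above:

#number of ASVs removed as chimeras, and number retained (quick check)
ncol(seqtab) - ncol(seqtab.nochim)
ncol(seqtab.nochim)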

track how many reads made it through the pipeline

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))

colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)

No step removed a majority of the reads. The step that creates the out table takes too long to run, so a screenshot of this table is included below.

screenshot of tracking table
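One way to avoid re-running that slow step every time the document is knit (a sketch, with a hypothetical output path) would be to cache the out object:

#save the filterAndTrim output once, then reload it in later knits
#instead of re-running filterAndTrim (hypothetical path in this project)
saveRDS(out, "../output/filterAndTrim_out.rds")
out <- readRDS("../output/filterAndTrim_out.rds")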

Blast

Download software from NCBI

#change directory
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software
#download software
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.14.0+-x64-macosx.tar.gz

unzip software

cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software
tar -xzf ncbi-blast-2.14.0+-x64-macosx.tar.gz

Check that it’s working. I had to go into my computer settings and give permission to use blast because it is from an “unidentified developer”.

/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/ncbi-blast-2.14.0+/bin/blastx -h
## USAGE
##   blastx [-h] [-help] [-import_search_strategy filename]
##     [-export_search_strategy filename] [-task task_name] [-db database_name]
##     [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
##     [-negative_gilist filename] [-negative_seqidlist filename]
##     [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
##     [-negative_taxidlist filename] [-ipglist filename]
##     [-negative_ipglist filename] [-entrez_query entrez_query]
##     [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
##     [-subject subject_input_file] [-subject_loc range] [-query input_file]
##     [-out output_file] [-evalue evalue] [-word_size int_value]
##     [-gapopen open_penalty] [-gapextend extend_penalty]
##     [-qcov_hsp_perc float_value] [-max_hsps int_value]
##     [-xdrop_ungap float_value] [-xdrop_gap float_value]
##     [-xdrop_gap_final float_value] [-searchsp int_value]
##     [-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
##     [-soft_masking soft_masking] [-matrix matrix_name]
##     [-threshold float_value] [-culling_limit int_value]
##     [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
##     [-subject_besthit] [-window_size int_value] [-ungapped] [-lcase_masking]
##     [-query_loc range] [-strand strand] [-parse_deflines]
##     [-query_gencode int_value] [-outfmt format] [-show_gis]
##     [-num_descriptions int_value] [-num_alignments int_value]
##     [-line_length line_length] [-html] [-sorthits sort_hits]
##     [-sorthsps sort_hsps] [-max_target_seqs num_sequences]
##     [-num_threads int_value] [-mt_mode int_value] [-remote]
##     [-comp_based_stats compo] [-use_sw_tback] [-version]
## 
## DESCRIPTION
##    Translated Query-Protein Subject BLAST 2.14.0+
## 
## Use '-help' to print detailed descriptions of command line arguments

Make blast database

Download the reference database. The reference used here is from MetaZooGene and covers all marine fauna and flora of the world oceans.

#download from url
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data
curl -O https://www.st.nmfs.noaa.gov/copepod/collaboration/metazoogene/atlas/data-src/MZGfasta-coi__MZGdbALL__o00__A.fasta.gz
#unzip
gunzip -k MZGfasta-coi__MZGdbALL__o00__A.fasta.gz

Make the database. Since we are evaluating nucleotide sequences, make sure you are using -dbtype nucl.

/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/ncbi-blast-2.14.0+/bin/makeblastdb \
-in /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/MZGfasta-coi__MZGdbALL__o00__A.fasta \
-dbtype nucl \
-out /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/blastdb/MZGfasta-coi__MZGdbALL__o00__A

Write output to a fasta file that can be blasted

transpose table

seqtab.nochim_trans <- as.data.frame(t(seqtab.nochim)) %>% rownames_to_column(var = "sequence") %>% 
    rowid_to_column(var = "OTUNumber") %>% mutate(OTUNumber = sprintf("otu%04d", 
    OTUNumber)) %>% mutate(sequence = str_replace_all(sequence, "(-|\\.)", ""))

convert to fasta file

df <- seqtab.nochim_trans
seq_out <- Biostrings::DNAStringSet(df$sequence)

# The taxonomy columns (Supergroup, Division, etc.) do not exist yet, so str_c
# drops them and the fasta headers end up as just the OTU numbers
names(seq_out) <- str_c(df$OTUNumber, df$Supergroup, df$Division, df$Class, 
    df$Order, df$Family, df$Genus, df$Species, sep = "|")

Biostrings::writeXStringSet(seq_out, str_c( "Zoop_ASV.fasta"), compress = FALSE, 
    width = 20000)

#I had to move this file from my working directory to my data directory
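That move could also be done from R (a sketch; the destination is the data directory used elsewhere in this project):

#move the fasta from the working directory to the data directory (sketch)
file.rename("Zoop_ASV.fasta",
            "/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/Zoop_ASV.fasta")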

examine fasta file

head /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/Zoop_ASV.fasta
## >otu0001
## TTTATCTGCAGGAATTGCACATGCTGGGGCTTCTGTAGATATAGGAATCTTTTCACTACATATTGCTGGAGCCTCTTCTATTTTAGGAGCAGTTAATTTCATTACAACTGTAATTAATATGCGATCAGCAGGAATAACTATAGACCGTATCCCCTTATTCGTATGGTCTGTGTTTATTACAGCAATTTTACTTTTATTGTCTCTACCAGTTTTAGCTGGAGCAATCACAATACTTTTAACAGACCGTAATTTAAATACATCTTTCTTCGACCCTGCCGGGGGTGGTGACCCTATTTTATATCAACATCTATTC
## >otu0002
## CTTATCAAGAAATATCGCCCATGCTGGCGCGTCTGTAGATTTTGCAATTTTTTCTCTCCATTTAGCAGGAGTAAGCTCGATCTTAGGCGCCGTGAATTTTATTAGAACCTTAGGAAATCTGCGAGTATTTGGAATAATTCTAGATCGTATGCCCCTGTTCGCATGGGCAGTTCTAATTACGGCTGTACTATTACTCCTCTCCTTGCCAGTTTTGGCAGGGGCTATCACAATACTTCTCACTGACCGTAACTTAAATACCACATTTTACGATGTGGGGGGGGGTGGAGATCCTATTCTATATCAACATTTATTT
## >otu0003
## CTTATCAAGAAATATCGCCCATGCTGGCGCGTCTGTAGATTTTGCAATTTTTTCTCTCCATTTAGCAGGAGTAAGCTCGATCTTAGGCGCCGTGAATTTTATTAGAACCTTAGGAAATCTGCGAGTATTTGGAATAATTCTAGATCGTATGCCCCTGTTCGCATGGGCAGTTCTAATTACGGCTGTACTATTACTCCTCTCCTTGCCAGTTTTGGCAGGGGCTATCACAATACTTCTCACTGACCGTAACTTAAATACCACATTTTACGATGTGGGGGGGGGTGGAGATCCTATCCTATACCAACATTTATTT
## >otu0004
## CTTATCAAGAAATATCGCCCATGCTGGCGCGTCTGTAGATTTTGCAATTTTTTCTCTCCATTTAGCAGGAGTAAGCTCGATCTTAGGCGCCGTGAATTTTATTAGAACCCTAGGAAATCTGCGAGTATTTGGAATAATTCTAGATCGTATGCCCCTGTTCGCATGGGCAGTTCTAATTACGGCTGTACTATTACTCCTCTCCTTGCCAGTTTTGGCAGGGGCTATCACAATACTTCTCACTGACCGTAACTTAAATACCACATTTTACGATGTGGGGGGGGGCGGAGATCCTATTCTATATCAACATTTATTT
## >otu0005
## CTTATCAGGTTCACAAACCCATTCAGGAGGATCAGTAGATATGGCTATATTTAGTTTACATTGTGCTGGTGCATCTTCTATTATGGGAGCAATAAATTTTATTACCACAATTTTCAATATGAGAGCCCCAGGATTAACTTTAGATAAATTACCATTATTCGTATGATCTGTATTAATCACTGCTTTCTTATTATTATTATCTCTACCTGTGTTAGCCGGTGCTATAACAATGTTGTTAACCGACAGAAATTTCAATACTACATTTTTTGATCCTGCAGGAGGTGGAGATCCTGTATTATACCAACATTTATTT

Run Blast

Since we are looking at nucleotides, make sure you are using blastn.

/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/ncbi-blast-2.14.0+/bin/blastn \
-query /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/Zoop_ASV.fasta \
-db /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/blastdb/MZGfasta-coi__MZGdbALL__o00__A \
-out /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/Zoop_ASV.tab \
-evalue 1E-20 \
-num_threads 8 \
-max_target_seqs 1 \
-outfmt 6

Examine blast output

head -2 ../output/Zoop_ASV.tab
wc -l ../output/Zoop_ASV.tab
## otu0001  QHAK892-21__Aegina_citrea   100.000 313 0   0   1   313 346 658 3.64e-164   579
## otu0002  CAISN473-13__Calanus_pacificus  100.000 313 0   0   1   313 345 657 3.64e-164   579
##     1849 ../output/Zoop_ASV.tab

These are species we expect to see - looks like blast worked!

Visualization

Filter Blast hits and merge tables

merge Blast IDs with OTU table

#change directory
setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output")

#read ASV table into R
Zoop_ASV<-read.table("Zoop_ASV.tab")

#rename columns
colnames(Zoop_ASV) = c("OTUNumber", "Species", "pident", "length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore")

#merge tables by otu number
read_table<-left_join(Zoop_ASV, seqtab.nochim_trans, by = join_by("OTUNumber" == "OTUNumber"))
datatable(read_table, fillContainer = T)

Visualize part of the table here

Filter hits

# remove ASVs with alignment length shorter than 300 bp
read_table_sub <- subset(read_table, read_table$length>300) 

# remove ASVs with percent identity below 98%
read_table_sub2 <- subset(read_table_sub, read_table_sub$pident>98) 

Sum Sequences by Taxa

#remove unneeded columns
read_table_summed<- read_table_sub2[ -c(1,3:13) ]

#remove prefix
read_table_summed$Species <- sub(".*__", "", read_table_summed$Species)

#summarize by species
by_species <- read_table_summed %>%
  group_by(Species)

read_table_species<-by_species %>%
  summarise_all(sum)

datatable(read_table_species, fillContainer = T)

write read table to .csv

write.csv(read_table_species, "/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/read_table.csv", row.names=FALSE)

NMDS

# Transpose data
read_table_transposed <- t(read_table_species)
datatable(read_table_transposed, fillContainer = T)

Prep table for NMDS

# get row and colnames in order
rownames(read_table_transposed) <- colnames(read_table_species)
#move first row to column names
read_table_final<-row_to_names(read_table_transposed, 1, remove_rows_above = FALSE)
#convert to matrix
community_matrix<-as.matrix(read_table_final)
#convert to numeric matrix
community_matrix<- matrix(as.numeric(community_matrix),   
                  ncol = ncol(community_matrix))
#convert to proportions
community_matrix<-community_matrix/rowSums(community_matrix)
#arcsine sqrt transformation
community_matrix_sqrt<-asin(sqrt(community_matrix))

Run NMDS

#run NMDS
NMDS=metaMDS(community_matrix_sqrt,distance="bray",trymax=100)
## Run 0 stress 0.1899421 
## Run 1 stress 0.1960882 
## Run 2 stress 0.1892941 
## ... New best solution
## ... Procrustes: rmse 0.06439918  max resid 0.3391543 
## Run 3 stress 0.1893151 
## ... Procrustes: rmse 0.001927401  max resid 0.01091176 
## Run 4 stress 0.1952756 
## Run 5 stress 0.1994433 
## Run 6 stress 0.1960878 
## Run 7 stress 0.2094945 
## Run 8 stress 0.1959399 
## Run 9 stress 0.2056893 
## Run 10 stress 0.1892941 
## ... New best solution
## ... Procrustes: rmse 1.719537e-05  max resid 5.140785e-05 
## ... Similar to previous best
## Run 11 stress 0.1832883 
## ... New best solution
## ... Procrustes: rmse 0.03756408  max resid 0.240882 
## Run 12 stress 0.2072553 
## Run 13 stress 0.2012439 
## Run 14 stress 0.1893149 
## Run 15 stress 0.1832884 
## ... Procrustes: rmse 2.77944e-05  max resid 0.0001292051 
## ... Similar to previous best
## Run 16 stress 0.1892941 
## Run 17 stress 0.2026736 
## Run 18 stress 0.2052709 
## Run 19 stress 0.1899421 
## Run 20 stress 0.1959399 
## *** Best solution repeated 1 times
stressplot(NMDS)

plot(NMDS)

NMDS
## 
## Call:
## metaMDS(comm = community_matrix_sqrt, distance = "bray", trymax = 100) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     community_matrix_sqrt 
## Distance: bray 
## 
## Dimensions: 2 
## Stress:     0.1832883 
## Stress type 1, weak ties
## Best solution was repeated 1 time in 20 tries
## The best solution was from try 11 (random start)
## Scaling: centring, PC rotation, halfchange scaling 
## Species: expanded scores based on 'community_matrix_sqrt'

Make environmental table

#export file names
env <- as.data.frame(row.names(read_table_final))
#change column name
colnames(env)[colnames(env) == "row.names(read_table_final)"] <- "file"

#create year column
env$year <- substr(env$file, 1, 4)
#create month column
env$month <- substr(env$file, 5, 7)
#create station column
env$station <- str_extract(env$file, "P12|P22|P28|P38|P402|P4|P8")
#create run column
env$run <- str_extract(env$file, "48samples|-P1|-P2")

extract scores

data.scores = as.data.frame(scores(NMDS)$sites)
#add environmental columns
data.scores$station = env$station
data.scores$year = env$year
data.scores$month = env$month
data.scores$run = env$run

make hulls

P12 <- data.scores[data.scores$station == "P12", ][chull(data.scores[data.scores$station == 
    "P12", c("NMDS1", "NMDS2")]), ]  
P22 <- data.scores[data.scores$station == "P22", ][chull(data.scores[data.scores$station == 
    "P22", c("NMDS1", "NMDS2")]), ] 
P28 <- data.scores[data.scores$station == "P28", ][chull(data.scores[data.scores$station == 
    "P28", c("NMDS1", "NMDS2")]), ]  
P38 <- data.scores[data.scores$station == "P38", ][chull(data.scores[data.scores$station == 
    "P38", c("NMDS1", "NMDS2")]), ]  
P402 <- data.scores[data.scores$station == "P402", ][chull(data.scores[data.scores$station == 
    "P402", c("NMDS1", "NMDS2")]), ]  
P4 <- data.scores[data.scores$station == "P4", ][chull(data.scores[data.scores$station == 
   "P4", c("NMDS1", "NMDS2")]), ]  
P8 <- data.scores[data.scores$station == "P8", ][chull(data.scores[data.scores$station == 
   "P8", c("NMDS1", "NMDS2")]), ]  
hull.data <- rbind(P12,P22,P28,P38,P402,P4,P8)  
hull.data
##           NMDS1       NMDS2 station year month       run
## 2  -0.112186590 -0.20038331     P12 2018   APR       -P1
## 1  -0.004621414 -0.46531672     P12 2018   APR 48samples
## 34 -0.067219247 -0.48243337     P12 2019   APR 48samples
## 28 -0.648794815 -0.25970112     P12 2018   SEP       -P2
## 15 -0.729772635  0.19024808     P12 2018   JUL       -P1
## 14 -0.622680222  0.16189395     P12 2018   JUL 48samples
## 35  1.079475184  0.48107118     P22 2019   APR       -P1
## 4   0.667278931  0.40447979     P22 2018   APR       -P1
## 29 -0.097041549  0.43460112     P22 2018   SEP       -P2
## 48 -0.258342116  0.57004444     P22 2019   JUL 48samples
## 17  0.212365195  0.64240115     P22 2018   JUL       -P1
## 3   0.813363720  0.62487343     P22 2018   APR 48samples
## 5   0.177077983  0.12877803     P28 2018   APR 48samples
## 38  0.165371958 -0.17471828     P28 2019   APR       -P1
## 30 -0.336447310  0.16148928     P28 2018   SEP       -P2
## 18 -0.571945077  0.49853483     P28 2018   JUL 48samples
## 19 -0.523413743  0.56877836     P28 2018   JUL       -P1
## 6   0.010912322  0.28216640     P28 2018   APR       -P1
## 21 -0.233613126 -0.36011027     P38 2018   JUL       -P1
## 31 -0.699579613 -0.04538073     P38 2018   SEP       -P2
## 20 -0.563789366  0.06512851     P38 2018   JUL 48samples
## 22 -0.206922933 -0.06308197     P38 2018   JUL       -P2
## 8   0.591186080 -0.87070343    P402 2018   APR 48samples
## 9   0.548797962 -0.94943279    P402 2018   APR       -P1
## 32 -0.521996522 -0.79012961    P402 2018   SEP       -P2
## 23 -0.566423990  0.18445570    P402 2018   JUL       -P2
## 11  0.844780444 -0.44981442      P4 2018   APR       -P1
## 10  0.666934421 -0.47573956      P4 2018   APR 48samples
## 33 -0.481717290 -0.52361003      P4 2018   SEP       -P2
## 25 -0.562367133  0.33926174      P4 2018   JUL       -P1
## 13  0.889512353  0.06190187      P8 2018   APR       -P1
## 43  0.851043590 -0.06687479      P8 2019   APR 48samples
## 45  0.760543725 -0.08054055      P8 2019   APR       -P1
## 44  0.579301397 -0.08084995      P8 2019   APR       -P1
## 26 -0.454892841  0.15837750      P8 2018   JUL 48samples
## 27  0.114730656  0.30994575      P8 2018   JUL       -P1
## 12  0.699510275  0.15350067      P8 2018   APR 48samples
station_plot = ggplot(data = data.scores, aes(x = NMDS1, y = NMDS2))+
  geom_polygon(data=hull.data,aes(x=NMDS1,y=NMDS2,fill=station,group=station),alpha=0.30) + 
     geom_point(data = data.scores, aes(colour = station), size = 3, alpha = 0.5) + 
     scale_colour_manual(values = c("orange", "steelblue","darkgreen","violet","red","darkblue","limegreen")) +
  scale_fill_manual(values = c("orange", "steelblue","darkgreen","violet","red","darkblue","limegreen"))+
     theme(axis.title = element_text(size = 10, face = "bold", colour = "grey30"), 
     panel.background = element_blank(), panel.border = element_rect(fill = NA, colour = "grey30"), 
     axis.ticks = element_blank(), axis.text = element_blank(), legend.key = element_blank(), 
     legend.title = element_text(size = 10, face = "bold", colour = "grey30"), 
     legend.text = element_text(size = 9, colour = "grey30")) +
     labs(colour = "Station")
  
     
station_plot

setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = station_plot, width = 5, height = 3, dpi = 300, filename = "station_plot.png")

There is quite a bit of overlap among stations, but you can see that some stations are distinct from one another. For example, P22 is different from P38, but P38 and P12 are very similar.
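A PERMANOVA would be one way to test that impression statistically (a sketch, not run in this report; it assumes the rows of community_matrix_sqrt are in the same order as env):

#test whether station explains community composition (sketch, not run here)
adonis2(community_matrix_sqrt ~ station, data = env,
        method = "bray", permutations = 999)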

create year plot

make hulls

y2018 <- data.scores[data.scores$year == "2018", ][chull(data.scores[data.scores$year == 
    "2018", c("NMDS1", "NMDS2")]), ]  
y2019 <- data.scores[data.scores$year == "2019", ][chull(data.scores[data.scores$year == 
    "2019", c("NMDS1", "NMDS2")]), ]  

hull.data.2 <- rbind(y2018,y2019)  
hull.data.2
##          NMDS1       NMDS2 station year month       run
## 11  0.84478044 -0.44981442      P4 2018   APR       -P1
## 9   0.54879796 -0.94943279    P402 2018   APR       -P1
## 32 -0.52199652 -0.79012961    P402 2018   SEP       -P2
## 28 -0.64879482 -0.25970112     P12 2018   SEP       -P2
## 31 -0.69957961 -0.04538073     P38 2018   SEP       -P2
## 15 -0.72977264  0.19024808     P12 2018   JUL       -P1
## 18 -0.57194508  0.49853483     P28 2018   JUL 48samples
## 19 -0.52341374  0.56877836     P28 2018   JUL       -P1
## 17  0.21236519  0.64240115     P22 2018   JUL       -P1
## 3   0.81336372  0.62487343     P22 2018   APR 48samples
## 13  0.88951235  0.06190187      P8 2018   APR       -P1
## 43  0.85104359 -0.06687479      P8 2019   APR 48samples
## 34 -0.06721925 -0.48243337     P12 2019   APR 48samples
## 46 -0.51833742 -0.24861449     P12 2019   JUL 48samples
## 48 -0.25834212  0.57004444     P22 2019   JUL 48samples
## 35  1.07947518  0.48107118     P22 2019   APR       -P1
year_plot = ggplot(data = data.scores, aes(x = NMDS1, y = NMDS2)) + 
  geom_polygon(data=hull.data.2,aes(x=NMDS1,y=NMDS2,fill=year,group=year),alpha=0.30)+
     geom_point(data = data.scores, aes(colour = year), size = 3, alpha = 0.5) + 
     scale_colour_manual(values = c("orange", "steelblue")) + 
  scale_fill_manual(values = c("orange", "steelblue")) + 
     theme(axis.title = element_text(size = 10, face = "bold", colour = "grey30"), 
     panel.background = element_blank(), panel.border = element_rect(fill = NA, colour = "grey30"), 
     axis.ticks = element_blank(), axis.text = element_blank(), legend.key = element_blank(), 
     legend.title = element_text(size = 10, face = "bold", colour = "grey30"), 
     legend.text = element_text(size = 9, colour = "grey30")) +
     labs(colour = "Year")
     
year_plot

setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = year_plot, width = 5, height = 3, dpi = 300, filename = "year_plot.png")

Between these two years, there doesn’t seem to be much difference.

create month plot

make hulls

APR <- data.scores[data.scores$month == "APR", ][chull(data.scores[data.scores$month == 
    "APR", c("NMDS1", "NMDS2")]), ]  
JUL <- data.scores[data.scores$month == "JUL", ][chull(data.scores[data.scores$month == 
    "JUL", c("NMDS1", "NMDS2")]), ]  
SEP <- data.scores[data.scores$month == "SEP", ][chull(data.scores[data.scores$month == 
    "SEP", c("NMDS1", "NMDS2")]), ]  

hull.data.3 <- rbind(APR,JUL,SEP)  
hull.data.3
##          NMDS1       NMDS2 station year month       run
## 11  0.84478044 -0.44981442      P4 2018   APR       -P1
## 9   0.54879796 -0.94943279    P402 2018   APR       -P1
## 34 -0.06721925 -0.48243337     P12 2019   APR 48samples
## 41 -0.41125703 -0.13959010     P38 2019   APR       -P1
## 6   0.01091232  0.28216640     P28 2018   APR       -P1
## 3   0.81336372  0.62487343     P22 2018   APR 48samples
## 35  1.07947518  0.48107118     P22 2019   APR       -P1
## 21 -0.23361313 -0.36011027     P38 2018   JUL       -P1
## 46 -0.51833742 -0.24861449     P12 2019   JUL 48samples
## 15 -0.72977264  0.19024808     P12 2018   JUL       -P1
## 18 -0.57194508  0.49853483     P28 2018   JUL 48samples
## 19 -0.52341374  0.56877836     P28 2018   JUL       -P1
## 17  0.21236519  0.64240115     P22 2018   JUL       -P1
## 27  0.11473066  0.30994575      P8 2018   JUL       -P1
## 32 -0.52199652 -0.79012961    P402 2018   SEP       -P2
## 28 -0.64879482 -0.25970112     P12 2018   SEP       -P2
## 31 -0.69957961 -0.04538073     P38 2018   SEP       -P2
## 29 -0.09704155  0.43460112     P22 2018   SEP       -P2
month_plot = ggplot(data = data.scores, aes(x = NMDS1, y = NMDS2)) + 
  geom_polygon(data=hull.data.3,aes(x=NMDS1,y=NMDS2,fill=month,group=month),alpha=0.30)+ 
     geom_point(data = data.scores, aes(colour = month), size = 3, alpha = 0.5) + 
     scale_colour_manual(values = c("orange", "steelblue","darkgreen")) + 
       scale_fill_manual(values = c("orange", "steelblue","darkgreen")) +
     theme(axis.title = element_text(size = 10, face = "bold", colour = "grey30"), 
     panel.background = element_blank(), panel.border = element_rect(fill = NA, colour = "grey30"), 
     axis.ticks = element_blank(), axis.text = element_blank(), legend.key = element_blank(), 
     legend.title = element_text(size = 10, face = "bold", colour = "grey30"), 
     legend.text = element_text(size = 9, colour = "grey30")) +
     labs(colour = "Month")
     
month_plot

setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = month_plot, width = 5, height = 3, dpi = 300, filename = "month_plot.png")

There seem to be some clear seasonal differences along axis 1 - April appears to be distinct

create run plot

run_plot = ggplot(data = data.scores, aes(x = NMDS1, y = NMDS2)) + 
     geom_point(data = data.scores, aes(colour = run), size = 3, alpha = 0.5) + 
     scale_colour_manual(values = c("orange", "steelblue","darkgreen")) + 
     theme(axis.title = element_text(size = 10, face = "bold", colour = "grey30"), 
     panel.background = element_blank(), panel.border = element_rect(fill = NA, colour = "grey30"), 
     axis.ticks = element_blank(), axis.text = element_blank(), legend.key = element_blank(), 
     legend.title = element_text(size = 10, face = "bold", colour = "grey30"), 
     legend.text = element_text(size = 9, colour = "grey30")) +
     labs(colour = "Run")
     
run_plot

setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = run_plot, width = 5, height = 3, dpi = 300, filename = "run_plot.png")

Diversity Indices

#convert to matrix
diversity_table<-as.matrix(read_table_final)
#convert to numeric matrix
diversity_matrix<- matrix(as.numeric(diversity_table),   
                  ncol = ncol(diversity_table))
#add in row and column labels
colnames(diversity_matrix) <- colnames(diversity_table)
rownames(diversity_matrix) <- rownames(diversity_table)

get shannon diversity index

shannon_scores<-as.data.frame(diversity(diversity_matrix, index="shannon"))

add environment columns back in

#change column name
colnames(shannon_scores)[1] <- "shannon_index"
#add environmental columns
shannon_scores$station = env$station
shannon_scores$year = env$year
shannon_scores$month = env$month
shannon_scores$run = env$run

make violin plot

diversity<- ggviolin(shannon_scores, x = "station", y = "shannon_index",
 add = "boxplot", fill = "station") 
diversity

setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = diversity, width = 5, height = 3, dpi = 300, filename = "diversity.png")

Overall, the highest diversity was seen at P22, the station located in the Strait of Juan de Fuca. This site is closest to the ocean, so it may contain both inland and more offshore taxa.
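A simple way to check whether diversity differs among stations (a sketch, not run in this report):

#nonparametric test for differences in Shannon diversity among stations (sketch)
kruskal.test(shannon_index ~ station, data = shannon_scores)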

Taxa plot

get the top 20 most abundant species

#convert to proportions
diversity_prop<-diversity_matrix/rowSums(diversity_matrix)
species_means <- as.data.frame(colMeans(diversity_prop))
colnames(species_means)[1] <- "abundance"
species_means$abundance<-as.numeric(species_means$abundance)
species_means <- rownames_to_column(species_means, "species")
species_means <-as.data.frame(species_means[order(-species_means$abundance),])

#get top 20 species
top_20_species <- head(species_means, 20)

plot

top20<-ggplot(top_20_species, aes(y=abundance, x=reorder(species, abundance))) + 
    geom_bar(position="dodge", stat="identity")+ theme_bw()+coord_flip()+ylab("Mean Proportion of Reads")+xlab("species")

top20

setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = top20, width = 5, height = 3, dpi = 300, filename = "top20.png")

The samples were dominated by Calanus pacificus (a copepod) and Aegina citrea (a jellyfish). Other abundant species included Eutonina indicans and Clytia gregaria, two other jellyfish. Other common copepods, such as Centropages abdominalis and Metridia pacifica, were also represented.