load packages
library(knitr)
library(BiocManager)
library(ShortRead)
library("dada2")
library(dplyr)
library(janitor)
library(vegan)
library(data.table)
library(ggplot2)
library(ggpubr)
library(tidyverse)
library(DT)
library(RgoogleMaps)
library(mapplots)
My primary end product is a table with species designations for ASVs and read counts for each species in each sample. This table can be used as the basis for any statistical analyses or plots made in the future.
visualize read table
final_table <- read.csv("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/read_table.csv")
datatable(final_table, options = list(scrollX = T,scrollY=T,pageLength = 10,
lengthMenu = c(10, 20,50,100)))
The report is located here: https://rpubs.com/HailaSchultz/full-pipeline-report
This project uses metabarcoding with the LrCOI marker to taxonomically identify zooplankton samples from Washington Ocean Acidification Center cruises. Samples were collected from seven stations spread throughout Puget Sound in April, July, and September of 2018-2020. Zooplankton were collected with a 200 um net towed vertically from 10 m above the bottom of the water column to the surface. A 5 ml (settled volume) subsample of each sample was homogenized, dried, and subsampled before being sent to Ohio State University for library prep and sequencing.
Map of the seven sites visited
This workflow is adapted from the following pipeline: https://benjjneb.github.io/dada2/ITS_workflow.html
This is step 1 of my project workflow. In this step, I am organizing my fastq files and checking their hash values. I downloaded my data from a shared google drive folder onto my computer and then uploaded into RStudio. Originally, the fastq files were located in multiple layers of directories.
Install the dada2 and ShortRead packages using BiocManager (both are Bioconductor packages, so they are installed through BiocManager rather than install.packages())
install.packages("BiocManager")
BiocManager::install("ShortRead")
BiocManager::install("dada2")
move multiple files from multiple layers of directories into the data directory
cd ../data
find . -name '*.gz' -exec mv {} ../data/fastq-files/ \;
create checksums file for all fastq files in data folder
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/
shasum ../data/fastq-files/*.gz > fastq_checksums.sha
head fastq_checksums.sha
## 086979b069d473f74906329a83328e67d19d5d77 ../data/fastq-files/2018APRP12Ve-COI-48samples_S145_L001_R1_001.fastq.gz
## 7df8c51ee4e5d8108057236af4932183f6829080 ../data/fastq-files/2018APRP12Ve-COI-48samples_S145_L001_R2_001.fastq.gz
## c3faf8dcb35ceb847f155886ada5e013032d3725 ../data/fastq-files/2018APRP12VeA-COI-P1_S33_L001_R1_001.fastq.gz
## 48466cba94926b3f54c6f51c0e133dd77f6529fd ../data/fastq-files/2018APRP12VeA-COI-P1_S33_L001_R2_001.fastq.gz
## 6d369cbae9ce9959ca063ab7f82467fb3962b973 ../data/fastq-files/2018APRP22Ve-COI-48samples_S146_L001_R1_001.fastq.gz
## 7b171b3594a341a2ab6367be2281711b62ebadbc ../data/fastq-files/2018APRP22Ve-COI-48samples_S146_L001_R2_001.fastq.gz
## 9304abcd1c332881d215efe8e0138a4dccbe14c5 ../data/fastq-files/2018APRP22VeA-COI-P1_S45_L001_R1_001.fastq.gz
## d4d782733bbf8596ea9f4258b7e5c4312022a372 ../data/fastq-files/2018APRP22VeA-COI-P1_S45_L001_R2_001.fastq.gz
## c17f693c662e0322a1ad0bbac58c94d46d710333 ../data/fastq-files/2018APRP28Ve-COI-48samples_S147_L001_R1_001.fastq.gz
## 65bff1db647d30419d3b30fb1707b0c4d82d5202 ../data/fastq-files/2018APRP28Ve-COI-48samples_S147_L001_R2_001.fastq.gz
check if files match original
cd ../data
shasum -c fastq_checksums.sha > keys.txt
head keys.txt
## ../data/fastq-files/2018APRP12Ve-COI-48samples_S145_L001_R1_001.fastq.gz: OK
## ../data/fastq-files/2018APRP12Ve-COI-48samples_S145_L001_R2_001.fastq.gz: OK
## ../data/fastq-files/2018APRP12VeA-COI-P1_S33_L001_R1_001.fastq.gz: OK
## ../data/fastq-files/2018APRP12VeA-COI-P1_S33_L001_R2_001.fastq.gz: OK
## ../data/fastq-files/2018APRP22Ve-COI-48samples_S146_L001_R1_001.fastq.gz: OK
## ../data/fastq-files/2018APRP22Ve-COI-48samples_S146_L001_R2_001.fastq.gz: OK
## ../data/fastq-files/2018APRP22VeA-COI-P1_S45_L001_R1_001.fastq.gz: OK
## ../data/fastq-files/2018APRP22VeA-COI-P1_S45_L001_R2_001.fastq.gz: OK
## ../data/fastq-files/2018APRP28Ve-COI-48samples_S147_L001_R1_001.fastq.gz: OK
## ../data/fastq-files/2018APRP28Ve-COI-48samples_S147_L001_R2_001.fastq.gz: OK
To get the cutadapt software, I downloaded the miniconda installer using curl
#download miniconda from url
cd ../software
curl -O https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-MacOSX-arm64.sh
I then moved to the directory it was in by typing cd software in the terminal, and typed bash Miniconda3-py310_23.3.1-0-MacOSX-arm64.sh to start the setup process. Conda is located in: /Users/hailaschultz/miniconda3.
I checked that the installation worked by typing conda list in the terminal. I then configured the channels:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
and then created the environment with conda create -n cutadaptenv cutadapt
On my personal computer, I had to use
CONDA_SUBDIR=osx-64 conda create -n cutadaptenv cutadapt
I ran conda init bash
To activate the conda environment, in the terminal I typed
conda activate cutadaptenv
First, I wanted to check the quality of the raw data so I could compare it with later steps in the pipeline. Download FastQC:
curl -O https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.7.zip
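Before FastQC can be run, the downloaded zip has to be unpacked and the fastqc script made executable. A minimal sketch in R, assuming the archive unpacks into a FastQC/ folder inside the software directory:
#unpack the FastQC archive (assumes it contains a FastQC/ folder)
unzip("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/fastqc_v0.11.7.zip", exdir = "/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software")
#make the fastqc wrapper script executable
Sys.chmod("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/FastQC/fastqc", mode = "0755")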
use fastqc
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/FastQC/fastqc *fastq.gz -o /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/raw-read-qc
make multiqc report
To install MultiQC, in the terminal I ran conda install multiqc. Then:
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/raw-read-qc
multiqc .
screenshot of multiqc mean quality scores
I then made lists of the file names in order to process multiple files at once. I also specified the primer sequences used in this project.
#set file path
path <- "../data/fastq-files"
#make matched list of forward and reverse reads
fnFs <- sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))
#specify primer sequences
FWD <- "GGWACWGGWTGAACWGTWTAYCCYCC"
REV <- "TANACYTCNGGRTGNCCRAARAAYCA"
#in the reverse primer, I replaced the inosine (I) bases with N
get orientations of primers
allOrients <- function(primer) {
# Create all orientations of the input sequence
require(Biostrings)
dna <- DNAString(primer) # The Biostrings works w/ DNAString objects rather than character vectors
orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna),
RevComp = reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
## Forward Complement
## "GGWACWGGWTGAACWGTWTAYCCYCC" "CCWTGWCCWACTTGWCAWATRGGRGG"
## Reverse RevComp
## "CCYCCYATWTGWCAAGTWGGWCAWGG" "GGRGGRTAWACWGTTCAWCCWGTWCC"
I then checked whether the primers were actually present in my files. The following code counts the number of reads in which each primer occurs in the first sample.
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs[[1]]))
## Forward Complement Reverse RevComp
## FWD.ForwardReads 63428 0 0 0
## FWD.ReverseReads 0 0 0 0
## REV.ForwardReads 0 0 0 0
## REV.ReverseReads 60704 0 0 0
It looks like the forward and reverse primers are present in the forward and reverse reads.
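As an extra sanity check (not in the original tutorial), the same counts can be run on a couple more samples to confirm the pattern holds beyond the first file:
#count forward-primer hits in the forward reads and reverse-primer hits in the reverse reads of the first three samples
sapply(fnFs[1:3], function(fn) primerHits(FWD, fn))
sapply(fnRs[1:3], function(fn) primerHits(REV, fn))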
Tell R the path to the cutadapt software
cutadapt <- "/Users/hailaschultz/miniconda3/envs/cutadaptenv/bin/cutadapt"
#verify that R knows where to find the software by asking for the version
system2(cutadapt, args = "--version")
Tell cutadapt what to cut
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)
Run Cutadapt
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
fnFs[i], fnRs[i])) # input files
}
Check if primers were actually trimmed
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
## Forward Complement Reverse RevComp
## FWD.ForwardReads 0 0 0 0
## FWD.ReverseReads 0 0 0 0
## REV.ForwardReads 0 0 0 0
## REV.ReverseReads 5 0 0 0
There were a few reverse reads that were not trimmed, but overall it looks good. However, after trimming the primers, about half of the files ended up with zero data/reads. I think it is an issue with the read names. The error I get is: ERROR: Error in sequence file at unknown line: Reads are improperly paired. Read name ‘M02815:67:000000000-KDDDG:1:1101:17448:1214 1:N:0:CGGAGCCT+GTAAGGAG’ in file 1 does not match ‘M02815:67:000000000-KDDDG:1:1101:20493:1232 2:N:0:CGTACTAG+CGTCTAAT’ in file 2, etc.
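One way to find the affected samples (an extra check, not part of the original pipeline) is to count the reads in each raw R1/R2 pair; files whose counts disagree are the improperly paired ones. A sketch using ShortRead (note that this reads each file into memory):
#count reads in each raw forward and reverse file
nF <- sapply(fnFs, function(f) length(readFastq(f)))
nR <- sapply(fnRs, function(f) length(readFastq(f)))
#list samples where forward and reverse read counts disagree
basename(fnFs)[nF != nR]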
Delete files that didn’t make it through cutadapt - smaller than 1 MB in the cutadapt directory
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt
find . -type f -name "*.gz" -size -1M -delete
use fastqc
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/FastQC/fastqc *fastq.gz -o /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/after-cutadapt-qc
make multiqc report to check files after trimming primers
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/after-cutadapt-qc
multiqc .
screenshot of multiqc mean quality scores after cutadapt
Get the sample names for the files that have been trimmed
# Forward and reverse fastq filenames have the format SAMPLENAME_R1_001.fastq.gz and SAMPLENAME_R2_001.fastq.gz
cutFs <- sort(list.files(path.cut, pattern = "_R1_001.fastq.gz", full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "_R2_001.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have the format SAMPLENAME_XXX.fastq.gz
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][1]
sample.names <- unname(sapply(cutFs, get.sample.name))
head(sample.names)
## [1] "2018APRP12Ve-COI-48samples" "2018APRP12VeA-COI-P1"
## [3] "2018APRP22Ve-COI-48samples" "2018APRP22VeA-COI-P1"
## [5] "2018APRP28Ve-COI-48samples" "2018APRP28VeA-COI-P1"
plot forward read quality for first two files
plotQualityProfile(cutFs[1:2])
plot reverse read quality for first two files
plotQualityProfile(cutRs[1:2])
The forward reads are better quality, as we might expect.
Designate file names location for the filtered files
filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))
Before filtering, verify that all of your files have a matching pair. Both the forward and reverse reads for each sample are required for this step.
# get sample names to see which files match
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][1]
sample.namesF <- unname(sapply(cutFs, get.sample.name))
sample.namesR <- unname(sapply(cutRs, get.sample.name))
Check for differences between lists and remove files that don’t have a matching pair
difsR <- setdiff(sample.namesR,sample.namesF)
difsR
## character(0)
difsF <- setdiff(sample.namesF,sample.namesR)
difsF
## character(0)
It looks like there are no differences, so we can proceed
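Had either list been non-empty, the orphaned files would need to be dropped before filtering. A hypothetical sketch (not needed here, since both difsF and difsR are empty):
#keep only files whose sample name has a mate in the other direction (re-create filtFs/filtRs afterwards)
cutFs <- cutFs[!(sample.namesF %in% difsF)]
cutRs <- cutRs[!(sample.namesR %in% difsR)]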
Filtering parameters: maxN = 0 (no Ns allowed), maxEE = 5 for both forward and reverse reads, truncQ = 2, minLen = 100 (discard reads shorter than 100 bp), and rm.phix = TRUE to remove phiX reads. Note that this step takes a long time.
out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(5, 5),
truncQ = 2, minLen = 100, rm.phix = TRUE, compress = TRUE, multithread = TRUE)
Check how many files made it through filtering
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt
ls | wc -l
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt/filtered
ls | wc -l
## 97
## 96
learn the error rates
forward read error rates
errF <- learnErrors(filtFs, multithread=TRUE)
## 118001220 total bases in 432610 reads from 8 samples will be used for learning the error rates.
reverse read error rates
errR <- learnErrors(filtRs, multithread=TRUE)
## 118551091 total bases in 432610 reads from 8 samples will be used for learning the error rates.
plot the errors
plotErrors(errF, nominalQ=TRUE)
#looks okay! good to proceed
plotErrors(errR, nominalQ=TRUE)
#looks okay! good to proceed
use fastqc
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/fastq-files/cutadapt/filtered
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/FastQC/fastqc *fastq.gz -o /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/after-filtering-qc
make multiqc report to check files after quality filtering
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/after-filtering-qc
multiqc .
screenshot of multiqc mean quality scores after filtering
Dereplicate the reads. This step takes a little while
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names
Apply core sample inference algorithm
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
## Sample 1 - 53380 reads in 17256 unique sequences.
## Sample 2 - 71371 reads in 23411 unique sequences.
## Sample 3 - 39755 reads in 17518 unique sequences.
## Sample 4 - 57973 reads in 24827 unique sequences.
## Sample 5 - 17803 reads in 9422 unique sequences.
## Sample 6 - 43005 reads in 17527 unique sequences.
## Sample 7 - 73572 reads in 24346 unique sequences.
## Sample 8 - 75751 reads in 25518 unique sequences.
## Sample 9 - 98056 reads in 27273 unique sequences.
## Sample 10 - 76191 reads in 26240 unique sequences.
## Sample 11 - 111350 reads in 38178 unique sequences.
## Sample 12 - 19702 reads in 9230 unique sequences.
## Sample 13 - 43954 reads in 16795 unique sequences.
## Sample 14 - 55062 reads in 20106 unique sequences.
## Sample 15 - 75738 reads in 27839 unique sequences.
## Sample 16 - 50585 reads in 20748 unique sequences.
## Sample 17 - 31666 reads in 15173 unique sequences.
## Sample 18 - 54877 reads in 19944 unique sequences.
## Sample 19 - 51989 reads in 19995 unique sequences.
## Sample 20 - 60471 reads in 20567 unique sequences.
## Sample 21 - 59368 reads in 15549 unique sequences.
## Sample 22 - 66452 reads in 20503 unique sequences.
## Sample 23 - 47925 reads in 18563 unique sequences.
## Sample 24 - 60167 reads in 21126 unique sequences.
## Sample 25 - 67683 reads in 24503 unique sequences.
## Sample 26 - 50051 reads in 17031 unique sequences.
## Sample 27 - 81573 reads in 30100 unique sequences.
## Sample 28 - 29981 reads in 10539 unique sequences.
## Sample 29 - 28029 reads in 11744 unique sequences.
## Sample 30 - 34155 reads in 13168 unique sequences.
## Sample 31 - 30155 reads in 13727 unique sequences.
## Sample 32 - 13794 reads in 5809 unique sequences.
## Sample 33 - 24459 reads in 9974 unique sequences.
## Sample 34 - 56790 reads in 15822 unique sequences.
## Sample 35 - 81513 reads in 36376 unique sequences.
## Sample 36 - 95936 reads in 33051 unique sequences.
## Sample 37 - 61090 reads in 24304 unique sequences.
## Sample 38 - 44362 reads in 13688 unique sequences.
## Sample 39 - 57544 reads in 18866 unique sequences.
## Sample 40 - 53499 reads in 16883 unique sequences.
## Sample 41 - 31676 reads in 10532 unique sequences.
## Sample 42 - 43393 reads in 13493 unique sequences.
## Sample 43 - 71928 reads in 22277 unique sequences.
## Sample 44 - 23397 reads in 7994 unique sequences.
## Sample 45 - 25502 reads in 8731 unique sequences.
## Sample 46 - 75714 reads in 21533 unique sequences.
## Sample 47 - 74916 reads in 21520 unique sequences.
## Sample 48 - 60854 reads in 24756 unique sequences.
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
## Sample 1 - 53380 reads in 32871 unique sequences.
## Sample 2 - 71371 reads in 45076 unique sequences.
## Sample 3 - 39755 reads in 24242 unique sequences.
## Sample 4 - 57973 reads in 36323 unique sequences.
## Sample 5 - 17803 reads in 13096 unique sequences.
## Sample 6 - 43005 reads in 26214 unique sequences.
## Sample 7 - 73572 reads in 36970 unique sequences.
## Sample 8 - 75751 reads in 34450 unique sequences.
## Sample 9 - 98056 reads in 53913 unique sequences.
## Sample 10 - 76191 reads in 46465 unique sequences.
## Sample 11 - 111350 reads in 53411 unique sequences.
## Sample 12 - 19702 reads in 11521 unique sequences.
## Sample 13 - 43954 reads in 24136 unique sequences.
## Sample 14 - 55062 reads in 39868 unique sequences.
## Sample 15 - 75738 reads in 49706 unique sequences.
## Sample 16 - 50585 reads in 36477 unique sequences.
## Sample 17 - 31666 reads in 23794 unique sequences.
## Sample 18 - 54877 reads in 38824 unique sequences.
## Sample 19 - 51989 reads in 34222 unique sequences.
## Sample 20 - 60471 reads in 42821 unique sequences.
## Sample 21 - 59368 reads in 24020 unique sequences.
## Sample 22 - 66452 reads in 35367 unique sequences.
## Sample 23 - 47925 reads in 30913 unique sequences.
## Sample 24 - 60167 reads in 41059 unique sequences.
## Sample 25 - 67683 reads in 46762 unique sequences.
## Sample 26 - 50051 reads in 34700 unique sequences.
## Sample 27 - 81573 reads in 51233 unique sequences.
## Sample 28 - 29981 reads in 16666 unique sequences.
## Sample 29 - 28029 reads in 19035 unique sequences.
## Sample 30 - 34155 reads in 23534 unique sequences.
## Sample 31 - 30155 reads in 19035 unique sequences.
## Sample 32 - 13794 reads in 7944 unique sequences.
## Sample 33 - 24459 reads in 12034 unique sequences.
## Sample 34 - 56790 reads in 30034 unique sequences.
## Sample 35 - 81513 reads in 43030 unique sequences.
## Sample 36 - 95936 reads in 52721 unique sequences.
## Sample 37 - 61090 reads in 34848 unique sequences.
## Sample 38 - 44362 reads in 23177 unique sequences.
## Sample 39 - 57544 reads in 29463 unique sequences.
## Sample 40 - 53499 reads in 24299 unique sequences.
## Sample 41 - 31676 reads in 15476 unique sequences.
## Sample 42 - 43393 reads in 25007 unique sequences.
## Sample 43 - 71928 reads in 37035 unique sequences.
## Sample 44 - 23397 reads in 11245 unique sequences.
## Sample 45 - 25502 reads in 11927 unique sequences.
## Sample 46 - 75714 reads in 35334 unique sequences.
## Sample 47 - 74916 reads in 35689 unique sequences.
## Sample 48 - 60854 reads in 45922 unique sequences.
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
## sequence
## 1 TCTCGCAAGCAATCTTGCCCACGCTGGCGCCAGCGTGGACCTCACTATTTTCTCACTTCACTTAGCAGGCGTTTCCTCAATTCTGGGAGCAATTAATTTTATTACTACTATTATCAATATGAAGCCCCCTGCAATCTCACAATACCAGACACCCCTCTTTGTTTGATCCGTACTTATTACAGCTGTTCTTCTCCTACTCTCCCTGCCCGTCTTAGCCGCCGGCATCACAATACTACTAACTGACCGAAACCTCAACACCTCCTTCTTTGACCCCGCCGGAGGGGGAGACCCCATCCTATACCAGCACTTATTC
## 2 TTTATCTGCAGGAATTGCACATGCTGGGGCTTCTGTAGATATAGGAATCTTTTCACTACATATTGCTGGAGCCTCTTCTATTTTAGGAGCAGTTAATTTCATTACAACTGTAATTAATATGCGATCAGCAGGAATAACTATAGACCGTATCCCCTTATTCGTATGGTCTGTGTTTATTACAGCAATTTTACTTTTATTGTCTCTACCAGTTTTAGCTGGAGCAATCACAATACTTTTAACAGACCGTAATTTAAATACATCTTTCTTCGACCCTGCCGGGGGTGGTGACCCTATTTTATATCAACATCTATTC
## 3 TCTATCAGGTGCTCAAACTCATTCTGGAGGTTCTGTAGATATGGCTATTTTTAGTTTACATTGTGCAGGTGCTTCATCAATTATGGGTGCTATAAATTTTATTACCACTATTTTTAATATGAGAGCTCCTGGATTAACAATGGATAAATTACCATTATTTGTATGATCAGTTTTAATTACTGCTTTTCTATTATTATTATCTCTACCTGTACTAGCAGGTGCAATAACAATGTTATTAACAGACAGAAATTTCAATACAACATTTTTTGATCCAGCAGGTGGTGGTGATCCTGTTTTATATCAACATTTATTC
## 4 CTTATCAAGAAATATCGCCCATGCTGGCGCGTCTGTAGATTTTGCAATTTTTTCTCTCCATTTAGCAGGAGTAAGCTCGATCTTAGGCGCCGTGAATTTTATTAGAACCTTAGGAAATCTGCGAGTATTTGGAATAATTCTAGATCGTATGCCCCTGTTCGCATGGGCAGTTCTAATTACGGCTGTACTATTACTCCTCTCCTTGCCAGTTTTGGCAGGGGCTATCACAATACTTCTCACTGACCGTAACTTAAATACCACATTTTACGATGTGGGGGGGGGTGGAGATCCTATTCTATATCAACATTTATTT
## 5 ACTAAGCTCGATTCAGGCTCATTCCGGAGGTTCGGTGGATATGGCTATATTTAGTTTACATTTGGCAGGGGCTTCTTCTATCATGGGTGCTATTAATTTTATTACTACCATTCTGAACATGAGAGCCCCTGGAATGACAATGGATAAGATACCCCTATTTGTTTGGTCTGTATTAGTTACTGCAATATTATTATTGTTATCCTTGCCTGTTTTGGCTGGAGCAATAACCATGTTACTAACCGACAGGAACTTTAATACATCTTTTTTCGATCCTGCTGGGGGAGGAGACCCTATTTTGTTTCAACATCTTTTT
## 6 CTTATCAAGAAATATCGCCCATGCTGGCGCGTCTGTAGATTTTGCAATTTTTTCTCTCCATTTAGCAGGAGTAAGCTCGATCTTAGGCGCCGTGAATTTTATTAGAACCTTAGGAAATCTGCGAGTATTTGGAATAATTCTAGATCGTATGCCCCTGTTCGCATGGGCAGTTCTAATTACGGCTGTACTATTACTCCTCTCCTTGCCAGTTTTGGCAGGGGCTATCACAATACTTCTCACTGACCGTAACTTAAATACCACATTTTACGATGTGGGGGGGGGTGGAGATCCTATCCTATACCAACATTTATTT
## abundance forward reverse nmatch nmismatch nindel prefer accept
## 1 17450 1 2 234 0 0 1 TRUE
## 2 11522 2 1 233 0 0 1 TRUE
## 3 3288 3 3 235 0 0 1 TRUE
## 4 2661 4 7 235 0 0 1 TRUE
## 5 1627 5 5 233 0 0 1 TRUE
## 6 1291 4 11 235 0 0 1 TRUE
seqtab <- makeSequenceTable(mergers)
inspect the sequence length distribution (a histogram sketch is included after the table below)
table(nchar(getSequences(seqtab)))
##
## 151 206 215 234 253 255 265 268 269 283 286 288 291 294 295 296
## 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1
## 297 298 300 301 302 303 304 306 307 308 309 310 311 312 313 314
## 2 7 3 1 1 7 1 1 2 6 2 258 17 44 1526 17
## 315 316 317 318 321 329 352
## 7 29 6 1 1 1 1
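The histogram mentioned above can be made directly from the sequence table; a quick sketch:
#histogram of merged ASV lengths; most sequences fall around 313 bp
hist(nchar(getSequences(seqtab)), breaks = 50, xlab = "ASV length (bp)", main = "Merged sequence lengths")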
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
evaluate the fraction of reads retained after chimera removal
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9818287
chimeras account for a very low percentage of reads (about 2%)
track how many reads made it through the pipeline
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
No step removed a majority of the reads. The step that creates the out table takes too long to run, so a screenshot of the table is included below
screenshot of tracking table
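The same information can be expressed as the fraction of input reads surviving each step, which makes the "no step removed a majority of the reads" statement easy to verify:
#fraction of input reads remaining after each step
round(track / track[, "input"], 3)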
#change directory
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software
#download software
curl -O https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.14.0+-x64-macosx.tar.gz
extract the software archive
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software
tar -xzf ncbi-blast-2.14.0+-x64-macosx.tar.gz
Check if it’s working. I had to go into my computer settings and give permissions to use blast because it is from an “unidentified developer”
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/ncbi-blast-2.14.0+/bin/blastx -h
## USAGE
## blastx [-h] [-help] [-import_search_strategy filename]
## [-export_search_strategy filename] [-task task_name] [-db database_name]
## [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
## [-negative_gilist filename] [-negative_seqidlist filename]
## [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
## [-negative_taxidlist filename] [-ipglist filename]
## [-negative_ipglist filename] [-entrez_query entrez_query]
## [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
## [-subject subject_input_file] [-subject_loc range] [-query input_file]
## [-out output_file] [-evalue evalue] [-word_size int_value]
## [-gapopen open_penalty] [-gapextend extend_penalty]
## [-qcov_hsp_perc float_value] [-max_hsps int_value]
## [-xdrop_ungap float_value] [-xdrop_gap float_value]
## [-xdrop_gap_final float_value] [-searchsp int_value]
## [-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
## [-soft_masking soft_masking] [-matrix matrix_name]
## [-threshold float_value] [-culling_limit int_value]
## [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
## [-subject_besthit] [-window_size int_value] [-ungapped] [-lcase_masking]
## [-query_loc range] [-strand strand] [-parse_deflines]
## [-query_gencode int_value] [-outfmt format] [-show_gis]
## [-num_descriptions int_value] [-num_alignments int_value]
## [-line_length line_length] [-html] [-sorthits sort_hits]
## [-sorthsps sort_hsps] [-max_target_seqs num_sequences]
## [-num_threads int_value] [-mt_mode int_value] [-remote]
## [-comp_based_stats compo] [-use_sw_tback] [-version]
##
## DESCRIPTION
## Translated Query-Protein Subject BLAST 2.14.0+
##
## Use '-help' to print detailed descriptions of command line arguments
Download the reference. The reference used here is from MetaZooGene and is the file for all marine fauna and flora of the world oceans
#download from url
cd /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data
curl -O https://www.st.nmfs.noaa.gov/copepod/collaboration/metazoogene/atlas/data-src/MZGfasta-coi__MZGdbALL__o07__B.fasta
#unzip
gunzip -k MZGfasta-coi__MZGdbALL__o00__A.fasta
Make the database. Since we are working with nucleotide sequences, make sure you use -dbtype nucl
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/ncbi-blast-2.14.0+/bin/makeblastdb \
-in /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/MZGfasta-coi__MZGdbALL__o00__A.fasta \
-dbtype nucl \
-out /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/blastdb/MZGfasta-coi__MZGdbALL__o00__A
transpose table
seqtab.nochim_trans <- as.data.frame(t(seqtab.nochim)) %>% rownames_to_column(var = "sequence") %>%
rowid_to_column(var = "OTUNumber") %>% mutate(OTUNumber = sprintf("otu%04d",
OTUNumber)) %>% mutate(sequence = str_replace_all(sequence, "(-|\\.)", ""))
convert to fasta file
df <- seqtab.nochim_trans
seq_out <- Biostrings::DNAStringSet(df$sequence)
# the tutorial's taxonomy columns (Supergroup, Division, Class, etc.) are not in this table, so the OTU numbers are used as names
names(seq_out) <- df$OTUNumber
Biostrings::writeXStringSet(seq_out, str_c( "Zoop_ASV.fasta"), compress = FALSE,
width = 20000)
#I had to move this file from my working directory to my data directory
examine fasta file
head /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/Zoop_ASV.fasta
## >otu0001
## TTTATCTGCAGGAATTGCACATGCTGGGGCTTCTGTAGATATAGGAATCTTTTCACTACATATTGCTGGAGCCTCTTCTATTTTAGGAGCAGTTAATTTCATTACAACTGTAATTAATATGCGATCAGCAGGAATAACTATAGACCGTATCCCCTTATTCGTATGGTCTGTGTTTATTACAGCAATTTTACTTTTATTGTCTCTACCAGTTTTAGCTGGAGCAATCACAATACTTTTAACAGACCGTAATTTAAATACATCTTTCTTCGACCCTGCCGGGGGTGGTGACCCTATTTTATATCAACATCTATTC
## >otu0002
## CTTATCAAGAAATATCGCCCATGCTGGCGCGTCTGTAGATTTTGCAATTTTTTCTCTCCATTTAGCAGGAGTAAGCTCGATCTTAGGCGCCGTGAATTTTATTAGAACCTTAGGAAATCTGCGAGTATTTGGAATAATTCTAGATCGTATGCCCCTGTTCGCATGGGCAGTTCTAATTACGGCTGTACTATTACTCCTCTCCTTGCCAGTTTTGGCAGGGGCTATCACAATACTTCTCACTGACCGTAACTTAAATACCACATTTTACGATGTGGGGGGGGGTGGAGATCCTATTCTATATCAACATTTATTT
## >otu0003
## CTTATCAAGAAATATCGCCCATGCTGGCGCGTCTGTAGATTTTGCAATTTTTTCTCTCCATTTAGCAGGAGTAAGCTCGATCTTAGGCGCCGTGAATTTTATTAGAACCTTAGGAAATCTGCGAGTATTTGGAATAATTCTAGATCGTATGCCCCTGTTCGCATGGGCAGTTCTAATTACGGCTGTACTATTACTCCTCTCCTTGCCAGTTTTGGCAGGGGCTATCACAATACTTCTCACTGACCGTAACTTAAATACCACATTTTACGATGTGGGGGGGGGTGGAGATCCTATCCTATACCAACATTTATTT
## >otu0004
## CTTATCAAGAAATATCGCCCATGCTGGCGCGTCTGTAGATTTTGCAATTTTTTCTCTCCATTTAGCAGGAGTAAGCTCGATCTTAGGCGCCGTGAATTTTATTAGAACCCTAGGAAATCTGCGAGTATTTGGAATAATTCTAGATCGTATGCCCCTGTTCGCATGGGCAGTTCTAATTACGGCTGTACTATTACTCCTCTCCTTGCCAGTTTTGGCAGGGGCTATCACAATACTTCTCACTGACCGTAACTTAAATACCACATTTTACGATGTGGGGGGGGGCGGAGATCCTATTCTATATCAACATTTATTT
## >otu0005
## CTTATCAGGTTCACAAACCCATTCAGGAGGATCAGTAGATATGGCTATATTTAGTTTACATTGTGCTGGTGCATCTTCTATTATGGGAGCAATAAATTTTATTACCACAATTTTCAATATGAGAGCCCCAGGATTAACTTTAGATAAATTACCATTATTCGTATGATCTGTATTAATCACTGCTTTCTTATTATTATTATCTCTACCTGTGTTAGCCGGTGCTATAACAATGTTGTTAACCGACAGAAATTTCAATACTACATTTTTTGATCCTGCAGGAGGTGGAGATCCTGTATTATACCAACATTTATTT
Since we are looking at nucleotides, make sure you are using blastn.
/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/software/ncbi-blast-2.14.0+/bin/blastn \
-query /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/data/Zoop_ASV.fasta \
-db /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/blastdb/MZGfasta-coi__MZGdbALL__o00__A \
-out /Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/Zoop_ASV.tab \
-evalue 1E-20 \
-num_threads 8 \
-max_target_seqs 1 \
-outfmt 6
Examine blast output
head -2 ../output/Zoop_ASV.tab
wc -l ../output/Zoop_ASV.tab
## otu0001 QHAK892-21__Aegina_citrea 100.000 313 0 0 1 313 346 658 3.64e-164 579
## otu0002 CAISN473-13__Calanus_pacificus 100.000 313 0 0 1 313 345 657 3.64e-164 579
## 1849 ../output/Zoop_ASV.tab
These are species we expect to see - looks like blast worked!
merge Blast IDs with OTU table
#change directory
setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output")
#read ASV table into R
Zoop_ASV<-read.table("Zoop_ASV.tab")
#rename columns
colnames(Zoop_ASV) = c("OTUNumber", "Species", "pident", "length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore")
#merge tables by otu number
read_table<-left_join(Zoop_ASV, seqtab.nochim_trans, by = join_by("OTUNumber" == "OTUNumber"))
datatable(read_table, fillContainer = T)
Visualize part of the table here
Filter hits
# keep only ASVs with alignment length > 300 bp
read_table_sub <- subset(read_table, read_table$length>300)
#keep only ASVs with pident > 98%
read_table_sub2 <- subset(read_table_sub, read_table_sub$pident>98)
Sum Sequences by Taxa
#remove unneeded columns
read_table_summed<- read_table_sub2[ -c(1,3:13) ]
#remove prefix
read_table_summed$Species <- sub(".*__", "", read_table_summed$Species)
#summarize by species
by_species <- read_table_summed %>%
group_by(Species)
read_table_species<-by_species %>%
summarise_all(sum)
datatable(read_table_species, fillContainer = T)
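summarise_all() still works but is superseded in current dplyr; an equivalent form with across() would be (optional, and should give the same table):
#equivalent summary using the newer across() syntax
read_table_species_alt <- read_table_summed %>%
  group_by(Species) %>%
  summarise(across(everything(), sum))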
write read table to .csv
write.csv(read_table_species, "/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/output/read_table.csv", row.names=FALSE)
# Transpose data
read_table_transposed <- t(read_table_species)
datatable(read_table_transposed, fillContainer = T)
Prep table for NMDS
# get row and colnames in order
rownames(read_table_transposed) <- colnames(read_table_species)
#move first row to column names
read_table_final<-row_to_names(read_table_transposed, 1, remove_rows_above = FALSE)
#convert to matrix
community_matrix<-as.matrix(read_table_final)
#convert to numeric matrix
community_matrix<- matrix(as.numeric(community_matrix),
ncol = ncol(community_matrix))
#convert to proportions
community_matrix<-community_matrix/rowSums(community_matrix)
#arcsine sqrt transformation
community_matrix_sqrt<-asin(sqrt(community_matrix))
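Note that matrix(as.numeric(...)) drops the row and column names, so the NMDS scores are matched to samples by row order only. An optional step, mirroring what is done later for the diversity matrix, restores the labels:
#restore sample (row) and species (column) names lost in the numeric conversion
rownames(community_matrix_sqrt) <- rownames(read_table_final)
colnames(community_matrix_sqrt) <- colnames(read_table_final)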
Run NMDS
#run NMDS
NMDS=metaMDS(community_matrix_sqrt,distance="bray",trymax=100)
## Run 0 stress 0.1899421
## Run 1 stress 0.1960882
## Run 2 stress 0.1892941
## ... New best solution
## ... Procrustes: rmse 0.06439918 max resid 0.3391543
## Run 3 stress 0.1893151
## ... Procrustes: rmse 0.001927401 max resid 0.01091176
## Run 4 stress 0.1952756
## Run 5 stress 0.1994433
## Run 6 stress 0.1960878
## Run 7 stress 0.2094945
## Run 8 stress 0.1959399
## Run 9 stress 0.2056893
## Run 10 stress 0.1892941
## ... New best solution
## ... Procrustes: rmse 1.719537e-05 max resid 5.140785e-05
## ... Similar to previous best
## Run 11 stress 0.1832883
## ... New best solution
## ... Procrustes: rmse 0.03756408 max resid 0.240882
## Run 12 stress 0.2072553
## Run 13 stress 0.2012439
## Run 14 stress 0.1893149
## Run 15 stress 0.1832884
## ... Procrustes: rmse 2.77944e-05 max resid 0.0001292051
## ... Similar to previous best
## Run 16 stress 0.1892941
## Run 17 stress 0.2026736
## Run 18 stress 0.2052709
## Run 19 stress 0.1899421
## Run 20 stress 0.1959399
## *** Best solution repeated 1 times
stressplot(NMDS)
plot(NMDS)
NMDS
##
## Call:
## metaMDS(comm = community_matrix_sqrt, distance = "bray", trymax = 100)
##
## global Multidimensional Scaling using monoMDS
##
## Data: community_matrix_sqrt
## Distance: bray
##
## Dimensions: 2
## Stress: 0.1832883
## Stress type 1, weak ties
## Best solution was repeated 1 time in 20 tries
## The best solution was from try 11 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'community_matrix_sqrt'
Make environmental table
#export file names
env <- as.data.frame(row.names(read_table_final))
#change column name
colnames(env)[colnames(env) == "row.names(read_table_final)"] <- "file"
#create year column
env$year <- substr(env$file, 1, 4)
#create month column
env$month <- substr(env$file, 5, 7)
#create station column
env$station <- str_extract(env$file, "P12|P22|P28|P38|P402|P4|P8")
#create run column
env$run <- str_extract(env$file, "48samples|-P1|-P2")
extract scores
data.scores = as.data.frame(scores(NMDS)$sites)
#add environmental columns
data.scores$station = env$station
data.scores$year = env$year
data.scores$month = env$month
data.scores$run = env$run
make hulls
P12 <- data.scores[data.scores$station == "P12", ][chull(data.scores[data.scores$station ==
"P12", c("NMDS1", "NMDS2")]), ]
P22 <- data.scores[data.scores$station == "P22", ][chull(data.scores[data.scores$station ==
"P22", c("NMDS1", "NMDS2")]), ]
P28 <- data.scores[data.scores$station == "P28", ][chull(data.scores[data.scores$station ==
"P28", c("NMDS1", "NMDS2")]), ]
P38 <- data.scores[data.scores$station == "P38", ][chull(data.scores[data.scores$station ==
"P38", c("NMDS1", "NMDS2")]), ]
P402 <- data.scores[data.scores$station == "P402", ][chull(data.scores[data.scores$station ==
"P402", c("NMDS1", "NMDS2")]), ]
P4 <- data.scores[data.scores$station == "P4", ][chull(data.scores[data.scores$station ==
"P4", c("NMDS1", "NMDS2")]), ]
P8 <- data.scores[data.scores$station == "P8", ][chull(data.scores[data.scores$station ==
"P8", c("NMDS1", "NMDS2")]), ]
hull.data <- rbind(P12,P22,P28,P38,P402,P4,P8)
hull.data
## NMDS1 NMDS2 station year month run
## 2 -0.112186590 -0.20038331 P12 2018 APR -P1
## 1 -0.004621414 -0.46531672 P12 2018 APR 48samples
## 34 -0.067219247 -0.48243337 P12 2019 APR 48samples
## 28 -0.648794815 -0.25970112 P12 2018 SEP -P2
## 15 -0.729772635 0.19024808 P12 2018 JUL -P1
## 14 -0.622680222 0.16189395 P12 2018 JUL 48samples
## 35 1.079475184 0.48107118 P22 2019 APR -P1
## 4 0.667278931 0.40447979 P22 2018 APR -P1
## 29 -0.097041549 0.43460112 P22 2018 SEP -P2
## 48 -0.258342116 0.57004444 P22 2019 JUL 48samples
## 17 0.212365195 0.64240115 P22 2018 JUL -P1
## 3 0.813363720 0.62487343 P22 2018 APR 48samples
## 5 0.177077983 0.12877803 P28 2018 APR 48samples
## 38 0.165371958 -0.17471828 P28 2019 APR -P1
## 30 -0.336447310 0.16148928 P28 2018 SEP -P2
## 18 -0.571945077 0.49853483 P28 2018 JUL 48samples
## 19 -0.523413743 0.56877836 P28 2018 JUL -P1
## 6 0.010912322 0.28216640 P28 2018 APR -P1
## 21 -0.233613126 -0.36011027 P38 2018 JUL -P1
## 31 -0.699579613 -0.04538073 P38 2018 SEP -P2
## 20 -0.563789366 0.06512851 P38 2018 JUL 48samples
## 22 -0.206922933 -0.06308197 P38 2018 JUL -P2
## 8 0.591186080 -0.87070343 P402 2018 APR 48samples
## 9 0.548797962 -0.94943279 P402 2018 APR -P1
## 32 -0.521996522 -0.79012961 P402 2018 SEP -P2
## 23 -0.566423990 0.18445570 P402 2018 JUL -P2
## 11 0.844780444 -0.44981442 P4 2018 APR -P1
## 10 0.666934421 -0.47573956 P4 2018 APR 48samples
## 33 -0.481717290 -0.52361003 P4 2018 SEP -P2
## 25 -0.562367133 0.33926174 P4 2018 JUL -P1
## 13 0.889512353 0.06190187 P8 2018 APR -P1
## 43 0.851043590 -0.06687479 P8 2019 APR 48samples
## 45 0.760543725 -0.08054055 P8 2019 APR -P1
## 44 0.579301397 -0.08084995 P8 2019 APR -P1
## 26 -0.454892841 0.15837750 P8 2018 JUL 48samples
## 27 0.114730656 0.30994575 P8 2018 JUL -P1
## 12 0.699510275 0.15350067 P8 2018 APR 48samples
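The seven per-station chull() blocks above could also be generated in a single pass; a compact alternative that builds the same hull rows (stored under a different name so the original objects are untouched):
#build convex-hull rows for every station at once
hull.data.alt <- do.call(rbind, lapply(split(data.scores, data.scores$station),
  function(d) d[chull(d$NMDS1, d$NMDS2), ]))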
station_plot = ggplot(data = data.scores, aes(x = NMDS1, y = NMDS2))+
geom_polygon(data=hull.data,aes(x=NMDS1,y=NMDS2,fill=station,group=station),alpha=0.30) +
geom_point(data = data.scores, aes(colour = station), size = 3, alpha = 0.5) +
scale_colour_manual(values = c("orange", "steelblue","darkgreen","violet","red","darkblue","limegreen")) +
scale_fill_manual(values = c("orange", "steelblue","darkgreen","violet","red","darkblue","limegreen"))+
theme(axis.title = element_text(size = 10, face = "bold", colour = "grey30"),
panel.background = element_blank(), panel.border = element_rect(fill = NA, colour = "grey30"),
axis.ticks = element_blank(), axis.text = element_blank(), legend.key = element_blank(),
legend.title = element_text(size = 10, face = "bold", colour = "grey30"),
legend.text = element_text(size = 9, colour = "grey30")) +
labs(colour = "Station")
station_plot
setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = station_plot, width = 5, height = 3, dpi = 300, filename = "station_plot.png")
There is quite a bit of overlap among stations, but you can see that some stations are distinct from one another. For example, P22 is different from P38, but P38 and P12 are very similar.
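The patterns seen in the ordination could also be tested statistically. This was not part of the original workflow, but one common approach would be a PERMANOVA on the same transformed matrix with vegan::adonis2 (a sketch, assuming the rows of community_matrix_sqrt are in the same order as env):
#hypothetical follow-up: test whether community composition differs among stations
adonis2(community_matrix_sqrt ~ station, data = env, method = "bray", permutations = 999)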
create year plot: make hulls
y2018 <- data.scores[data.scores$year == "2018", ][chull(data.scores[data.scores$year ==
"2018", c("NMDS1", "NMDS2")]), ]
y2019 <- data.scores[data.scores$year == "2019", ][chull(data.scores[data.scores$year ==
"2019", c("NMDS1", "NMDS2")]), ]
hull.data.2 <- rbind(y2018,y2019)
hull.data.2
## NMDS1 NMDS2 station year month run
## 11 0.84478044 -0.44981442 P4 2018 APR -P1
## 9 0.54879796 -0.94943279 P402 2018 APR -P1
## 32 -0.52199652 -0.79012961 P402 2018 SEP -P2
## 28 -0.64879482 -0.25970112 P12 2018 SEP -P2
## 31 -0.69957961 -0.04538073 P38 2018 SEP -P2
## 15 -0.72977264 0.19024808 P12 2018 JUL -P1
## 18 -0.57194508 0.49853483 P28 2018 JUL 48samples
## 19 -0.52341374 0.56877836 P28 2018 JUL -P1
## 17 0.21236519 0.64240115 P22 2018 JUL -P1
## 3 0.81336372 0.62487343 P22 2018 APR 48samples
## 13 0.88951235 0.06190187 P8 2018 APR -P1
## 43 0.85104359 -0.06687479 P8 2019 APR 48samples
## 34 -0.06721925 -0.48243337 P12 2019 APR 48samples
## 46 -0.51833742 -0.24861449 P12 2019 JUL 48samples
## 48 -0.25834212 0.57004444 P22 2019 JUL 48samples
## 35 1.07947518 0.48107118 P22 2019 APR -P1
year_plot = ggplot(data = data.scores, aes(x = NMDS1, y = NMDS2)) +
geom_polygon(data=hull.data.2,aes(x=NMDS1,y=NMDS2,fill=year,group=year),alpha=0.30)+
geom_point(data = data.scores, aes(colour = year), size = 3, alpha = 0.5) +
scale_colour_manual(values = c("orange", "steelblue")) +
scale_fill_manual(values = c("orange", "steelblue")) +
theme(axis.title = element_text(size = 10, face = "bold", colour = "grey30"),
panel.background = element_blank(), panel.border = element_rect(fill = NA, colour = "grey30"),
axis.ticks = element_blank(), axis.text = element_blank(), legend.key = element_blank(),
legend.title = element_text(size = 10, face = "bold", colour = "grey30"),
legend.text = element_text(size = 9, colour = "grey30")) +
labs(colour = "Year")
year_plot
setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = year_plot, width = 5, height = 3, dpi = 300, filename = "year_plot.png")
Between these two years, there doesn’t seem to be much difference
create month plot: make hulls
APR <- data.scores[data.scores$month == "APR", ][chull(data.scores[data.scores$month ==
"APR", c("NMDS1", "NMDS2")]), ]
JUL <- data.scores[data.scores$month == "JUL", ][chull(data.scores[data.scores$month ==
"JUL", c("NMDS1", "NMDS2")]), ]
SEP <- data.scores[data.scores$month == "SEP", ][chull(data.scores[data.scores$month ==
"SEP", c("NMDS1", "NMDS2")]), ]
hull.data.3 <- rbind(APR,JUL,SEP)
hull.data.3
## NMDS1 NMDS2 station year month run
## 11 0.84478044 -0.44981442 P4 2018 APR -P1
## 9 0.54879796 -0.94943279 P402 2018 APR -P1
## 34 -0.06721925 -0.48243337 P12 2019 APR 48samples
## 41 -0.41125703 -0.13959010 P38 2019 APR -P1
## 6 0.01091232 0.28216640 P28 2018 APR -P1
## 3 0.81336372 0.62487343 P22 2018 APR 48samples
## 35 1.07947518 0.48107118 P22 2019 APR -P1
## 21 -0.23361313 -0.36011027 P38 2018 JUL -P1
## 46 -0.51833742 -0.24861449 P12 2019 JUL 48samples
## 15 -0.72977264 0.19024808 P12 2018 JUL -P1
## 18 -0.57194508 0.49853483 P28 2018 JUL 48samples
## 19 -0.52341374 0.56877836 P28 2018 JUL -P1
## 17 0.21236519 0.64240115 P22 2018 JUL -P1
## 27 0.11473066 0.30994575 P8 2018 JUL -P1
## 32 -0.52199652 -0.79012961 P402 2018 SEP -P2
## 28 -0.64879482 -0.25970112 P12 2018 SEP -P2
## 31 -0.69957961 -0.04538073 P38 2018 SEP -P2
## 29 -0.09704155 0.43460112 P22 2018 SEP -P2
month_plot = ggplot(data = data.scores, aes(x = NMDS1, y = NMDS2)) +
geom_polygon(data=hull.data.3,aes(x=NMDS1,y=NMDS2,fill=month,group=month),alpha=0.30)+
geom_point(data = data.scores, aes(colour = month), size = 3, alpha = 0.5) +
scale_colour_manual(values = c("orange", "steelblue","darkgreen")) +
scale_fill_manual(values = c("orange", "steelblue","darkgreen")) +
theme(axis.title = element_text(size = 10, face = "bold", colour = "grey30"),
panel.background = element_blank(), panel.border = element_rect(fill = NA, colour = "grey30"),
axis.ticks = element_blank(), axis.text = element_blank(), legend.key = element_blank(),
legend.title = element_text(size = 10, face = "bold", colour = "grey30"),
legend.text = element_text(size = 9, colour = "grey30")) +
labs(colour = "Month")
month_plot
setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = month_plot, width = 5, height = 3, dpi = 300, filename = "month_plot.png")
There seem to be some clear seasonal differences along axis 1 - April appears to be distinct
create run plot
run_plot = ggplot(data = data.scores, aes(x = NMDS1, y = NMDS2)) +
geom_point(data = data.scores, aes(colour = run), size = 3, alpha = 0.5) +
scale_colour_manual(values = c("orange", "steelblue","darkgreen")) +
theme(axis.title = element_text(size = 10, face = "bold", colour = "grey30"),
panel.background = element_blank(), panel.border = element_rect(fill = NA, colour = "grey30"),
axis.ticks = element_blank(), axis.text = element_blank(), legend.key = element_blank(),
legend.title = element_text(size = 10, face = "bold", colour = "grey30"),
legend.text = element_text(size = 9, colour = "grey30")) +
labs(colour = "Run")
run_plot
setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = run_plot, width = 5, height = 3, dpi = 300, filename = "run_plot.png")
#convert to matrix
diversity<-as.matrix(read_table_final)
#convert to numeric matrix
diversity_matrix<- matrix(as.numeric(diversity),
ncol = ncol(diversity))
#add in row and column labels
colnames(diversity_matrix) <- colnames(diversity)
rownames(diversity_matrix) <- rownames(diversity)
get shannon diversity index
shannon_scores<-as.data.frame(diversity(diversity_matrix, index="shannon"))
add environment columns back in
#change column name
colnames(shannon_scores)[1] <- "shannon_index"
#add environmental columns
shannon_scores$station = env$station
shannon_scores$year = env$year
shannon_scores$month = env$month
shannon_scores$run = env$run
make violin plot
diversity<- ggviolin(shannon_scores, x = "station", y = "shannon_index",
add = "boxplot", fill = "station")
diversity
setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = diversity, width = 5, height = 3, dpi = 300, filename = "diversity.png")
Overall, the highest diversity was seen at P22, which is the station located in the Strait of Juan de Fuca. This site is closest to the ocean, so it may contain inland taxa as well as more offshore taxa.
get the top 20 most abundant species
#convert to proportions
diversity_prop<-diversity_matrix/rowSums(diversity_matrix)
species_means <- as.data.frame(colMeans(diversity_prop))
colnames(species_means)[1] <- "abundance"
species_means$abundance<-as.numeric(species_means$abundance)
species_means <- rownames_to_column(species_means, "species")
species_means <-as.data.frame(species_means[order(-species_means$abundance),])
#get top 20 species
top_20_species <- head(species_means, 20)
plot
top20<-ggplot(top_20_species, aes(y=abundance, x=reorder(species, abundance))) +
geom_bar(position="dodge", stat="identity")+ theme_bw()+coord_flip()+ylab("Mean Proportion of Reads")+xlab("species")
top20
setwd("/Users/hailaschultz/GitHub/haila-ZoopMetabarcoding/plots")
ggsave(plot = top20, width = 5, height = 3, dpi = 300, filename = "top20.png")
The samples were dominated by Calanus pacificus (a copepod) and Aegina citrea (a jellyfish). Other abundant species included Eutonina indicans and Clytia gregaria, two other jellyfish. Other common copepods, like Centropages abdominalis and Metridia pacifica, were also represented.