SA.sabkha.combined.qiime2.rmd

0.1 Raw data
0.2 Initial multiplexed single-ended data import
1 Classifier
2 Metadata
3 Pipeline overview
- 3.1 Data statistics following import
4 quality filtering
- 4.1 results
- 4.2 chimera removal
5 Beta-diversity
- 5.1 weighted-UniFrac PCoA ordination
- 5.2 unweighted UniFrac PCoA ordination
6 Alpha-diversity
7 OTU tables
8 Conclusions/future work

The goal of this project is to combine two 454-based projects into one qiime2 analysis. This should be combined prior to the clustering stage. The new challenge presented here is how to take two multiplexed sff files, with redundant barcodes, and get them all into a single project.

convert sff to fastq, i.e.

sff2fastq ../data/031615MA515F.sff > ../reads/asq.fastq

place each into its own folder along with a barcode mapping file, zip, and import

gzip ../reads/asq.fastq

qiime tools import \
--type MultiplexedSingleEndBarcodeInSequence \
--input-path ../reads/asq.fastq.gz \
--output-path ../multiplex.reads.qza

demultiplex using cutadapt

qiime cutadapt demux-single \
--i-seqs ../multiplex.reads.qza \
--m-barcodes-file ../reads/metadata.tsv \
--m-barcodes-column Barcode \
--p-error-rate 0 \
--o-per-sample-sequences ../demultiplexed-seqs.qza \
--o-untrimmed-sequences ../untrimmed.qza \
--verbose

And write the read files, now demultiplexed

qiime tools export \
--input-path demultiplexed-seqs.qza \
--output-path inland.demultiplexed

This is done separateley, outside of the main script.

In summary:

convert sff to fastq
compress the fastq to fastq.gz
demultiplex the reads using cutadapt and a mapping/barcode file as guide
remov sequencing adapters using cutadapt
write reads as separate files

Some additional details below:

0.1 Raw data

Raw data is in “sff” format. attempt conversion using https://bioconda.github.io/recipes/sff2fastq/README.html

Installed into a new conda environment

scrpt for conversion is simple, see scripts

0.2 Initial multiplexed single-ended data import

This part is very tricky! The data in hand is in a format that is generally no longer used.

https://forum.qiime2.org/t/demultiplexing-and-trimming-adapters-from-reads-with-q2-cutadapt/2313

The first part of this tutorial describes how to get this taken care of. The sequences are multiplexed with barcodes still included. You must provide two files 1. raw data in fastq 2. metadata file which maps barcodes to sampleID. This file must be adapted from the old data files, a file called “mapping.txt”

1 Classifier

Silva132 qiime2 classifier downloaded from qiime2 website, using the 515-806 trimmed version. Due to an scikitlearn version conflict, I trained my own classifier locally using standard methods

2 Metadata

A combined metadata table is produced:

com.mdata <- read.csv("Sample_metadata.txt",header = T,sep = "\t")
datatable(com.mdata)

This represents only data which was shared between the two projects.

Some values were at saturation.

The saturation value for K at sample S2.20 was simply guessed at 15,000

3 Pipeline overview

Demultiplexed read files are combined into the same directory

Generally, the pipeline consists of

quality trimming with length filtering
open-reference (vs. silva v132) vsearch clustering at 99.0%
classify with silva classifier qiime2 format
generation of alpha and beta-diveristy analyses
combining and exporting relevant results files

3.1 Data statistics following import

These are harvested from the demux summarize file

per.sample.count <- read.csv("per-sample-fastq-counts.csv")
datatable(per.sample.count)

4 quality filtering

Apply a q-filter of 20, which will trim reads based on a sliding window approach whereby if 3 bp are located at 20 qxcore or below, the read is truncated. This also applies a 75% truncation rule, whereby if 75% of the read must remain after being trimmed in order for the read to be retained.

4.1 results

qfil <- read.csv("qtrim.stats.csv", stringsAsFactors = F)
qfil <- qfil[-1,]
qfil$total.input.reads <- as.numeric(qfil$total.input.reads)
qfil$total.retained.reads <- as.numeric(qfil$total.retained.reads)
datatable(qfil)

Total read count has definitely dropped, by about 50% in most cases.

print(paste( sum(qfil$total.retained.reads)/sum(qfil$total.input.reads) ,"of the reads were retained after quality trimming"))

## [1] "0.501296204281795 of the reads were retained after quality trimming"

There are several steps in the background, see the script for details.

4.2 chimera removal

chim <- read.csv("per-sample-non-chim.csv", stringsAsFactors = F, header = F)
colnames(chim) <- c("sample.id", "non.chimeric")
chim$`non.chimeric` <- as.numeric(chim$`non.chimeric`)
chim <- left_join(qfil,chim)
chim <- chim[-c(4,5,6)]
datatable(chim)

print(paste( sum(chim$non.chimeric)/sum(chim$total.retained.reads) ,"of the reads were retained after chimeara filtering"))

## [1] "0.970027365381715 of the reads were retained after chimeara filtering"

5 Beta-diversity

5.1 weighted-UniFrac PCoA ordination

orange=inland, red=coastal_1, blue=coastal_2

5.2 unweighted UniFrac PCoA ordination

In the unweighted UniFrac based ordination, the coastal and inland sites do not separate cleanly, indicating many shared closely-related taxa. Note the weakness of the axes in the unweighted version (says that community structure differences are less clear, much more multi-factorial). When ordinated based on a weighted distance matrix, which takes relative abundances of taxa into account, the sites become more clear; the coastal columns cluster apart from the inland site, unsurprisingly. In the weighted version, two axes explain over 65% of community structure (vs. 21% in unweighted version). However, there is remarkable overlap/similarity along the main axis of variation between inland and some samples in the coastal sites, meaning that there is an imporant “sub-community” of similar taxa at similar relative abundances. I could not determine any physical/chemical parameters that might explain this similarity.

6 Alpha-diversity

6.1 Observed OTUs

6.2 Shannon Diversity

at 400 sampling depth, diversity is leveled off decently
OTU count (richness) is rising drastically
Inland site is more diverse than coastal sites overall

6.3 OTU-tables

All tables are provied in table format. A broad overview is taken here.

“Phylum level”

The overall difference between inland and coastal is striking, with Firmicutes being the most dominant in inland sites and Proteobacteria dominating coastal. Perhaps explained by water levels and increased salinity.

Difference between the two coastal columns is also obvious, with heavy cyanobacterial influence (pink: this is actually just chloroplast mostly, will have to be screened out?) at site 1 and heavy Euryarchaeota/Gemmatimonades co-dominance at site 2, although this is not a consistent trend.

6.4 Interesting observations

6.4.1 Gemmatimodates

D_0__Bacteria;D_1__Gemmatimonadetes;D_2__PAUC43f marine benthic group;D_3__uncultured bacterium;D_4__uncultured bacterium;D_5__uncultured bacterium

This taxon is a major player in coastal column 2 (up to 30%) and also shows up in top two layers of inland sample, making up 2-7% of the community. The actual sequences have only a partial overal (mostly different OTUs). This is a Class-level taxon, so the underlying sequences themselves might be quite different, there is not way to tell without deeper phylogenetic analysis. This clearly shows a marine influence on the inland sites. Other parallels might exist. Especially interesting would be other clearly marine taxa being found to lesser degrees in the inland site.

Hypothesis

Coastal site 2 is closer to the inland site than coastal site 1. Phylogeny of inland site PAUC43f is close to coastal site 2.

Additionally, there appears to be a correlation between PAUC43F and Halobacteria (an Archael class)

gem.halo <- read.csv("analysis/PAUC43.halobacteria.csv")
library(ggplot2)
ggplot(gem.halo, aes(x=PAUC43f, y=Halobacteria)) + geom_point()

The pattern is not as striking when plotted perhaps, but is worthy of mathematical analysis.

6.5 Additional observations

casual observations on particular taxa, fishing for stories

6.5.1 Herbaspirillum

Found almost exclusively in the coastal sites and is a major player in most samples

6.5.2 Escherichia-Shigella

Found almost exclusively in inland site, 5% in two of the samples, maybe indicates fecal influence?

6.5.3 Alcanivorax

Found in uppermost layer of inland site and throughout coastal site 1. Generally considered a marine bacterium. Perhaps another instance of marine influence

6.5.4 Acinetobacter

Same OTUs are found at all sites

6.5.5 Pseudomonas

Very important in coastal sites but largely irrelevant inland.

6.5.6 Staphylococcus and Streptococcus

Both are very important in inland sites only, although same OTUs are found across the inland sites

6.5.7 Lawsonella

Two OTUs are imporant in inland sites 4 and 6 and also found throughout the coastal sites

6.5.8 Corynebacterium

one OTU imporant in inland sites 2 and 4 and also throughout coastal sites

6.5.9 Acetothermia

Found at inland sites 1,2,3 and at low numbers coastal, but different OTUs

6.5.10 Hadesarchaeota

Entirely unique to the inland site, especially top two depths

6.5.11 Halobacteria

Enormous diversity, important coastal, not present inland

7 OTU tables

Provided in data packet (not shown here)

8 Conclusions/future work

We should consider finding a way to filter out chloroplasts at some point in the analysis. This does not inform the community analysis, might actually dominate one of the axes of ordination.
There is a way to pull out sub-communities, to see which taxa correlate with particular axes of variation. I don’t know how to do this though, its worth looking into perhaps? We could more easily find the “core community” among the sites represented by the left-side clustering of samples on axis 1 in the weighted unifrac ordination.