SA.sabkha.combined.qiime2.rmd
SA.sabkha.combined.qiime2.rmd
The goal of this project is to combine two 454-based projects into one qiime2 analysis. This should be combined prior to the clustering stage. The new challenge presented here is how to take two multiplexed sff files, with redundant barcodes, and get them all into a single project.
convert sff to fastq, i.e.
sff2fastq ../data/031615MA515F.sff > ../reads/asq.fastqplace each into its own folder along with a barcode mapping file, zip, and import
gzip ../reads/asq.fastq
qiime tools import \
--type MultiplexedSingleEndBarcodeInSequence \
--input-path ../reads/asq.fastq.gz \
--output-path ../multiplex.reads.qza
- demultiplex using cutadapt
qiime cutadapt demux-single \
--i-seqs ../multiplex.reads.qza \
--m-barcodes-file ../reads/metadata.tsv \
--m-barcodes-column Barcode \
--p-error-rate 0 \
--o-per-sample-sequences ../demultiplexed-seqs.qza \
--o-untrimmed-sequences ../untrimmed.qza \
--verbose
And write the read files, now demultiplexed
qiime tools export \
--input-path demultiplexed-seqs.qza \
--output-path inland.demultiplexed
This is done separateley, outside of the main script.
In summary:
- convert sff to fastq
- compress the fastq to fastq.gz
- demultiplex the reads using cutadapt and a mapping/barcode file as guide
- remov sequencing adapters using cutadapt
- write reads as separate files
Some additional details below:
0.1 Raw data
Raw data is in “sff” format. attempt conversion using https://bioconda.github.io/recipes/sff2fastq/README.html
Installed into a new conda environment
scrpt for conversion is simple, see scripts
0.2 Initial multiplexed single-ended data import
This part is very tricky! The data in hand is in a format that is generally no longer used.
https://forum.qiime2.org/t/demultiplexing-and-trimming-adapters-from-reads-with-q2-cutadapt/2313
The first part of this tutorial describes how to get this taken care of. The sequences are multiplexed with barcodes still included. You must provide two files 1. raw data in fastq 2. metadata file which maps barcodes to sampleID. This file must be adapted from the old data files, a file called “mapping.txt”
1 Classifier
Silva132 qiime2 classifier downloaded from qiime2 website, using the 515-806 trimmed version. Due to an scikitlearn version conflict, I trained my own classifier locally using standard methods
2 Metadata
A combined metadata table is produced:
com.mdata <- read.csv("Sample_metadata.txt",header = T,sep = "\t")
datatable(com.mdata)This represents only data which was shared between the two projects.
Some values were at saturation.
The saturation value for K at sample S2.20 was simply guessed at 15,000
3 Pipeline overview
Demultiplexed read files are combined into the same directory
Generally, the pipeline consists of
- quality trimming with length filtering
- open-reference (vs. silva v132) vsearch clustering at 99.0%
- classify with silva classifier qiime2 format
- generation of alpha and beta-diveristy analyses
- combining and exporting relevant results files
3.1 Data statistics following import
These are harvested from the demux summarize file
per.sample.count <- read.csv("per-sample-fastq-counts.csv")
datatable(per.sample.count)4 quality filtering
Apply a q-filter of 20, which will trim reads based on a sliding window approach whereby if 3 bp are located at 20 qxcore or below, the read is truncated. This also applies a 75% truncation rule, whereby if 75% of the read must remain after being trimmed in order for the read to be retained.
4.1 results
qfil <- read.csv("qtrim.stats.csv", stringsAsFactors = F)
qfil <- qfil[-1,]
qfil$total.input.reads <- as.numeric(qfil$total.input.reads)
qfil$total.retained.reads <- as.numeric(qfil$total.retained.reads)
datatable(qfil)Total read count has definitely dropped, by about 50% in most cases.
print(paste( sum(qfil$total.retained.reads)/sum(qfil$total.input.reads) ,"of the reads were retained after quality trimming"))## [1] "0.501296204281795 of the reads were retained after quality trimming"
There are several steps in the background, see the script for details.
4.2 chimera removal
chim <- read.csv("per-sample-non-chim.csv", stringsAsFactors = F, header = F)
colnames(chim) <- c("sample.id", "non.chimeric")
chim$`non.chimeric` <- as.numeric(chim$`non.chimeric`)
chim <- left_join(qfil,chim)
chim <- chim[-c(4,5,6)]
datatable(chim)print(paste( sum(chim$non.chimeric)/sum(chim$total.retained.reads) ,"of the reads were retained after chimeara filtering"))## [1] "0.970027365381715 of the reads were retained after chimeara filtering"
5 Beta-diversity
5.1 weighted-UniFrac PCoA ordination
orange=inland, red=coastal_1, blue=coastal_2
5.2 unweighted UniFrac PCoA ordination
In the unweighted UniFrac based ordination, the coastal and inland sites do not separate cleanly, indicating many shared closely-related taxa. Note the weakness of the axes in the unweighted version (says that community structure differences are less clear, much more multi-factorial). When ordinated based on a weighted distance matrix, which takes relative abundances of taxa into account, the sites become more clear; the coastal columns cluster apart from the inland site, unsurprisingly. In the weighted version, two axes explain over 65% of community structure (vs. 21% in unweighted version). However, there is remarkable overlap/similarity along the main axis of variation between inland and some samples in the coastal sites, meaning that there is an imporant “sub-community” of similar taxa at similar relative abundances. I could not determine any physical/chemical parameters that might explain this similarity.
6 Alpha-diversity
6.1 Observed OTUs
6.2 Shannon Diversity
- at 400 sampling depth, diversity is leveled off decently
- OTU count (richness) is rising drastically
- Inland site is more diverse than coastal sites overall
6.3 OTU-tables
All tables are provied in table format. A broad overview is taken here.
“Phylum level”
The overall difference between inland and coastal is striking, with Firmicutes being the most dominant in inland sites and Proteobacteria dominating coastal. Perhaps explained by water levels and increased salinity.
Difference between the two coastal columns is also obvious, with heavy cyanobacterial influence (pink: this is actually just chloroplast mostly, will have to be screened out?) at site 1 and heavy Euryarchaeota/Gemmatimonades co-dominance at site 2, although this is not a consistent trend.
6.4 Interesting observations
6.4.1 Gemmatimodates
D_0__Bacteria;D_1__Gemmatimonadetes;D_2__PAUC43f marine benthic group;D_3__uncultured bacterium;D_4__uncultured bacterium;D_5__uncultured bacterium
This taxon is a major player in coastal column 2 (up to 30%) and also shows up in top two layers of inland sample, making up 2-7% of the community. The actual sequences have only a partial overal (mostly different OTUs). This is a Class-level taxon, so the underlying sequences themselves might be quite different, there is not way to tell without deeper phylogenetic analysis. This clearly shows a marine influence on the inland sites. Other parallels might exist. Especially interesting would be other clearly marine taxa being found to lesser degrees in the inland site.
Hypothesis
Coastal site 2 is closer to the inland site than coastal site 1. Phylogeny of inland site PAUC43f is close to coastal site 2.
Additionally, there appears to be a correlation between PAUC43F and Halobacteria (an Archael class)
gem.halo <- read.csv("analysis/PAUC43.halobacteria.csv")
library(ggplot2)
ggplot(gem.halo, aes(x=PAUC43f, y=Halobacteria)) + geom_point()The pattern is not as striking when plotted perhaps, but is worthy of mathematical analysis.
6.5 Additional observations
casual observations on particular taxa, fishing for stories
6.5.1 Herbaspirillum
Found almost exclusively in the coastal sites and is a major player in most samples
6.5.2 Escherichia-Shigella
Found almost exclusively in inland site, 5% in two of the samples, maybe indicates fecal influence?
6.5.3 Alcanivorax
Found in uppermost layer of inland site and throughout coastal site 1. Generally considered a marine bacterium. Perhaps another instance of marine influence
6.5.4 Acinetobacter
Same OTUs are found at all sites
6.5.5 Pseudomonas
Very important in coastal sites but largely irrelevant inland.
6.5.6 Staphylococcus and Streptococcus
Both are very important in inland sites only, although same OTUs are found across the inland sites
6.5.7 Lawsonella
Two OTUs are imporant in inland sites 4 and 6 and also found throughout the coastal sites
6.5.8 Corynebacterium
one OTU imporant in inland sites 2 and 4 and also throughout coastal sites
6.5.9 Acetothermia
Found at inland sites 1,2,3 and at low numbers coastal, but different OTUs
6.5.10 Hadesarchaeota
Entirely unique to the inland site, especially top two depths
6.5.11 Halobacteria
Enormous diversity, important coastal, not present inland
7 OTU tables
Provided in data packet (not shown here)
8 Conclusions/future work
- We should consider finding a way to filter out chloroplasts at some point in the analysis. This does not inform the community analysis, might actually dominate one of the axes of ordination.
- There is a way to pull out sub-communities, to see which taxa correlate with particular axes of variation. I don’t know how to do this though, its worth looking into perhaps? We could more easily find the “core community” among the sites represented by the left-side clustering of samples on axis 1 in the weighted unifrac ordination.