DCM contaminated site mtranscriptomes
1 overview
metatranscriptomes from a DCM contaminated site have been made available. The goal here is to make a targeted search for mec cassette genes.
1.1 the pipeline
General pipeline is:
- Quality trim with Trimmomatic
- Assemble with Trinity, include a digital normalization
- run tblastx against a database of Defo mec genes
- filter rRNA from the assembled transcriptome using barrnap and custom R scripting
- run kallisto via Trinity utility script to generate TPM
2 The data:
Files are all roughtly 0.5-1 Gb compressed. Only single-direction read sets are available.
3 FastQC
This is run through the GUI-based java applet
There are a huge number of reads (up to 40%) with a “Nextera Transposase Sequence” at the 3’ end. This is due to either fragmentation of the RNA read library or to high abundance of microRNA. It is all removable using Trimmomatic. Roughly 10-25 million reads per run
4 Trimmomatic
for f in ../CSI-MT/*
do
name=${f##*/}
trimmomatic SE \
"$f" \
../trimmed/"${name}" \
ILLUMINACLIP:/home/robert/Tools/Trimmomatic-0.38/adapters/NexteraPE-PE.fa:2:30:10 \
LEADING:3 TRAILING:3 \
SLIDINGWINDOW:6:25 MINLEN:25
done
over 95% of reads are retained. Adapter content is remedied by this approach.
5 Trinity
running a standard assembly with exclusion of read-normalization. The Trinity manual states that normalization is appropriate for very large data sets (over 300M reads).
for f in ../trimmed/*
do
name=${f##*/}
Trinity \
--seqType fq \
--max_memory 44G \
--no_normalize_reads \
--single "$f" \
--CPU 6 \
--output ../assemblies/"${name}"/trinity/
done
Assemblies took 1.5-2.5 hours each. Final assemblies were renamed manually and placed in a devoted folder for tblastx analysis
6 tBLASTx
the transcript contigs are searched against a database of Dehalobacterium formicoaceticum mec cassette proteins. Blast results tables are manually curated to retain only full-length hits and the best hit for each contig (tblastx will return several redundant hits across the same homologous regions)
note on Trinity naming redundancies many contigs are named almost identically and seem to be at least very similar, i.e. TRINITY_DN791_c0_g1_i2 and TRINITY_DN791_c0_g1_i6 this has been brought up with the developers (https://github.com/trinityrnaseq/trinityrnaseq/issues/275) and it is not clear why this happens or if these are actually distinct from one another according to their output description (https://github.com/trinityrnaseq/trinityrnaseq/wiki/Output-of-Trinity-Assembly), the final “i” designate “isoforms”. These are very similar transcripts, according to their definitions. There are ways to “condense” these, presumably I could just use a CD-hit style approach. For now I will consider them all.
Basic requirements for retaining hits: * must span at least 50% of the target gene length * e-value <0.001
Exceptions are made in cases where transcripts appear truncated or ORFs are split
Blast result files are carefully curated and combined into a single table of contigs and homologs.
7 rRNA read contamination
The most abundant rRNA transcripts tend to be vastly over represented. WW01I blastn results of most abundant transcripts:
- desulfovibrio 23S rRNA
- desulfosporosinus 16S rRNA partial sequence
- Trypanosoma 28S rRNA
- various bacteria 23S rRNA
- desulfosporosinus 23S rRNA
In this case, these top 5 make up 11% of the total data burden.
This is be handled post-assembly by using barrnap to detect rRNA:
for f in ../tblastx/queries/*
do
name=${f##*/}
barrnap -t 8 -o ../tblastx/rRNA/"$name" "$f"
done
evalue 1e-06, tag genes <0.8 expected length and reject genes less than 0.25 expected length. Guilty transcripts are removed here in R:
rnaSplit <- function(x){
strsplit(x,":")[[1]][3]
}
##ex: name <- unlist(lapply(name, splitfun))
y <- list.files("transcriptome/tblastx/queries/")
for (x in (1:length(y))) {
rna <- readDNAStringSet(paste("transcriptome/tblastx/rRNA/",y[x],sep=""))
rnames <- names(rna)
rnames <- unlist(lapply(rnames,rnaSplit))
trans <- readDNAStringSet(paste("transcriptome/tblastx/queries/", y[x],sep=""))
name <- names(trans)
name <- unlist(lapply(name,spcSplit))
seq <- paste(trans)
trans <- data.frame(name,seq)
print(paste(y[x],"has",nrow(trans),"transcripts prior to rRNA filtering"))
trans <- trans[!trans$name %in% rnames,]
print(paste(y[x],"has",nrow(trans),"transcripts after rRNA filtering"))
writeFasta(trans, paste("transcriptome/tblastx/filtered.transcripts/",y[x],sep = ""))
}8 relative expression calcualtions
This is performed using the Trinity utility script align_and_estimate_abundance.pl
I apply kallisto (alignment-free) method. A dedicated conda environment was created for this step, as kallisto is temperamental.
#!/bin/bash
r=(33I104_S17_L001_R1_001.fastq.gz \
47s104_S12_L001_R1_001.fastq.gz \
48I104_S13_L001_R1_001.fastq.gz \
GW01I99_S19_L001_R1_001.fastq.gz \
GW01I100p_S22_L001_R1_001.fastq.gz \
GW33I99p_S15_L001_R1_001.fastq.gz \
GW47s_100_S16_L001_R1_001.fastq.gz \
GW48I99_S18_L001_R1_001.fastq.gz \
WW01I_S20_L001_R1_001.fastq.gz \
WW42sp_S14_L001_R1_001.fastq.gz)
t=(33I104.Trinity.fasta \
47s104.Trinity.fasta \
48I104.Trinity.fasta \
GW01I99.Trinity.fasta \
GW01I100p.Trinity.fasta \
GW33I99p.Trinity.fasta \
GW47s.Trinity.fasta \
GW48I99Trinity.fasta \
WW01I.Trinity.fasta \
WW42sp.Trinity.fasta)
for x in {0..9}
do
/home/robert/miniconda3/bin/align_and_estimate_abundance.pl \
--prep_reference \
--transcripts ../tblastx/filtred.transcripts/"${t[x]}" \
--seqType fq \
--single ../CSI-MT/"${r[x]}" \
--est_method kallisto \
--output_dir ../kallisto/"${t[x]}" \
--trinity_mode \
--fragment_length 100 \
--fragment_std 20 \
--thread_count 6
done
gene expression terms explained here: https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/ Kallisto uses TPM, or “transcripts per kilobase million”.
Table 2 of this manuscript https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6011919/ has a nice clean way to present TPM and “quantile” expression.
Our samples are complex, with 5,000 - 10,000 transcripts detected per mtranscriptome. This surely represents just a miniscule fraction of the genes present in the community. Any detection should be regarded as important!
9 parsing all TPM results
most samples have multiple transcripts with mec genes. Start by making a table which sums each gene for each sample
library(openxlsx)
library(dplyr)
library(stringr)
files <- list.files("transcriptome/tblastx/refined.results")
sumTPMs <- data.frame(matrix(ncol=4))
colnames(sumTPMs) <- c("mec","TPM","n","sample")
for (x in c(1:length(files))) {
out <- read.xlsx(paste0("transcriptome/tblastx/refined.results/",files[x]))
mecsTPM <- out %>% group_by(mec) %>% summarise(TPM = (sum(TPM)))
mecsAllele <- out %>% group_by(mec) %>% tally()
mecs <- left_join(mecsTPM, mecsAllele)
mecs$sample <- str_remove(files[x],".Trinity.fasta.blast.xlsx")
sumTPMs <- rbind(sumTPMs, mecs)
}
sumTPMs <- sumTPMs[-1,]
datatable(sumTPMs)