1 overview

metatranscriptomes from a DCM contaminated site have been made available. The goal here is to make a targeted search for mec cassette genes.

1.1 the pipeline

General pipeline is:

Quality trim with Trimmomatic
Assemble with Trinity, include a digital normalization
run tblastx against a database of Defo mec genes
filter rRNA from the assembled transcriptome using barrnap and custom R scripting
run kallisto via Trinity utility script to generate TPM

2 The data:

files <- list.files("transcriptome/CSI-MT")
files <- data.frame(files)
datatable(files)

Files are all roughtly 0.5-1 Gb compressed. Only single-direction read sets are available.

3 FastQC

This is run through the GUI-based java applet

There are a huge number of reads (up to 40%) with a “Nextera Transposase Sequence” at the 3’ end. This is due to either fragmentation of the RNA read library or to high abundance of microRNA. It is all removable using Trimmomatic. Roughly 10-25 million reads per run

4 Trimmomatic

for f in ../CSI-MT/*
    do
    name=${f##*/}
    trimmomatic SE \
    "$f" \
    ../trimmed/"${name}" \
    ILLUMINACLIP:/home/robert/Tools/Trimmomatic-0.38/adapters/NexteraPE-PE.fa:2:30:10 \
    LEADING:3 TRAILING:3 \
    SLIDINGWINDOW:6:25 MINLEN:25
done

over 95% of reads are retained. Adapter content is remedied by this approach.

5 Trinity

running a standard assembly with exclusion of read-normalization. The Trinity manual states that normalization is appropriate for very large data sets (over 300M reads).

for f in ../trimmed/*
    do
    name=${f##*/}
    Trinity \
    --seqType fq \
    --max_memory 44G \
    --no_normalize_reads \
    --single "$f" \
    --CPU 6 \
    --output ../assemblies/"${name}"/trinity/
done

Assemblies took 1.5-2.5 hours each. Final assemblies were renamed manually and placed in a devoted folder for tblastx analysis

6 tBLASTx

the transcript contigs are searched against a database of Dehalobacterium formicoaceticum mec cassette proteins. Blast results tables are manually curated to retain only full-length hits and the best hit for each contig (tblastx will return several redundant hits across the same homologous regions)

note on Trinity naming redundancies many contigs are named almost identically and seem to be at least very similar, i.e. TRINITY_DN791_c0_g1_i2 and TRINITY_DN791_c0_g1_i6 this has been brought up with the developers (https://github.com/trinityrnaseq/trinityrnaseq/issues/275) and it is not clear why this happens or if these are actually distinct from one another according to their output description (https://github.com/trinityrnaseq/trinityrnaseq/wiki/Output-of-Trinity-Assembly), the final “i” designate “isoforms”. These are very similar transcripts, according to their definitions. There are ways to “condense” these, presumably I could just use a CD-hit style approach. For now I will consider them all.

Basic requirements for retaining hits: * must span at least 50% of the target gene length * e-value <0.001

Exceptions are made in cases where transcripts appear truncated or ORFs are split

Blast result files are carefully curated and combined into a single table of contigs and homologs.

7 rRNA read contamination

The most abundant rRNA transcripts tend to be vastly over represented. WW01I blastn results of most abundant transcripts:

desulfovibrio 23S rRNA
desulfosporosinus 16S rRNA partial sequence
Trypanosoma 28S rRNA
various bacteria 23S rRNA
desulfosporosinus 23S rRNA

In this case, these top 5 make up 11% of the total data burden.

This is be handled post-assembly by using barrnap to detect rRNA:

for f in ../tblastx/queries/*
    do
    name=${f##*/}
    barrnap -t 8 -o ../tblastx/rRNA/"$name" "$f"
done

evalue 1e-06, tag genes <0.8 expected length and reject genes less than 0.25 expected length. Guilty transcripts are removed here in R:

rnaSplit <- function(x){
  strsplit(x,":")[[1]][3]
}
##ex: name <- unlist(lapply(name, splitfun))

y <- list.files("transcriptome/tblastx/queries/")
for (x in (1:length(y))) {
  rna <- readDNAStringSet(paste("transcriptome/tblastx/rRNA/",y[x],sep=""))
  rnames <- names(rna)
  rnames <- unlist(lapply(rnames,rnaSplit))
  trans <- readDNAStringSet(paste("transcriptome/tblastx/queries/", y[x],sep=""))
  name <- names(trans)
  name <- unlist(lapply(name,spcSplit))
  seq <- paste(trans)
  trans <- data.frame(name,seq)
  print(paste(y[x],"has",nrow(trans),"transcripts prior to rRNA filtering"))
  trans <- trans[!trans$name %in% rnames,]
  print(paste(y[x],"has",nrow(trans),"transcripts after rRNA filtering"))
  writeFasta(trans, paste("transcriptome/tblastx/filtered.transcripts/",y[x],sep = ""))
}

8 relative expression calcualtions

This is performed using the Trinity utility script align_and_estimate_abundance.pl

I apply kallisto (alignment-free) method. A dedicated conda environment was created for this step, as kallisto is temperamental.

#!/bin/bash

r=(33I104_S17_L001_R1_001.fastq.gz \
47s104_S12_L001_R1_001.fastq.gz \
48I104_S13_L001_R1_001.fastq.gz \
GW01I99_S19_L001_R1_001.fastq.gz \
GW01I100p_S22_L001_R1_001.fastq.gz \
GW33I99p_S15_L001_R1_001.fastq.gz \
GW47s_100_S16_L001_R1_001.fastq.gz \
GW48I99_S18_L001_R1_001.fastq.gz \
WW01I_S20_L001_R1_001.fastq.gz \
WW42sp_S14_L001_R1_001.fastq.gz)

t=(33I104.Trinity.fasta \
47s104.Trinity.fasta \
48I104.Trinity.fasta \
GW01I99.Trinity.fasta \
GW01I100p.Trinity.fasta \
GW33I99p.Trinity.fasta \
GW47s.Trinity.fasta \
GW48I99Trinity.fasta \
WW01I.Trinity.fasta \
WW42sp.Trinity.fasta)

for x in {0..9}
    do
    /home/robert/miniconda3/bin/align_and_estimate_abundance.pl \
    --prep_reference \
    --transcripts ../tblastx/filtred.transcripts/"${t[x]}" \
    --seqType fq \
    --single ../CSI-MT/"${r[x]}" \
    --est_method kallisto \
    --output_dir ../kallisto/"${t[x]}" \
    --trinity_mode \
    --fragment_length 100 \
    --fragment_std 20 \
    --thread_count 6
done

gene expression terms explained here: https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/ Kallisto uses TPM, or “transcripts per kilobase million”.

Table 2 of this manuscript https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6011919/ has a nice clean way to present TPM and “quantile” expression.

Our samples are complex, with 5,000 - 10,000 transcripts detected per mtranscriptome. This surely represents just a miniscule fraction of the genes present in the community. Any detection should be regarded as important!

9 parsing all TPM results

most samples have multiple transcripts with mec genes. Start by making a table which sums each gene for each sample

library(openxlsx)
library(dplyr)
library(stringr)
files <- list.files("transcriptome/tblastx/refined.results")
sumTPMs <- data.frame(matrix(ncol=4))
colnames(sumTPMs) <- c("mec","TPM","n","sample")

for (x in c(1:length(files))) {
  out <- read.xlsx(paste0("transcriptome/tblastx/refined.results/",files[x]))
  mecsTPM <- out %>% group_by(mec) %>% summarise(TPM = (sum(TPM)))
  mecsAllele <- out %>% group_by(mec) %>% tally()
  mecs <- left_join(mecsTPM, mecsAllele)
  mecs$sample <- str_remove(files[x],".Trinity.fasta.blast.xlsx")
  sumTPMs <- rbind(sumTPMs, mecs)
}

sumTPMs <- sumTPMs[-1,]

datatable(sumTPMs)

DCM contaminated site mtranscriptomes