sRNA adapter and quality trimming

First trimming used qtrim for quality trimming by bbduk:

BBTools User Guide


#reference files adapters and phix for removing contaminants

for file in *.fastq.gz
do

    bbduk.sh -Xmx1g in=$file out=Trim_$file ref=adapters,phix ktrim=r k=23 mink=11 hdist=2
    qtrim=rl trimq=10 ftm=5 maq=15 tpe tbo

done

Mapped a few samples to look at alignment rates post-trim


bbmap.sh in=qtrim10_maq10_3_204_P_S3_R1_001.fastq out=qtrim10_maq10_3_204_P.bam ref=~/Desktop/HokeLab_2019/Genomes/GCF_000633615.1_Guppy_female_1.0_MT_genomic.fna covstats=covstats.txt covhist=covhist.txt basecov=basecov.txt bincov=bincov.txt scafstats=scafstats.txt 1>>log.txt 2>&1

General bbmap.sh commands (ref)

bbmap.sh in=reads.fq covstats=constats.txt covhist=covhist.txt basecov=basecov.txt bincov=bincov.txt scafstats=scafstats.txt

Above will generate per-scaffold coverage statistics, a histogram of coverage depth, and the precise coverage for every genomic base, as well as binned coverage which is easier to plot.

To output a bam file (if samtools is installed): bbmap.sh in=reads.fq out=mapped.bam

Attempting to plot scaffold stats

Plots are not all that informative, problems with overplotting

Code below does not run, files no longer exist, use for reference if wanting to plot scaffold or coverage stats from samtools –statistics

rm(list = ls())
library(ggplot2)
library(tidyverse)

#bin_cov <- read.csv(file = "../../200714_sRNA_fastq/07132020/bincov_hist.csv", header = F)
#cov_stats <- read.csv(file = "../../200714_sRNA_fastq/07132020/covStats_chroms.csv", header = T)
scafstats <- read.csv(file = "../../200714_sRNA_fastq/07132020/scafstats_LG_combined.csv", header = F)

colnames(scafstats) <- c("Chrom","Link_group","Unambig_reads_perc","Unambig_Mb","Ambig_reads_perc","Ambig_Mb","Unambig_reads","Ambig_reads","Assigned_reads","Assigned_bases","Filter_level")
  
scafstats %>%
  select(Link_group, Filter_level, Unambig_reads, Ambig_reads, 
         Assigned_reads, Assigned_bases) -> scafstats

scafstats %>%
  pivot_longer(Unambig_reads:Assigned_bases, names_to = "Metric", values_to = "Values" ) ->
  scafstats_long

ggplot(scafstats_long, aes(x = Link_group, y = Values, fill = Filter_level, label = Values)) +
  geom_col(position = position_dodge()) +
  #facet_grid(rows = scafstats_long$Metric ~ .) +
  ylab("Bases Mapped") +
  theme_minimal()

As per email correspondence with Tai:

  1. While you definitely want to trim adapters, you don’t want to trim low-quality bases because this could trim off bases from the small RNA itself yielding a partial sequence rather than the full length small RNA (a minor issue since this will be rare in high quality libraries). Filtering on average quality is good. I suggest an avg Q value of 30.

Will do another bbduk run,


A note on fastqc for small RNA quality control

The command where you trim with adapters and by quality is perfectly fine. That FastQC isn’t perfectly happy is expected. It’s a tool made for whole genome sequencing QC, you SHOULD see a number of “fail”s with any RNA-seq protocol. Steps that should always fail in RNA-seq:

per-base sequence content (there’s “random priming” performed that isn’t completely random, so the first ~8 bases show a typical bias pattern). per-sequence GC content (particularly in sRNA-seq you’re enriching for a subset of genes, they won’t have some nice GC distribution and that’s perfectly OK). Sequence duplication level (if this doesn’t fail then everything is uniformly expressed…which would be quite unlikely) Overrepresented sequences (as above, though this will be worse for sRNA-seq since reads will often cover the entire transcript, so there’s even less sequence diversity)


sRNA adapter and quality trimming - 2nd run

Drop quality trimming, using qval of 30 for average read quality


#reference files adapters and phix for removing contaminants

for file in *.fastq.gz
do

    bbduk.sh -Xmx1g in=$file out=~/Desktop/HokeLab_2020/200714_sRNA_fastq/07132020/Trim_fastq/TrimMaq30_$file \
    ref=adapters,phix ktrim=r k=23 mink=11 hdist=1 ftm=5 maq=30 tpe tbo 2>> trim.log

done

# make array containing names of all fastq files in dir
files_list=`ls`

bbduk_parallel ()

bbduk.sh -Xmx1g in=$file out=Trim_$file ref=adapters,phix ktrim=r k=23 mink=11 hdist=1 ftm=5 maq=30 tpe tbo 2>> trim.log




gzip_parallel () {

       local b=$1

        gzip trimAdap.$b\_R1.fastq trimAdap.$b\_R2.fastq
           
        #gunzip Qtrim.$b.R1.fq Qtrim.$b.R2.fq     
      
}


for b in $idList; do gzip_parallel "$b" & done