First trimming used qtrim for quality trimming by bbduk:
#reference files adapters and phix for removing contaminants
for file in *.fastq.gz
do
bbduk.sh -Xmx1g in=$file out=Trim_$file ref=adapters,phix ktrim=r k=23 mink=11 hdist=2
qtrim=rl trimq=10 ftm=5 maq=15 tpe tbo
done
Mapped a few samples to look at alignment rates post-trim
bbmap.sh in=qtrim10_maq10_3_204_P_S3_R1_001.fastq out=qtrim10_maq10_3_204_P.bam ref=~/Desktop/HokeLab_2019/Genomes/GCF_000633615.1_Guppy_female_1.0_MT_genomic.fna covstats=covstats.txt covhist=covhist.txt basecov=basecov.txt bincov=bincov.txt scafstats=scafstats.txt 1>>log.txt 2>&1
bbmap.sh in=reads.fq covstats=constats.txt covhist=covhist.txt basecov=basecov.txt bincov=bincov.txt scafstats=scafstats.txt
Above will generate per-scaffold coverage statistics, a histogram of coverage depth, and the precise coverage for every genomic base, as well as binned coverage which is easier to plot.
To output a bam file (if samtools is installed): bbmap.sh in=reads.fq out=mapped.bam
Plots are not all that informative, problems with overplotting
Code below does not run, files no longer exist, use for reference if wanting to plot scaffold or coverage stats from samtools –statistics
rm(list = ls())
library(ggplot2)
library(tidyverse)
#bin_cov <- read.csv(file = "../../200714_sRNA_fastq/07132020/bincov_hist.csv", header = F)
#cov_stats <- read.csv(file = "../../200714_sRNA_fastq/07132020/covStats_chroms.csv", header = T)
scafstats <- read.csv(file = "../../200714_sRNA_fastq/07132020/scafstats_LG_combined.csv", header = F)
colnames(scafstats) <- c("Chrom","Link_group","Unambig_reads_perc","Unambig_Mb","Ambig_reads_perc","Ambig_Mb","Unambig_reads","Ambig_reads","Assigned_reads","Assigned_bases","Filter_level")
scafstats %>%
select(Link_group, Filter_level, Unambig_reads, Ambig_reads,
Assigned_reads, Assigned_bases) -> scafstats
scafstats %>%
pivot_longer(Unambig_reads:Assigned_bases, names_to = "Metric", values_to = "Values" ) ->
scafstats_long
ggplot(scafstats_long, aes(x = Link_group, y = Values, fill = Filter_level, label = Values)) +
geom_col(position = position_dodge()) +
#facet_grid(rows = scafstats_long$Metric ~ .) +
ylab("Bases Mapped") +
theme_minimal()
As per email correspondence with Tai:
Will do another bbduk run,
The command where you trim with adapters and by quality is perfectly fine. That FastQC isn’t perfectly happy is expected. It’s a tool made for whole genome sequencing QC, you SHOULD see a number of “fail”s with any RNA-seq protocol. Steps that should always fail in RNA-seq:
per-base sequence content (there’s “random priming” performed that isn’t completely random, so the first ~8 bases show a typical bias pattern). per-sequence GC content (particularly in sRNA-seq you’re enriching for a subset of genes, they won’t have some nice GC distribution and that’s perfectly OK). Sequence duplication level (if this doesn’t fail then everything is uniformly expressed…which would be quite unlikely) Overrepresented sequences (as above, though this will be worse for sRNA-seq since reads will often cover the entire transcript, so there’s even less sequence diversity)
Drop quality trimming, using qval of 30 for average read quality
#reference files adapters and phix for removing contaminants
for file in *.fastq.gz
do
bbduk.sh -Xmx1g in=$file out=~/Desktop/HokeLab_2020/200714_sRNA_fastq/07132020/Trim_fastq/TrimMaq30_$file \
ref=adapters,phix ktrim=r k=23 mink=11 hdist=1 ftm=5 maq=30 tpe tbo 2>> trim.log
done
# make array containing names of all fastq files in dir
files_list=`ls`
bbduk_parallel ()
bbduk.sh -Xmx1g in=$file out=Trim_$file ref=adapters,phix ktrim=r k=23 mink=11 hdist=1 ftm=5 maq=30 tpe tbo 2>> trim.log
gzip_parallel () {
local b=$1
gzip trimAdap.$b\_R1.fastq trimAdap.$b\_R2.fastq
#gunzip Qtrim.$b.R1.fq Qtrim.$b.R2.fq
}
for b in $idList; do gzip_parallel "$b" & done