Introduction

This workflow describes how the analysis of data from Lexogen’s 3’ Quantseq technique generated on an Illumina Nextseq can be analysed using R/Bioconductor. The Lexogen approach generates reads from the 3’UTR of an mRNA transcript and can be used for count-based differential expression analysis. Lexogen Quantseq libraries cost ~£15 per sample, and this low cost allows more samples to be profiled for a given budget. Larger experiments permit more replicates, more time points, more compounds, or more cell lines to be included in an experiment, potentially making it more informative.

See the Lexogen website and a Nature Methods advertising feature for more information. This figure below is a schematic of the library preparation protocol:

lib prep

lib prep

Methods

The analysis is primary carried out using the ShortRead and RSubread Bioconductor packages for NGS analysis with ggplot2 and the cowplot extension package used for visualisation. The author thanks Martin Morgan, the ShortRead package author, for his assistance in developing this workflow.

Define locations

Locations of various files and directories.

    setwd('~/BigData/Scratch/2016Lexogen/')
    fastq_dir <- '/Users/pchapman/BigData/Lexogen/raw_fastq'
    ref_fa <- '/Users/pchapman/BigData/ENSEMBL/hs84/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa'
    ref_dir <- '/Users/pchapman/BigData/ENSEMBL/hs84'
    gtf.file <- file.path(ref_dir, "Homo_sapiens.GRCh38.84.gtf")
    sqlite_file <- 'Homo_sapiens.GRCh38.84.sqlite'
    sqlite_path <- file.path(ref_dir, sqlite_file)
    workflow_stats <- list()

Define some functions for plotting

Here we define some ggplot2 functions based on the qa function of the ShortRead package.

   qa_plots <- function(shortfq_obj) {
    x <- ShortRead::qa(shortfq_obj, lane='')
    nuc_plot <- ggplot(x[['perCycle']]$baseCall, aes(x=Cycle, y=Count, colour=Base)) + geom_line() + cowplot::theme_cowplot()
    qual_plot <- x[['perCycle']]$quality %>% group_by(Cycle) %>% summarise(QualScore=sum(Score*Count)/sum(Count)) %>% ungroup() %>% 
        ggplot(aes(x=Cycle, y=QualScore)) + geom_line() + cowplot::theme_cowplot()
    size_plot <- ggplot(data.frame(frag_size=width(shortfq_obj)), aes(x=frag_size)) + 
        geom_bar(fill='pink', colour='black') + cowplot::theme_cowplot()
    base_plot <- ggplot(data.frame(idx=1:50, dat=as.character(sread(shortfq_obj)[1:50]))) + 
        geom_text(aes(x=1, y=idx, label=dat), size=rel(2.5), family='mono', hjust=0) + xlim(1,1.8) +
        theme_bw() + theme(axis.text.x=element_text(size=0)) + xlab('')
    return(list(nuc_plot=nuc_plot, qual_plot=qual_plot, size_plot=size_plot, base_plot=base_plot))
} 

Import a sample of data

The ShortRead::yield function is used to load a subset of 1M 151bp single end Nextseq reads from a .fastq file.

fq_fn <- grep('DO9_', list.files(fastq_dir, 'fastq.gz'), value = TRUE)[1]
fq_path <- file.path(fastq_dir, fq_fn)
fq_data <- ShortRead::yield(FastqSampler(fq_path, 1000000))

Explore the raw data

We now generate 3 plots using the qa_plots function defined above: * Plot A depicts the proportion of reads with each nucleotide at positions 1-151. It is evident that G’s predominate towards the end of the reads, and A’s increase up to around 60bp before reducing. The read distribution seems to change markedly at around position 10. * Plot B depicts the average quality score at positions 1-151 across all reads. Quality reduces towards the end of the read with an uptick towards the very end, and the first 10bp seem to have lower quality than the second 10bp. * Plot C is a histogram of read lengths: all reads are 151bp at this stage.

qp <- qa_plots(fq_data)
## Warning: closing unused connection 5 (/Users/pchapman/BigData/Lexogen/
## raw_fastq/DO9_2_DMSO_ctrl_5Day_S2_L003_R1_001.fastq.gz)
cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')

Viewing a sample of 50 reads demonstrates what can be seen in the summary plots above. Many reads have strings of G’s towards the end, and poly-A tails are also in evidence. The NextSeq sequencing chemistry returns a G when there is very low signal, suggesting that in fact the increase in incidence of G’s is related to sequence quality rather than really being sequence related.

qp$base_plot