Analysis of Lexogen Quantseq Data: Part 1

Introduction
Methods
Define locations
Define some functions for plotting
Import a sample of data
Explore the raw data
Remove the 5’ Adapter
Remove poor quality data at the 3’ end of the read
Trim poly-A tails
Trim embedded poly-A tails
Align to reference genome using Rsubread
Count reads mapping to features
Count reads mapping to protein-coding genes
Evaluate the performance of the workflow
Conclusion
Session Info

Introduction

This workflow describes how the analysis of data from Lexogen’s 3’ Quantseq technique generated on an Illumina Nextseq can be analysed using R/Bioconductor. The Lexogen approach generates reads from the 3’UTR of an mRNA transcript and can be used for count-based differential expression analysis. Lexogen Quantseq libraries cost ~£15 per sample, and this low cost allows more samples to be profiled for a given budget. Larger experiments permit more replicates, more time points, more compounds, or more cell lines to be included in an experiment, potentially making it more informative.

See the Lexogen website and a Nature Methods advertising feature for more information. This figure below is a schematic of the library preparation protocol:

lib prep

Methods

The analysis is primary carried out using the ShortRead and RSubread Bioconductor packages for NGS analysis with ggplot2 and the cowplot extension package used for visualisation. The author thanks Martin Morgan, the ShortRead package author, for his assistance in developing this workflow.

Define locations

Locations of various files and directories.

    setwd('~/BigData/Scratch/2016Lexogen/')
    fastq_dir <- '/Users/pchapman/BigData/Lexogen/raw_fastq'
    ref_fa <- '/Users/pchapman/BigData/ENSEMBL/hs84/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa'
    ref_dir <- '/Users/pchapman/BigData/ENSEMBL/hs84'
    gtf.file <- file.path(ref_dir, "Homo_sapiens.GRCh38.84.gtf")
    sqlite_file <- 'Homo_sapiens.GRCh38.84.sqlite'
    sqlite_path <- file.path(ref_dir, sqlite_file)
    workflow_stats <- list()

Define some functions for plotting

Here we define some ggplot2 functions based on the qa function of the ShortRead package.

   qa_plots <- function(shortfq_obj) {
    x <- ShortRead::qa(shortfq_obj, lane='')
    nuc_plot <- ggplot(x[['perCycle']]$baseCall, aes(x=Cycle, y=Count, colour=Base)) + geom_line() + cowplot::theme_cowplot()
    qual_plot <- x[['perCycle']]$quality %>% group_by(Cycle) %>% summarise(QualScore=sum(Score*Count)/sum(Count)) %>% ungroup() %>% 
        ggplot(aes(x=Cycle, y=QualScore)) + geom_line() + cowplot::theme_cowplot()
    size_plot <- ggplot(data.frame(frag_size=width(shortfq_obj)), aes(x=frag_size)) + 
        geom_bar(fill='pink', colour='black') + cowplot::theme_cowplot()
    base_plot <- ggplot(data.frame(idx=1:50, dat=as.character(sread(shortfq_obj)[1:50]))) + 
        geom_text(aes(x=1, y=idx, label=dat), size=rel(2.5), family='mono', hjust=0) + xlim(1,1.8) +
        theme_bw() + theme(axis.text.x=element_text(size=0)) + xlab('')
    return(list(nuc_plot=nuc_plot, qual_plot=qual_plot, size_plot=size_plot, base_plot=base_plot))
}

Import a sample of data

The ShortRead::yield function is used to load a subset of 1M 151bp single end Nextseq reads from a .fastq file.

fq_fn <- grep('DO9_', list.files(fastq_dir, 'fastq.gz'), value = TRUE)[1]
fq_path <- file.path(fastq_dir, fq_fn)
fq_data <- ShortRead::yield(FastqSampler(fq_path, 1000000))

Explore the raw data

We now generate 3 plots using the qa_plots function defined above: * Plot A depicts the proportion of reads with each nucleotide at positions 1-151. It is evident that G’s predominate towards the end of the reads, and A’s increase up to around 60bp before reducing. The read distribution seems to change markedly at around position 10. * Plot B depicts the average quality score at positions 1-151 across all reads. Quality reduces towards the end of the read with an uptick towards the very end, and the first 10bp seem to have lower quality than the second 10bp. * Plot C is a histogram of read lengths: all reads are 151bp at this stage.

qp <- qa_plots(fq_data)

## Warning: closing unused connection 5 (/Users/pchapman/BigData/Lexogen/
## raw_fastq/DO9_2_DMSO_ctrl_5Day_S2_L003_R1_001.fastq.gz)

cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')

Viewing a sample of 50 reads demonstrates what can be seen in the summary plots above. Many reads have strings of G’s towards the end, and poly-A tails are also in evidence. The NextSeq sequencing chemistry returns a G when there is very low signal, suggesting that in fact the increase in incidence of G’s is related to sequence quality rather than really being sequence related.

qp$base_plot

It is also useful to view the most frequent sequences which shows that strings of A’s followed by G’s are very common:

ShortRead:::.freqSequences(qa(fq_data, lane=''), "read")

##                                                                                                                                                   sequence
## 1  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 2  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 3  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 4  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 5  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 6  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 7  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 8  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 9  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 10 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 11 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 12 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 13 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 14 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 15 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 16 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 17 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 18 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 19 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 20 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
##    count lane
## 1    334     
## 2    133     
## 3    124     
## 4    120     
## 5    118     
## 6    101     
## 7     90     
## 8     82     
## 9     81     
## 10    76     
## 11    73     
## 12    68     
## 13    62     
## 14    61     
## 15    61     
## 16    57     
## 17    54     
## 18    53     
## 19    50     
## 20    50

Remove the 5’ Adapter

The first 12bp are in fact adapter sequence, hence the market shift in both quality and base distributions at this point. These can be removed using the ShortRead::narrow function:

fq <- fq_data
fq <- ShortRead::narrow(fq, start=13)
workflow_stats$step01_start <- length(fq)
length(fq)

## [1] 1000000

Regenerating the qa plots confirms that this has happened and we now have a 141 bp read size:

qp <- qa_plots(fq)
cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')

Remove poor quality data at the 3’ end of the read

The ShortRead::trimTailw function removes low-quality reads from the 3’ end using a sliding window of nucleotides falling at or below a quality threshold. Here we define a sliding window size of 12 by setting halfwidth to 6, a number of nucleotides as 6, and a quality threshold of 4. We also remove reads that are now below 36bp in length:

fq <- ShortRead::trimTailw(fq, k=6, a="4", halfwidth=6)
fq <- fq[width(fq) >= 36]
workflow_stats$step02_poorqual <- length(fq)
length(fq)

## [1] 829306

The quality plots now look markedly different before with the relative proportions of the nucleotides looking much more even, although A’s are still over-represented since we haven’t removed the poly-A sequences yet. Quality still declines but now levels out as would be expected given our filtering. There also appear to be a subset of full length reads of high quality as indicated by the uptick at the right hand side of plots B and C.

qp <- qa_plots(fq)
cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')

It is useful to eyeball a sample of reads at this point to confirm the removal of the low quality G bases and the retention of the poly-A sequences.

qp$base_plot

Trim poly-A tails

The poly-A tails can be identified using the ShortRead::trimEnds function to examine the right end of the reads and return the location of any A’s. Setting ranges = TRUE returns a ranges object which can then be fed into the ShortReads::narrow function to do the actual clipping.

narrow_ranges <- ShortRead::trimEnds(sread(fq), right=TRUE, "A", relation="==", ranges=TRUE) 
fq <- ShortRead::narrow(fq, start(narrow_ranges), end(narrow_ranges))
fq <- fq[width(fq) >= 36]
workflow_stats$step03_polya <- length(fq)
length(fq)

## [1] 745519

Plot A shows that we have succeeded in reducing the over-representation of A’s in the first at around 50 cycles.

qp <- qa_plots(fq)
cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')

Looking at the reads we can confirm the removal of poly-A’s , however, there are still some embedded poly-A sequences that are 3’flanked by non-A nucleotides which need to be removed.

qp$base_plot

Trim embedded poly-A tails

Here we apply the regexpr function to the read sequences (using ShortRead::sread) to identify the first position of a polyA of length 10. This returns a vector of the same length as the ShortReadQ object which is either the start position of the polyA or -1 if no polyA was present. This vector can be combined with the length of the read and fed into the ShortRead::narrow function to clip those reads that need to be clipped. This gives the final set of reads for mapping.

polyApos <- regexpr(pattern= 'AAAAAAAAAA', sread(fq), fixed=TRUE) %>% unlist()
polyAclip_idx <- which(polyApos >= 0)
polyAclip <- width(fq)
polyAclip[polyAclip_idx] <- polyApos[polyAclip_idx] 
fq <- narrow(fq, end=polyAclip)
fq <- fq[width(fq) >= 36]
workflow_stats$step04_3padapt <- length(fq)
length(fq)

## [1] 721162

We see a slight improvement in the quality plots.

qp <- qa_plots(fq)
cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')

And confirm that embedded poly-A’s have been removed.

qp$base_plot

Align to reference genome using Rsubread

At this point we have data that can be fed into a standard single end RNAseq workflow. RSubread is a useful aligner for single end RNAseq data since it doesn’t try to map across splice junctions

tmp_fastq <- tempfile()
writeFastq(fq, tmp_fastq)
tmp_bam <- tempfile()
align(index=file.path(ref_dir,"hs84_subread_index"),
      readfile1=tmp_fastq,   
      input_format='gzFASTQ', 
      output_file=tmp_bam,   
      nthreads=8)

## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 1.24.0
## 
## //========================== subread-align setting ===========================\\
## ||                                                                            ||
## || Function      : Read alignment (RNA-Seq)                                   ||
## || Input file    : /var/folders/9d/hs3g06m50095p0_tphcpyzy80000gp/T//Rtmp ... ||
## || Output file   : /var/folders/9d/hs3g06m50095p0_tphcpyzy80000gp/T//Rtmp ... ||
## || Index name    : /Users/pchapman/BigData/ENSEMBL/hs84/hs84_subread_index    ||
## ||                                                                            ||
## ||                       Threads : 8                                          ||
## ||                  Phred offset : 33                                         ||
## ||                     Min votes : 3 / 10                                     ||
## ||    Maximum allowed mismatches : 3                                          ||
## ||   Maximum allowed indel bases : 5                                          ||
## || # of best alignments reported : 1                                          ||
## ||                Unique mapping : yes                                        ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## //====================== Running (08-Dec-2016 07:20:26) ======================\\
## ||                                                                            ||
## || The input file contains base space reads.                                  ||
## || The range of Phred scores observed in the data is [2,36]                   ||
## || Load the 1-th index block...                                               ||
## ||    7% completed, 0.3 mins elapsed, rate=10.6k reads per second             ||
## ||   13% completed, 0.3 mins elapsed, rate=17.2k reads per second             ||
## ||   20% completed, 0.3 mins elapsed, rate=22.3k reads per second             ||
## ||   27% completed, 0.4 mins elapsed, rate=25.2k reads per second             ||
## ||   34% completed, 0.4 mins elapsed, rate=27.7k reads per second             ||
## ||   40% completed, 0.4 mins elapsed, rate=30.3k reads per second             ||
## ||   47% completed, 0.4 mins elapsed, rate=31.0k reads per second             ||
## ||   53% completed, 0.4 mins elapsed, rate=32.7k reads per second             ||
## ||   60% completed, 0.4 mins elapsed, rate=34.3k reads per second             ||
## ||   70% completed, 0.5 mins elapsed, rate=16.7k reads per second             ||
## ||   73% completed, 0.5 mins elapsed, rate=16.7k reads per second             ||
## ||   76% completed, 0.5 mins elapsed, rate=16.8k reads per second             ||
## ||   80% completed, 0.6 mins elapsed, rate=16.8k reads per second             ||
## ||   83% completed, 0.6 mins elapsed, rate=17.0k reads per second             ||
## ||   86% completed, 0.6 mins elapsed, rate=17.1k reads per second             ||
## ||   90% completed, 0.6 mins elapsed, rate=17.2k reads per second             ||
## ||   93% completed, 0.6 mins elapsed, rate=17.3k reads per second             ||
## ||   96% completed, 0.7 mins elapsed, rate=17.4k reads per second             ||
## ||   99% completed, 0.7 mins elapsed, rate=17.5k reads per second             ||
## ||                                                                            ||
## ||                          Completed successfully.                           ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================================= Summary ==================================\\
## ||                                                                            ||
## ||          Processed : 721,162 reads                                         ||
## ||             Mapped : 588,889 reads (81.7%)                                 ||
## ||             Indels : 2,612                                                 ||
## ||                                                                            ||
## ||       Running time : 0.7 minutes                                           ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//

workflow_stats$step05_alignscores <- propmapped(tmp_bam)

## The input file is opened as a BAM file.
## The fragments in the input file are being counted.
## Finished. All records: 721162; all fragments: 721162; mapped fragments: 588889; the mappability is 81.66%

Count reads mapping to features

Use RSubread::featureCounts to count features mapping to exons: note that strand-specific is set to 1 whereas, depending on the library preparation method, this is often set to 2 for poly-A RNAseq.

fc_ensembl_84 <- featureCounts(files=tmp_bam,
                               annot.ext=gtf.file,
                               isGTFAnnotationFile=TRUE,
                               GTF.featureType="exon",
                               GTF.attrType="gene_id",
                               useMetaFeatures=TRUE,
                               allowMultiOverlap=FALSE,
                               nthreads=8,
                               strandSpecific=1,
                               countMultiMappingReads=FALSE)

## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 1.24.0
## 
## //========================== featureCounts setting ===========================\\
## ||                                                                            ||
## ||             Input files : 1 BAM file                                       ||
## ||                           S /var/folders/9d/hs3g06m50095p0_tphcpyzy800 ... ||
## ||                                                                            ||
## ||             Output file : ./.Rsubread_featureCounts_pid4084                ||
## ||                 Summary : ./.Rsubread_featureCounts_pid4084.summary        ||
## ||              Annotation : /Users/pchapman/BigData/ENSEMBL/hs84/Homo_sa ... ||
## ||      Dir for temp files : .                                                ||
## ||                                                                            ||
## ||                 Threads : 8                                                ||
## ||                   Level : meta-feature level                               ||
## ||              Paired-end : no                                               ||
## ||         Strand specific : stranded                                         ||
## ||      Multimapping reads : not counted                                      ||
## || Multi-overlapping reads : not counted                                      ||
## ||       Overlapping bases : 0.0%                                             ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Load annotation file /Users/pchapman/BigData/ENSEMBL/hs84/Homo_sapiens ... ||
## ||    Features : 1176808                                                      ||
## ||    Meta-features : 60675                                                   ||
## ||    Chromosomes/contigs : 59                                                ||
## ||                                                                            ||
## || Process BAM file /var/folders/9d/hs3g06m50095p0_tphcpyzy80000gp/T//Rtm ... ||
## ||    Single-end reads are included.                                          ||
## ||    Assign reads to features...                                             ||
## ||    Total reads : 721162                                                    ||
## ||    Successfully assigned reads : 530294 (73.5%)                            ||
## ||    Running time : 0.01 minutes                                             ||
## ||                                                                            ||
## ||                         Read assignment finished.                          ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//

Count reads mapping to protein-coding genes

First of all we can use the ensembldb package to make a GRanges object containing information on all protein coding genes.

if(!file.exists(sqlite_path)) {
    ## generate the SQLite database file
    ensembldb::ensDbFromGtf(gtf=gtf.file, path=ref_dir, outfile=sqlite_file)
}
EnsDb.Hsapiens.v84 <- ensembldb::EnsDb(sqlite_path)
ag <- ensembldb::genes(EnsDb.Hsapiens.v84, filter=list(GenebiotypeFilter('protein_coding'))) 
ag

## GRanges object with 19826 ranges and 6 metadata columns:
##                   seqnames               ranges strand |         gene_id
##                      <Rle>            <IRanges>  <Rle> |     <character>
##   ENSG00000186092        1     [ 69091,  70008]      + | ENSG00000186092
##   ENSG00000279928        1     [182393, 184158]      + | ENSG00000279928
##   ENSG00000279457        1     [184923, 200322]      - | ENSG00000279457
##   ENSG00000278566        1     [450740, 451678]      - | ENSG00000278566
##   ENSG00000273547        1     [685716, 686654]      - | ENSG00000273547
##               ...      ...                  ...    ... .             ...
##   ENSG00000205916        Y [24833843, 24907040]      + | ENSG00000205916
##   ENSG00000185894        Y [25030901, 25062548]      - | ENSG00000185894
##   ENSG00000279115        Y [25307702, 25308107]      + | ENSG00000279115
##   ENSG00000280301        Y [25463994, 25473714]      + | ENSG00000280301
##   ENSG00000172288        Y [25622162, 25624902]      + | ENSG00000172288
##                     gene_name  entrezid   gene_biotype seq_coord_system
##                   <character> <integer>    <character>        <integer>
##   ENSG00000186092       OR4F5      <NA> protein_coding             <NA>
##   ENSG00000279928  FO538757.3      <NA> protein_coding             <NA>
##   ENSG00000279457  FO538757.2      <NA> protein_coding             <NA>
##   ENSG00000278566      OR4F29      <NA> protein_coding             <NA>
##   ENSG00000273547      OR4F16      <NA> protein_coding             <NA>
##               ...         ...       ...            ...              ...
##   ENSG00000205916        DAZ4      <NA> protein_coding             <NA>
##   ENSG00000185894       BPY2C      <NA> protein_coding             <NA>
##   ENSG00000279115  AC006386.1      <NA> protein_coding             <NA>
##   ENSG00000280301  AC006328.4      <NA> protein_coding             <NA>
##   ENSG00000172288        CDY1      <NA> protein_coding             <NA>
##                        symbol
##                   <character>
##   ENSG00000186092       OR4F5
##   ENSG00000279928  FO538757.3
##   ENSG00000279457  FO538757.2
##   ENSG00000278566      OR4F29
##   ENSG00000273547      OR4F16
##               ...         ...
##   ENSG00000205916        DAZ4
##   ENSG00000185894       BPY2C
##   ENSG00000279115  AC006386.1
##   ENSG00000280301  AC006328.4
##   ENSG00000172288        CDY1
##   -------
##   seqinfo: 40 sequences from GRCh38 genome

Then we can count the number of reads in the exons of all genes, and those in the exons of protein coding genes.

#count of reads in exons
fc_mat <- fc_ensembl_84$counts
workflow_stats$step06_readsinexons <- sum(fc_mat)

#count of reads in protein coding exons
fc_mat_pc <- fc_mat[rownames(fc_mat) %in% ag$gene_id,]
workflow_stats$step07_readsinproteincodingexons <- sum(fc_mat_pc)

Evaluate the performance of the workflow

During the workflow we have recorded the number of reads retained at each point and this is summarised in the plot below. Eventually around 50% of reads map to a protein coding exon, although it is possible that this number could be increased by optimising the library preparation. The libraries sequenced in this experiment had an insert size range from 122-1000bp with an average size of 186bp, whereas this can be increaed to 122-1500bp (avg 259bp) or 122-2000bp (avg 355bp). Since around 20% of reads are lost in the first quality and size filtering step, a longer insert size could reduce this attrition.

workflow_stats_df <- workflow_stats
workflow_stats_df$step05_alignscores <- workflow_stats_df$step05_alignscores[1,'NumMapped']
workflow_stats_df <- as.data.frame(workflow_stats_df) %>% 
    mutate(fn=fq_fn) %>%
    tidyr::gather(step, readcount,-fn)

library(ggplot2)
ggplot(workflow_stats_df, aes(x=step, y=readcount, fill=step)) + 
    geom_bar(stat='identity') + 
    geom_text(aes(label=round(readcount/10^6,3))) +
    cowplot::theme_cowplot()  + 
    xlab('') + ggtitle("Lexogen 3'Quantseq Read Processing") +
    theme(legend.position='none',
          axis.text.x=element_text(angle=330, vjust=1, hjust=0),
          plot.margin = unit(c(2, 5, 0, 0), "lines"))

Conclusion

This vignette demonstrates a prototype analysis pipeline developed for the novel Quantseq 3’UTR RNA sequencing kit from Lexogen. Although many approaches are available for the analysis of NGS data, R/Bioconductor has the advantage of being able to work from end to end in the same environment. Its strong visualisation capabilities and high level functions allow the impact of analysis choices at each step to be readily understood during the prototyping stage. Once prototyping is complete, a production pipeline could be implemented using the BiocParallel package with BatchJobs serving as the interface between R and a wide range of queuing software including SGE and Torque.

The workflow developed here for Lexogen Quantseq data showed that around 50% of reads generated are available for subsequent count based differential expression analysis analysis, and that modifying the insert size at the library preparation stage may increase this proportion.

Session Info

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
## 
## locale:
## [1] en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ensembldb_1.6.2            GenomicFeatures_1.26.0    
##  [3] AnnotationDbi_1.36.0       ggplot2_2.2.0             
##  [5] tidyr_0.6.0                dplyr_0.5.0               
##  [7] ShortRead_1.32.0           GenomicAlignments_1.10.0  
##  [9] SummarizedExperiment_1.4.0 Biobase_2.34.0            
## [11] Rsamtools_1.26.1           GenomicRanges_1.26.1      
## [13] GenomeInfoDb_1.10.1        Biostrings_2.42.1         
## [15] XVector_0.14.0             IRanges_2.8.1             
## [17] S4Vectors_0.12.1           BiocParallel_1.8.1        
## [19] BiocGenerics_0.20.0        Rsubread_1.24.0           
## 
## loaded via a namespace (and not attached):
##  [1] lattice_0.20-34               colorspace_1.3-1             
##  [3] htmltools_0.3.5               rtracklayer_1.34.1           
##  [5] yaml_2.1.14                   interactiveDisplayBase_1.12.0
##  [7] XML_3.98-1.5                  DBI_0.5-1                    
##  [9] RColorBrewer_1.1-2            plyr_1.8.4                   
## [11] stringr_1.1.0                 zlibbioc_1.20.0              
## [13] munsell_0.4.3                 gtable_0.2.0                 
## [15] hwriter_1.3.2                 evaluate_0.10                
## [17] memoise_1.0.0                 labeling_0.3                 
## [19] latticeExtra_0.6-28           knitr_1.15.1                 
## [21] biomaRt_2.30.0                httpuv_1.3.3                 
## [23] BiocInstaller_1.24.0          Rcpp_0.12.8                  
## [25] xtable_1.8-2                  scales_0.4.1                 
## [27] backports_1.0.4               mime_0.5                     
## [29] AnnotationHub_2.6.4           digest_0.6.10                
## [31] stringi_1.1.2                 shiny_0.14.2                 
## [33] cowplot_0.7.0                 grid_3.3.1                   
## [35] rprojroot_1.1                 tools_3.3.1                  
## [37] bitops_1.0-6                  magrittr_1.5                 
## [39] lazyeval_0.2.0                RCurl_1.95-4.8               
## [41] tibble_1.2                    RSQLite_1.1                  
## [43] Matrix_1.2-7.1                assertthat_0.1               
## [45] rmarkdown_1.2                 httr_1.2.1                   
## [47] R6_2.2.0