This workflow describes how the analysis of data from Lexogen’s 3’ Quantseq technique generated on an Illumina Nextseq can be analysed using R/Bioconductor. The Lexogen approach generates reads from the 3’UTR of an mRNA transcript and can be used for count-based differential expression analysis. Lexogen Quantseq libraries cost ~£15 per sample, and this low cost allows more samples to be profiled for a given budget. Larger experiments permit more replicates, more time points, more compounds, or more cell lines to be included in an experiment, potentially making it more informative.
See the Lexogen website and a Nature Methods advertising feature for more information. This figure below is a schematic of the library preparation protocol:
lib prep
The analysis is primary carried out using the ShortRead and RSubread Bioconductor packages for NGS analysis with ggplot2 and the cowplot extension package used for visualisation. The author thanks Martin Morgan, the ShortRead package author, for his assistance in developing this workflow.
Locations of various files and directories.
setwd('~/BigData/Scratch/2016Lexogen/')
fastq_dir <- '/Users/pchapman/BigData/Lexogen/raw_fastq'
ref_fa <- '/Users/pchapman/BigData/ENSEMBL/hs84/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa'
ref_dir <- '/Users/pchapman/BigData/ENSEMBL/hs84'
gtf.file <- file.path(ref_dir, "Homo_sapiens.GRCh38.84.gtf")
sqlite_file <- 'Homo_sapiens.GRCh38.84.sqlite'
sqlite_path <- file.path(ref_dir, sqlite_file)
workflow_stats <- list()
Here we define some ggplot2 functions based on the qa function of the ShortRead package.
qa_plots <- function(shortfq_obj) {
x <- ShortRead::qa(shortfq_obj, lane='')
nuc_plot <- ggplot(x[['perCycle']]$baseCall, aes(x=Cycle, y=Count, colour=Base)) + geom_line() + cowplot::theme_cowplot()
qual_plot <- x[['perCycle']]$quality %>% group_by(Cycle) %>% summarise(QualScore=sum(Score*Count)/sum(Count)) %>% ungroup() %>%
ggplot(aes(x=Cycle, y=QualScore)) + geom_line() + cowplot::theme_cowplot()
size_plot <- ggplot(data.frame(frag_size=width(shortfq_obj)), aes(x=frag_size)) +
geom_bar(fill='pink', colour='black') + cowplot::theme_cowplot()
base_plot <- ggplot(data.frame(idx=1:50, dat=as.character(sread(shortfq_obj)[1:50]))) +
geom_text(aes(x=1, y=idx, label=dat), size=rel(2.5), family='mono', hjust=0) + xlim(1,1.8) +
theme_bw() + theme(axis.text.x=element_text(size=0)) + xlab('')
return(list(nuc_plot=nuc_plot, qual_plot=qual_plot, size_plot=size_plot, base_plot=base_plot))
}
The ShortRead::yield function is used to load a subset of 1M 151bp single end Nextseq reads from a .fastq file.
fq_fn <- grep('DO9_', list.files(fastq_dir, 'fastq.gz'), value = TRUE)[1]
fq_path <- file.path(fastq_dir, fq_fn)
fq_data <- ShortRead::yield(FastqSampler(fq_path, 1000000))
We now generate 3 plots using the qa_plots function defined above: * Plot A depicts the proportion of reads with each nucleotide at positions 1-151. It is evident that G’s predominate towards the end of the reads, and A’s increase up to around 60bp before reducing. The read distribution seems to change markedly at around position 10. * Plot B depicts the average quality score at positions 1-151 across all reads. Quality reduces towards the end of the read with an uptick towards the very end, and the first 10bp seem to have lower quality than the second 10bp. * Plot C is a histogram of read lengths: all reads are 151bp at this stage.
qp <- qa_plots(fq_data)
## Warning: closing unused connection 5 (/Users/pchapman/BigData/Lexogen/
## raw_fastq/DO9_2_DMSO_ctrl_5Day_S2_L003_R1_001.fastq.gz)
cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')
Viewing a sample of 50 reads demonstrates what can be seen in the summary plots above. Many reads have strings of G’s towards the end, and poly-A tails are also in evidence. The NextSeq sequencing chemistry returns a G when there is very low signal, suggesting that in fact the increase in incidence of G’s is related to sequence quality rather than really being sequence related.
qp$base_plot
It is also useful to view the most frequent sequences which shows that strings of A’s followed by G’s are very common:
ShortRead:::.freqSequences(qa(fq_data, lane=''), "read")
## sequence
## 1 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 2 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 3 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 4 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 5 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 6 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 7 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 8 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 9 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 10 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 11 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 12 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 13 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 14 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 15 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 16 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 17 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 18 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 19 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## 20 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## count lane
## 1 334
## 2 133
## 3 124
## 4 120
## 5 118
## 6 101
## 7 90
## 8 82
## 9 81
## 10 76
## 11 73
## 12 68
## 13 62
## 14 61
## 15 61
## 16 57
## 17 54
## 18 53
## 19 50
## 20 50
The first 12bp are in fact adapter sequence, hence the market shift in both quality and base distributions at this point. These can be removed using the ShortRead::narrow function:
fq <- fq_data
fq <- ShortRead::narrow(fq, start=13)
workflow_stats$step01_start <- length(fq)
length(fq)
## [1] 1000000
Regenerating the qa plots confirms that this has happened and we now have a 141 bp read size:
qp <- qa_plots(fq)
cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')
The ShortRead::trimTailw function removes low-quality reads from the 3’ end using a sliding window of nucleotides falling at or below a quality threshold. Here we define a sliding window size of 12 by setting halfwidth to 6, a number of nucleotides as 6, and a quality threshold of 4. We also remove reads that are now below 36bp in length:
fq <- ShortRead::trimTailw(fq, k=6, a="4", halfwidth=6)
fq <- fq[width(fq) >= 36]
workflow_stats$step02_poorqual <- length(fq)
length(fq)
## [1] 829306
The quality plots now look markedly different before with the relative proportions of the nucleotides looking much more even, although A’s are still over-represented since we haven’t removed the poly-A sequences yet. Quality still declines but now levels out as would be expected given our filtering. There also appear to be a subset of full length reads of high quality as indicated by the uptick at the right hand side of plots B and C.
qp <- qa_plots(fq)
cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')
It is useful to eyeball a sample of reads at this point to confirm the removal of the low quality G bases and the retention of the poly-A sequences.
qp$base_plot
The poly-A tails can be identified using the ShortRead::trimEnds function to examine the right end of the reads and return the location of any A’s. Setting ranges = TRUE returns a ranges object which can then be fed into the ShortReads::narrow function to do the actual clipping.
narrow_ranges <- ShortRead::trimEnds(sread(fq), right=TRUE, "A", relation="==", ranges=TRUE)
fq <- ShortRead::narrow(fq, start(narrow_ranges), end(narrow_ranges))
fq <- fq[width(fq) >= 36]
workflow_stats$step03_polya <- length(fq)
length(fq)
## [1] 745519
Plot A shows that we have succeeded in reducing the over-representation of A’s in the first at around 50 cycles.
qp <- qa_plots(fq)
cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')
Looking at the reads we can confirm the removal of poly-A’s , however, there are still some embedded poly-A sequences that are 3’flanked by non-A nucleotides which need to be removed.
qp$base_plot
Here we apply the regexpr function to the read sequences (using ShortRead::sread) to identify the first position of a polyA of length 10. This returns a vector of the same length as the ShortReadQ object which is either the start position of the polyA or -1 if no polyA was present. This vector can be combined with the length of the read and fed into the ShortRead::narrow function to clip those reads that need to be clipped. This gives the final set of reads for mapping.
polyApos <- regexpr(pattern= 'AAAAAAAAAA', sread(fq), fixed=TRUE) %>% unlist()
polyAclip_idx <- which(polyApos >= 0)
polyAclip <- width(fq)
polyAclip[polyAclip_idx] <- polyApos[polyAclip_idx]
fq <- narrow(fq, end=polyAclip)
fq <- fq[width(fq) >= 36]
workflow_stats$step04_3padapt <- length(fq)
length(fq)
## [1] 721162
We see a slight improvement in the quality plots.
qp <- qa_plots(fq)
cowplot::plot_grid(qp$nuc_plot, qp$qual_plot, qp$size_plot, labels='auto')
And confirm that embedded poly-A’s have been removed.
qp$base_plot
At this point we have data that can be fed into a standard single end RNAseq workflow. RSubread is a useful aligner for single end RNAseq data since it doesn’t try to map across splice junctions
tmp_fastq <- tempfile()
writeFastq(fq, tmp_fastq)
tmp_bam <- tempfile()
align(index=file.path(ref_dir,"hs84_subread_index"),
readfile1=tmp_fastq,
input_format='gzFASTQ',
output_file=tmp_bam,
nthreads=8)
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 1.24.0
##
## //========================== subread-align setting ===========================\\
## || ||
## || Function : Read alignment (RNA-Seq) ||
## || Input file : /var/folders/9d/hs3g06m50095p0_tphcpyzy80000gp/T//Rtmp ... ||
## || Output file : /var/folders/9d/hs3g06m50095p0_tphcpyzy80000gp/T//Rtmp ... ||
## || Index name : /Users/pchapman/BigData/ENSEMBL/hs84/hs84_subread_index ||
## || ||
## || Threads : 8 ||
## || Phred offset : 33 ||
## || Min votes : 3 / 10 ||
## || Maximum allowed mismatches : 3 ||
## || Maximum allowed indel bases : 5 ||
## || # of best alignments reported : 1 ||
## || Unique mapping : yes ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
##
## //====================== Running (08-Dec-2016 07:20:26) ======================\\
## || ||
## || The input file contains base space reads. ||
## || The range of Phred scores observed in the data is [2,36] ||
## || Load the 1-th index block... ||
## || 7% completed, 0.3 mins elapsed, rate=10.6k reads per second ||
## || 13% completed, 0.3 mins elapsed, rate=17.2k reads per second ||
## || 20% completed, 0.3 mins elapsed, rate=22.3k reads per second ||
## || 27% completed, 0.4 mins elapsed, rate=25.2k reads per second ||
## || 34% completed, 0.4 mins elapsed, rate=27.7k reads per second ||
## || 40% completed, 0.4 mins elapsed, rate=30.3k reads per second ||
## || 47% completed, 0.4 mins elapsed, rate=31.0k reads per second ||
## || 53% completed, 0.4 mins elapsed, rate=32.7k reads per second ||
## || 60% completed, 0.4 mins elapsed, rate=34.3k reads per second ||
## || 70% completed, 0.5 mins elapsed, rate=16.7k reads per second ||
## || 73% completed, 0.5 mins elapsed, rate=16.7k reads per second ||
## || 76% completed, 0.5 mins elapsed, rate=16.8k reads per second ||
## || 80% completed, 0.6 mins elapsed, rate=16.8k reads per second ||
## || 83% completed, 0.6 mins elapsed, rate=17.0k reads per second ||
## || 86% completed, 0.6 mins elapsed, rate=17.1k reads per second ||
## || 90% completed, 0.6 mins elapsed, rate=17.2k reads per second ||
## || 93% completed, 0.6 mins elapsed, rate=17.3k reads per second ||
## || 96% completed, 0.7 mins elapsed, rate=17.4k reads per second ||
## || 99% completed, 0.7 mins elapsed, rate=17.5k reads per second ||
## || ||
## || Completed successfully. ||
## || ||
## \\============================================================================//
##
## //================================= Summary ==================================\\
## || ||
## || Processed : 721,162 reads ||
## || Mapped : 588,889 reads (81.7%) ||
## || Indels : 2,612 ||
## || ||
## || Running time : 0.7 minutes ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
workflow_stats$step05_alignscores <- propmapped(tmp_bam)
## The input file is opened as a BAM file.
## The fragments in the input file are being counted.
## Finished. All records: 721162; all fragments: 721162; mapped fragments: 588889; the mappability is 81.66%
Use RSubread::featureCounts to count features mapping to exons: note that strand-specific is set to 1 whereas, depending on the library preparation method, this is often set to 2 for poly-A RNAseq.
fc_ensembl_84 <- featureCounts(files=tmp_bam,
annot.ext=gtf.file,
isGTFAnnotationFile=TRUE,
GTF.featureType="exon",
GTF.attrType="gene_id",
useMetaFeatures=TRUE,
allowMultiOverlap=FALSE,
nthreads=8,
strandSpecific=1,
countMultiMappingReads=FALSE)
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 1.24.0
##
## //========================== featureCounts setting ===========================\\
## || ||
## || Input files : 1 BAM file ||
## || S /var/folders/9d/hs3g06m50095p0_tphcpyzy800 ... ||
## || ||
## || Output file : ./.Rsubread_featureCounts_pid4084 ||
## || Summary : ./.Rsubread_featureCounts_pid4084.summary ||
## || Annotation : /Users/pchapman/BigData/ENSEMBL/hs84/Homo_sa ... ||
## || Dir for temp files : . ||
## || ||
## || Threads : 8 ||
## || Level : meta-feature level ||
## || Paired-end : no ||
## || Strand specific : stranded ||
## || Multimapping reads : not counted ||
## || Multi-overlapping reads : not counted ||
## || Overlapping bases : 0.0% ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
##
## //================================= Running ==================================\\
## || ||
## || Load annotation file /Users/pchapman/BigData/ENSEMBL/hs84/Homo_sapiens ... ||
## || Features : 1176808 ||
## || Meta-features : 60675 ||
## || Chromosomes/contigs : 59 ||
## || ||
## || Process BAM file /var/folders/9d/hs3g06m50095p0_tphcpyzy80000gp/T//Rtm ... ||
## || Single-end reads are included. ||
## || Assign reads to features... ||
## || Total reads : 721162 ||
## || Successfully assigned reads : 530294 (73.5%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Read assignment finished. ||
## || ||
## \\===================== http://subread.sourceforge.net/ ======================//
First of all we can use the ensembldb package to make a GRanges object containing information on all protein coding genes.
if(!file.exists(sqlite_path)) {
## generate the SQLite database file
ensembldb::ensDbFromGtf(gtf=gtf.file, path=ref_dir, outfile=sqlite_file)
}
EnsDb.Hsapiens.v84 <- ensembldb::EnsDb(sqlite_path)
ag <- ensembldb::genes(EnsDb.Hsapiens.v84, filter=list(GenebiotypeFilter('protein_coding')))
ag
## GRanges object with 19826 ranges and 6 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000186092 1 [ 69091, 70008] + | ENSG00000186092
## ENSG00000279928 1 [182393, 184158] + | ENSG00000279928
## ENSG00000279457 1 [184923, 200322] - | ENSG00000279457
## ENSG00000278566 1 [450740, 451678] - | ENSG00000278566
## ENSG00000273547 1 [685716, 686654] - | ENSG00000273547
## ... ... ... ... . ...
## ENSG00000205916 Y [24833843, 24907040] + | ENSG00000205916
## ENSG00000185894 Y [25030901, 25062548] - | ENSG00000185894
## ENSG00000279115 Y [25307702, 25308107] + | ENSG00000279115
## ENSG00000280301 Y [25463994, 25473714] + | ENSG00000280301
## ENSG00000172288 Y [25622162, 25624902] + | ENSG00000172288
## gene_name entrezid gene_biotype seq_coord_system
## <character> <integer> <character> <integer>
## ENSG00000186092 OR4F5 <NA> protein_coding <NA>
## ENSG00000279928 FO538757.3 <NA> protein_coding <NA>
## ENSG00000279457 FO538757.2 <NA> protein_coding <NA>
## ENSG00000278566 OR4F29 <NA> protein_coding <NA>
## ENSG00000273547 OR4F16 <NA> protein_coding <NA>
## ... ... ... ... ...
## ENSG00000205916 DAZ4 <NA> protein_coding <NA>
## ENSG00000185894 BPY2C <NA> protein_coding <NA>
## ENSG00000279115 AC006386.1 <NA> protein_coding <NA>
## ENSG00000280301 AC006328.4 <NA> protein_coding <NA>
## ENSG00000172288 CDY1 <NA> protein_coding <NA>
## symbol
## <character>
## ENSG00000186092 OR4F5
## ENSG00000279928 FO538757.3
## ENSG00000279457 FO538757.2
## ENSG00000278566 OR4F29
## ENSG00000273547 OR4F16
## ... ...
## ENSG00000205916 DAZ4
## ENSG00000185894 BPY2C
## ENSG00000279115 AC006386.1
## ENSG00000280301 AC006328.4
## ENSG00000172288 CDY1
## -------
## seqinfo: 40 sequences from GRCh38 genome
Then we can count the number of reads in the exons of all genes, and those in the exons of protein coding genes.
#count of reads in exons
fc_mat <- fc_ensembl_84$counts
workflow_stats$step06_readsinexons <- sum(fc_mat)
#count of reads in protein coding exons
fc_mat_pc <- fc_mat[rownames(fc_mat) %in% ag$gene_id,]
workflow_stats$step07_readsinproteincodingexons <- sum(fc_mat_pc)
During the workflow we have recorded the number of reads retained at each point and this is summarised in the plot below. Eventually around 50% of reads map to a protein coding exon, although it is possible that this number could be increased by optimising the library preparation. The libraries sequenced in this experiment had an insert size range from 122-1000bp with an average size of 186bp, whereas this can be increaed to 122-1500bp (avg 259bp) or 122-2000bp (avg 355bp). Since around 20% of reads are lost in the first quality and size filtering step, a longer insert size could reduce this attrition.
workflow_stats_df <- workflow_stats
workflow_stats_df$step05_alignscores <- workflow_stats_df$step05_alignscores[1,'NumMapped']
workflow_stats_df <- as.data.frame(workflow_stats_df) %>%
mutate(fn=fq_fn) %>%
tidyr::gather(step, readcount,-fn)
library(ggplot2)
ggplot(workflow_stats_df, aes(x=step, y=readcount, fill=step)) +
geom_bar(stat='identity') +
geom_text(aes(label=round(readcount/10^6,3))) +
cowplot::theme_cowplot() +
xlab('') + ggtitle("Lexogen 3'Quantseq Read Processing") +
theme(legend.position='none',
axis.text.x=element_text(angle=330, vjust=1, hjust=0),
plot.margin = unit(c(2, 5, 0, 0), "lines"))
This vignette demonstrates a prototype analysis pipeline developed for the novel Quantseq 3’UTR RNA sequencing kit from Lexogen. Although many approaches are available for the analysis of NGS data, R/Bioconductor has the advantage of being able to work from end to end in the same environment. Its strong visualisation capabilities and high level functions allow the impact of analysis choices at each step to be readily understood during the prototyping stage. Once prototyping is complete, a production pipeline could be implemented using the BiocParallel package with BatchJobs serving as the interface between R and a wide range of queuing software including SGE and Torque.
The workflow developed here for Lexogen Quantseq data showed that around 50% of reads generated are available for subsequent count based differential expression analysis analysis, and that modifying the insert size at the library preparation stage may increase this proportion.
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
##
## locale:
## [1] en_GB.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ensembldb_1.6.2 GenomicFeatures_1.26.0
## [3] AnnotationDbi_1.36.0 ggplot2_2.2.0
## [5] tidyr_0.6.0 dplyr_0.5.0
## [7] ShortRead_1.32.0 GenomicAlignments_1.10.0
## [9] SummarizedExperiment_1.4.0 Biobase_2.34.0
## [11] Rsamtools_1.26.1 GenomicRanges_1.26.1
## [13] GenomeInfoDb_1.10.1 Biostrings_2.42.1
## [15] XVector_0.14.0 IRanges_2.8.1
## [17] S4Vectors_0.12.1 BiocParallel_1.8.1
## [19] BiocGenerics_0.20.0 Rsubread_1.24.0
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-34 colorspace_1.3-1
## [3] htmltools_0.3.5 rtracklayer_1.34.1
## [5] yaml_2.1.14 interactiveDisplayBase_1.12.0
## [7] XML_3.98-1.5 DBI_0.5-1
## [9] RColorBrewer_1.1-2 plyr_1.8.4
## [11] stringr_1.1.0 zlibbioc_1.20.0
## [13] munsell_0.4.3 gtable_0.2.0
## [15] hwriter_1.3.2 evaluate_0.10
## [17] memoise_1.0.0 labeling_0.3
## [19] latticeExtra_0.6-28 knitr_1.15.1
## [21] biomaRt_2.30.0 httpuv_1.3.3
## [23] BiocInstaller_1.24.0 Rcpp_0.12.8
## [25] xtable_1.8-2 scales_0.4.1
## [27] backports_1.0.4 mime_0.5
## [29] AnnotationHub_2.6.4 digest_0.6.10
## [31] stringi_1.1.2 shiny_0.14.2
## [33] cowplot_0.7.0 grid_3.3.1
## [35] rprojroot_1.1 tools_3.3.1
## [37] bitops_1.0-6 magrittr_1.5
## [39] lazyeval_0.2.0 RCurl_1.95-4.8
## [41] tibble_1.2 RSQLite_1.1
## [43] Matrix_1.2-7.1 assertthat_0.1
## [45] rmarkdown_1.2 httr_1.2.1
## [47] R6_2.2.0