1 Introduction

High throughput RNA sequencing experiments have allowed the discovery of many new RNA species. Current bioinformatic pipelines typically involve aligning these sequencing reads to the genome and assembling contiguous reads into transcript architectures. This will output full descriptions of transcript and exon coordinates in Gene Transfer Format (GTF). Further analyses including differential transcript expression can be performed with these custom-assembled transcript information. Despite having rich details on transcript architectures, these custom-assembled transcripts lack functional information including coding potential.

To this end, we developed a program with tools to Functionally Annotate Custom-assembled Transcripts in R (factR). At its core, factR will construct CDS features for assembled transcriptomes using a reference annotation as guide. With this newly-built coding information, we may then query transcripts for protein domain features, for nonsense-mediated decay inducing features and for upstream open-reading frames. factR also contain tools to annotate alternative segments between transcripts, to auto-fill gene metadata and other tools to visualize particular transcript families from a GTF file.

2 Getting started

2.1 How to install and load factR and other dependencies

To install factR, open R and type:

The above installation will also install packages needed to run factR. In addition, you would need to install the following genome sequence for this workflow:

Next, we need to load the following packages:

2.2 What inputs you need

2.2.1 Query transcript assembly

Transcript assembly programs typically outputs a GTF file containing transcript and exon coordinates, as well as attributes for each entry. It is mandatory that your assembly contain gene_id and transcript_id attributes.

Query GTF file have to be imported into R and parsed into a GenomicRanges object. To do so, run:

2.2.2 Reference annotation

Reference transcript annotation serves as a guide in the construction of query CDS. We strongly recommend using transcript annotation in GTF/GFF3 format for the workflow. Reference GTF annotation can be acquired from the Annotation Hub database in R or downloaded locally from data server website via FTP. For the latter, the GTF file have to be imported into R and parsed into a GenomicRanges object.

Here, we will describe three ways to obtain mouse annotation:

  1. vM6 Gencode basic annotation from Annotation Hub database
  1. GRCm38 Ensembl annotation from Annotation Hub database
  1. Import locally downloaded GTF annotation into R

2.2.3 Genomic sequence

Genomic DNA sequence is required for determining open-reading frames and for in silico translation of query transcripts. Here, we describe three ways to obtain genomic sequence for mouse GRCm38 assembly:

  1. BSgenome sequence object from UCSC The package BSgenome.Mmusculus.UCSC.mm10 contains genome sequence for mouse mm10.
  1. TwoBit fasta sequence (Ensembl) from Annotation Hub
  1. Import local FASTA genome sequence file

Note: when using DNA FASTA from Ensembl, do check header lines for entry names. If long names are given (eg. 1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF), you would need to convert them to short names in order to have consistent sequence levels. This can be done by running the following:

3 Sample datasets

3.1 Query

factR package contains a sample assembled GTF imported as a GRanges object. The assembly was constructed from a mouse dataset and contain transcript and exon descriptions of 4 transcript isoforms from the same parent gene. Transcripts in query_gtf have been assigned generic gene_ids and transcript_ids. This is to recapitulate the nomenclature of a typical custom-assembled GTF.

Users can quickly visualize the exon structure of these query transcripts using the viewTranscripts tool, which is essentially a convenient wrapper to plot transcripts using the wiggleplotr package:

4 Matching chromosome IDs

Different genomic objects may have different chromosome IDs (aka seqlevels), depending on its source. Users may have a custom GTF with “Ensembl”-style seqlevels but wish to use a gencode reference annotation which contain “UCSC”-style seqlevels. The sample query_gtf and ref_gtf objects presents this exact predicament:

factR contain a tool to effectively match the seqlevels of a query GRanges object to any reference object with seqlevel information:

Note: matchChromosomes will drop any unmatched seqlevels from query

factR also comes with the function has_consistentSeqlevels to test if two or more objects have consistent seqlevels. This function returns a logical value of FALSE if input objects have inconsistent seqlevels and recommends a matchChromosomes function to run.

5 Matching gene metadata

Custom-built GTF annotation may contain generic gene_id and gene_name metadata (e.g. ‘MSTRG1’), depending on how it was constructed. Users who wish to fill these metadata using a reference annotation may run the matchGeneInfo function:

By default, this function will match query transcripts to reference by overlapping coordinates. Users may improve the accuracy of this matching function by providing inputs to primary_gene_id and secondary_gene_id arguments. See ?matchGeneMeta for detailed description.

6 Extract newly-discovered transcripts from query transcriptome

Very often, custom-assembled transcriptomes contain de novo isoforms. It is useful to subset the dataset for these newly-discovered transcripts to streamline downstream steps. To perform this, run:

7 Construct query CDS information

To build CDS information on our query data, users can simply run:

The above function will return a new GRanges object containing exon entries from query_gtf with appended cds entries. User can inspect the contents of the object further.

User can visualize the coding structure of each transcripts using the viewTranscripts tool:

The new query GTF Granges object can be exported as a .gtf file using rtracklayer’s export funcition:

8 Predicting protein domain structure

Upon construction of coding segments on transcripts, users may be interested in determining the functional domains which is encoded within the coding sequence. To this end, users can utilize the predictDomains function. This tool accepts a GTF GRanges object containing CDS information as input, together with a genome sequence object. After checking the translated protein sequence, predictDomains will query the SuperFamily database for domain features and plot its domain architecture:

In order to extract the domain features, predictDomains will attempt to submit the protein coding sequence to the SuperFamily database and return a summary of the domain structure. As this process is done over HTTP, the speed of this function is greatly dependent on the server and on the number of CDS in your list. This process can be sped up by subsetting the list of transcripts for analysis by providing logical conditions to the argument. This argument accepts unquoted metadata column names from the GTF as variables and quoted values to filter the GTF by. For example:

Alternatively, users may reduce the number of transcripts for the query by subsetting the input GTF for de novo transcripts.

9 Predicting transcript sensitivity to nonsense-mediated decay

Coding RNAs harbouring a premature stop codon are potentially eliminated by a process called Nonsense-mediated decay (NMD). Two common features of NMD targets include the upstream position of its stop_codon relative to the last exon-exon junction and the extensive length of its 3’-untranslated region (3’UTR). To report these NMD features and predict if any of the transcripts in the custom annotation are NMD-sensitive, users can run the predictNMD tool.

predictNMD can accept a GTF GRanges object as input and will only analyze coding transcripts for NMD-features.

predictNMD will output a tibble dataframe with report on NMD-inducing feature for each coding transcript. The column stop_to_lastEJ describes the relative distance of the stop_codon to the last exon-exon junction (lastEJ) of each transcript. Thus, a positive integer means that the lastEJ is downstream of the stop_codon. predictNMD also returns the number of exon-junctions downstream of the stop_codon (num_of_downEJs). 3'UTR_length describes the length of the 3’-untranslated region for each coding transcript. The column containing logical values for is_NMD will return TRUE if the stop_to_lastEJ value exceed the NMD triggering threshold which is typically 50 base-pair. Users may set a custom value by overriding the default NMD_threshold argument.

Users who wish to analyze certain genes or transcripts in their GTF GRanges for NMD features may provide logical conditions to the ... argument to subset their GTF input. This argument accepts unquoted metadata column names from the GTF as variables and quoted values to filter the GTF by. For example:

This feature provides user the flexibility to analyze desired genes/transcripts from a large collection of GTF GRanges. Input GTF can be filtered with multiple conditions as long as variable names are found in the input GRanges object. Here are some examples on how this feature can be used on an Ensembl GTF annotation:

10 Session Information

All of the output in this vignette was produced under the following conditions:

sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] GenomicFeatures_1.42.1             AnnotationDbi_1.52.0              
#>  [3] Biobase_2.50.0                     BSgenome.Mmusculus.UCSC.mm10_1.4.0
#>  [5] BSgenome_1.58.0                    rtracklayer_1.50.0                
#>  [7] GenomicRanges_1.42.0               GenomeInfoDb_1.26.0               
#>  [9] Biostrings_2.58.0                  XVector_0.30.0                    
#> [11] IRanges_2.24.0                     S4Vectors_0.28.0                  
#> [13] AnnotationHub_2.22.0               BiocFileCache_1.14.0              
#> [15] dbplyr_2.0.0                       BiocGenerics_0.36.0               
#> [17] factR_0.0.0.9000                  
#> 
#> loaded via a namespace (and not attached):
#>  [1] MatrixGenerics_1.2.0          httr_1.4.2                   
#>  [3] bit64_4.0.5                   shiny_1.5.0                  
#>  [5] assertthat_0.2.1              askpass_1.1                  
#>  [7] interactiveDisplayBase_1.28.0 BiocManager_1.30.10          
#>  [9] blob_1.2.1                    GenomeInfoDbData_1.2.4       
#> [11] Rsamtools_2.6.0               progress_1.2.2               
#> [13] yaml_2.2.1                    BiocVersion_3.12.0           
#> [15] pillar_1.4.6                  RSQLite_2.2.1                
#> [17] lattice_0.20-41               glue_1.4.2                   
#> [19] digest_0.6.27                 promises_1.1.1               
#> [21] wiggleplotr_1.14.0            colorspace_2.0-0             
#> [23] htmltools_0.5.0               httpuv_1.5.4                 
#> [25] Matrix_1.2-18                 XML_3.99-0.5                 
#> [27] pkgconfig_2.0.3               biomaRt_2.46.0               
#> [29] bookdown_0.21                 zlibbioc_1.36.0              
#> [31] purrr_0.3.4                   xtable_1.8-4                 
#> [33] scales_1.1.1                  later_1.1.0.1                
#> [35] BiocParallel_1.24.1           openssl_1.4.3                
#> [37] tibble_3.0.4                  farver_2.0.3                 
#> [39] ggplot2_3.3.2                 generics_0.1.0               
#> [41] ellipsis_0.3.1                SummarizedExperiment_1.20.0  
#> [43] cli_2.1.0                     magrittr_1.5                 
#> [45] crayon_1.3.4                  mime_0.9                     
#> [47] memoise_1.1.0                 evaluate_0.14                
#> [49] fansi_0.4.1                   xml2_1.3.2                   
#> [51] prettyunits_1.1.1             tools_4.0.3                  
#> [53] data.table_1.13.2             hms_0.5.3                    
#> [55] lifecycle_0.2.0               matrixStats_0.57.0           
#> [57] stringr_1.4.0                 munsell_0.5.0                
#> [59] DelayedArray_0.16.0           compiler_4.0.3               
#> [61] rlang_0.4.8                   grid_4.0.3                   
#> [63] RCurl_1.98-1.2                drawProteins_1.10.0          
#> [65] rappdirs_0.3.1                labeling_0.4.2               
#> [67] bitops_1.0-6                  rmarkdown_2.5                
#> [69] gtable_0.3.0                  DBI_1.1.0                    
#> [71] curl_4.3                      R6_2.5.0                     
#> [73] GenomicAlignments_1.26.0      knitr_1.30                   
#> [75] dplyr_1.0.2                  
#>  [ reached getOption("max.print") -- omitted 9 entries ]