Predicting open reading frames in long reference sequences

Why to find Open Reading Frames (ORFs) in long reference sequences?

Predicting ORFs starts with knowing the starting and ending codons in a genomic loci, this can be useful for finding potential genes de novo. In this, we’ll use systemPipeR, Biostrings and GRanges in a downstream manner that will allow to cross-reference other data such RNAseq.

The dataset is composed of a short (~150 Kb) DNA sequence of Arabidopsis chloroplast genome as input. It is called arabidopsis_choloroplast.fa. (https://github.com/PacktPublishing/R-Bioinformatics-Cookbook/tree/master).

Required Bioconductor installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.21")
BiocManager::install("IRanges") #
BiocManager::install("karyoploteR")
BiocManager::install("Gviz")
BiocManager::install("systemPipeR")


BiocManager::install("Biostrings")


library(SummarizedExperiment)

### Also we need the readr package
install.packages("readr")

## package 'readr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpSejXM2\downloaded_packages

install.packages("Gviz")
install.packages("magrittr")

## package 'magrittr' successfully unpacked and MD5 sums checked

## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpSejXM2\downloaded_packages

install.packages("ggplot2")

## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USUARIO\AppData\Local\Temp\RtmpSejXM2\downloaded_packages

library(ggplot2)
library(karyoploteR)
library(GenomicRanges)
library(systemPipeR)
library(Biostrings)

Once we have installed all the packages, we load the datasets (arabidopsis_chroloplast.fa):

dna_object <- readDNAStringSet(file.path(getwd(), "datasets", "arabidopsis_chloroplast.fa")) #First we have to load the data

ORFs prediction

Then, we will be able to predict the ORFs with predORF:

predicted_orfs = predORF(dna_object, n = 'all', type = 'gr',
mode = 'ORF', strand = 'both', longest_disjoint = TRUE)
predicted_orfs

## GRanges object with 2501 ranges and 2 metadata columns:
##           seqnames        ranges strand | subject_id inframe2end
##              <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##      1 chloroplast   86762-93358      + |          1           2
##   1162 chloroplast     2056-2532      - |          1           3
##      2 chloroplast   72371-73897      + |          2           2
##   1163 chloroplast   77901-78362      - |          2           1
##      3 chloroplast   54937-56397      + |          3           3
##    ...         ...           ...    ... .        ...         ...
##   2497 chloroplast 129757-129762      - |       1336           3
##   2498 chloroplast 139258-139263      - |       1337           3
##   2499 chloroplast 140026-140031      - |       1338           3
##   2500 chloroplast 143947-143952      - |       1339           3
##   2501 chloroplast 153619-153624      - |       1340           3
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# With the both strands, we are looking for all ORFs in + or - DNA strand.

This is an object of GRanges (Genomic Ranges) type, this represents the coordinates of the ORFs found in the sequence of Arabidopsis chloroplast. With predORF() we found 2,501 Open Reading Frames within the chloroplast reference sequence. If we want to extract the sequences we could make a single code line:

orf_seq = getSeq(dna_object,predicted_orfs)
orf_seq

## DNAStringSet object of length 2501:
##        width seq                                            names               
##    [1]  6597 ATGGTCGAAAGCAAAAATCTCT...ATGAAAATAGGATTCATGTAA 1
##    [2]   477 ATGCTAGAAAACTCATTTCTAA...GATTTGGTCAATCATGAATAA 1162
##    [3]  1527 ATGGGTTTGCCTTGGTATCGTG...ACAAAACGACAAGCAGTCTGA 2
##    [4]   462 ATGCCTGTTGAAAATGCCAATC...GACATTCTAGAAAAAAAATAG 1163
##    [5]  1461 ATGACTTGTAGGGAGGGACTTA...AAATTAGATGGCCAAGAGTAG 3
##    ...   ... ...
## [2497]     6 ATGTGA                                         2497
## [2498]     6 ATGTGA                                         2498
## [2499]     6 ATGTGA                                         2499
## [2500]     6 ATGTAA                                         2500
## [2501]     6 ATGTAG                                         2501

Calculating the reference genome properties

bases = c("A", "C", "T", "G")  # We use the typical DNA format
raw_seq_string = strsplit(as.character(dna_object), "")

seq_length = width(dna_object[1])

#For each DNA base, count how many times it appears in the raw_seq_string
counts = lapply(bases, function(x) {
  sum(raw_seq_string[[1]] == x)
})

# Compute relative base counts
probs = unlist(lapply(counts, function(base_count) {
  signif(base_count / seq_length, 2)
}))

Do a function that finds the longest ORF in a simulated genome

get_longest_orf_in_random_genome <- function(x,
length = 1000,
probs = c(0.25, 0.25, 0.25, 0.25),
bases = c("A","C","T","G")){
random_genome <- paste0(sample(bases, size = length, replace =
TRUE, prob = probs), collapse = "")
random_dna_object <- DNAStringSet(random_genome)
names(random_dna_object) <- c("random_dna_string")
orfs <- predORF(random_dna_object, n = 1, type = 'gr',
mode='ORF', strand = 'both', longest_disjoint = TRUE)
return(max(width(orfs)))
}

For practicing, we force the code to run the last function on 10 simulated genomes of 1000 lenght:

random_lengths <- unlist(lapply(1:10,
get_longest_orf_in_random_genome, length = seq_length, probs =
probs, bases = bases))
longest_random_orf <- max(random_lengths) # Get the lenght of the longest random ORF

For keeping only predicted ORFs longer than the longest random ORF we use:

keep <- width(predicted_orfs) > longest_random_orf
orfs_to_keep <- predicted_orfs[keep]
orfs_to_keep

## GRanges object with 14 ranges and 2 metadata columns:
##         seqnames        ranges strand | subject_id inframe2end
##            <Rle>     <IRanges>  <Rle> |  <integer>   <numeric>
##    1 chloroplast   86762-93358      + |          1           2
##    2 chloroplast   72371-73897      + |          2           2
##    3 chloroplast   54937-56397      + |          3           3
##    4 chloroplast   57147-58541      + |          4           1
##    5 chloroplast   33918-35141      + |          5           1
##   ..         ...           ...    ... .        ...         ...
##   10 chloroplast   60741-61430      + |         10           1
##   11 chloroplast 143088-143708      + |         11           1
##   12 chloroplast   62038-62619      + |         12           3
##   13 chloroplast   59772-60326      + |         13           1
##   14 chloroplast   75804-76292      + |         14           1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Once then, we could create the ORF plot for viewing the ORFs position within the reference sequence. For instance, we will use Gviz from Bioconductor.

library(Gviz)

options(ucscChromosomeNames=FALSE)

orfTrack <- AnnotationTrack(predicted_orfs, name = "ORFs", 
                            fill = "darkblue", stacking = "dense")

# Genomic Axis
genomeAxis <- GenomeAxisTrack()

plotTracks(list(genomeAxis, orfTrack), from = 1, to = width(dna_object[1]))