Predicting ORFs starts with knowing the starting and ending codons in a genomic loci, this can be useful for finding potential genes de novo. In this, we’ll use systemPipeR, Biostrings and GRanges in a downstream manner that will allow to cross-reference other data such RNAseq.
The dataset is composed of a short (~150 Kb) DNA sequence of Arabidopsis chloroplast genome as input. It is called arabidopsis_choloroplast.fa. (https://github.com/PacktPublishing/R-Bioinformatics-Cookbook/tree/master).
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.21")
BiocManager::install("IRanges") #
BiocManager::install("karyoploteR")
BiocManager::install("Gviz")
BiocManager::install("systemPipeR")
BiocManager::install("Biostrings")
library(SummarizedExperiment)
### Also we need the readr package
install.packages("readr")
## package 'readr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\USUARIO\AppData\Local\Temp\RtmpSejXM2\downloaded_packages
install.packages("Gviz")
install.packages("magrittr")
## package 'magrittr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\USUARIO\AppData\Local\Temp\RtmpSejXM2\downloaded_packages
install.packages("ggplot2")
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\USUARIO\AppData\Local\Temp\RtmpSejXM2\downloaded_packages
library(ggplot2)
library(karyoploteR)
library(GenomicRanges)
library(systemPipeR)
library(Biostrings)
Once we have installed all the packages, we load the datasets (arabidopsis_chroloplast.fa):
dna_object <- readDNAStringSet(file.path(getwd(), "datasets", "arabidopsis_chloroplast.fa")) #First we have to load the data
Then, we will be able to predict the ORFs with predORF:
predicted_orfs = predORF(dna_object, n = 'all', type = 'gr',
mode = 'ORF', strand = 'both', longest_disjoint = TRUE)
predicted_orfs
## GRanges object with 2501 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 1162 chloroplast 2056-2532 - | 1 3
## 2 chloroplast 72371-73897 + | 2 2
## 1163 chloroplast 77901-78362 - | 2 1
## 3 chloroplast 54937-56397 + | 3 3
## ... ... ... ... . ... ...
## 2497 chloroplast 129757-129762 - | 1336 3
## 2498 chloroplast 139258-139263 - | 1337 3
## 2499 chloroplast 140026-140031 - | 1338 3
## 2500 chloroplast 143947-143952 - | 1339 3
## 2501 chloroplast 153619-153624 - | 1340 3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# With the both strands, we are looking for all ORFs in + or - DNA strand.
This is an object of GRanges (Genomic Ranges) type, this represents the coordinates of the ORFs found in the sequence of Arabidopsis chloroplast. With predORF() we found 2,501 Open Reading Frames within the chloroplast reference sequence. If we want to extract the sequences we could make a single code line:
orf_seq = getSeq(dna_object,predicted_orfs)
orf_seq
## DNAStringSet object of length 2501:
## width seq names
## [1] 6597 ATGGTCGAAAGCAAAAATCTCT...ATGAAAATAGGATTCATGTAA 1
## [2] 477 ATGCTAGAAAACTCATTTCTAA...GATTTGGTCAATCATGAATAA 1162
## [3] 1527 ATGGGTTTGCCTTGGTATCGTG...ACAAAACGACAAGCAGTCTGA 2
## [4] 462 ATGCCTGTTGAAAATGCCAATC...GACATTCTAGAAAAAAAATAG 1163
## [5] 1461 ATGACTTGTAGGGAGGGACTTA...AAATTAGATGGCCAAGAGTAG 3
## ... ... ...
## [2497] 6 ATGTGA 2497
## [2498] 6 ATGTGA 2498
## [2499] 6 ATGTGA 2499
## [2500] 6 ATGTAA 2500
## [2501] 6 ATGTAG 2501
bases = c("A", "C", "T", "G") # We use the typical DNA format
raw_seq_string = strsplit(as.character(dna_object), "")
seq_length = width(dna_object[1])
#For each DNA base, count how many times it appears in the raw_seq_string
counts = lapply(bases, function(x) {
sum(raw_seq_string[[1]] == x)
})
# Compute relative base counts
probs = unlist(lapply(counts, function(base_count) {
signif(base_count / seq_length, 2)
}))
get_longest_orf_in_random_genome <- function(x,
length = 1000,
probs = c(0.25, 0.25, 0.25, 0.25),
bases = c("A","C","T","G")){
random_genome <- paste0(sample(bases, size = length, replace =
TRUE, prob = probs), collapse = "")
random_dna_object <- DNAStringSet(random_genome)
names(random_dna_object) <- c("random_dna_string")
orfs <- predORF(random_dna_object, n = 1, type = 'gr',
mode='ORF', strand = 'both', longest_disjoint = TRUE)
return(max(width(orfs)))
}
For practicing, we force the code to run the last function on 10 simulated genomes of 1000 lenght:
random_lengths <- unlist(lapply(1:10,
get_longest_orf_in_random_genome, length = seq_length, probs =
probs, bases = bases))
longest_random_orf <- max(random_lengths) # Get the lenght of the longest random ORF
For keeping only predicted ORFs longer than the longest random ORF we use:
keep <- width(predicted_orfs) > longest_random_orf
orfs_to_keep <- predicted_orfs[keep]
orfs_to_keep
## GRanges object with 14 ranges and 2 metadata columns:
## seqnames ranges strand | subject_id inframe2end
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## 1 chloroplast 86762-93358 + | 1 2
## 2 chloroplast 72371-73897 + | 2 2
## 3 chloroplast 54937-56397 + | 3 3
## 4 chloroplast 57147-58541 + | 4 1
## 5 chloroplast 33918-35141 + | 5 1
## .. ... ... ... . ... ...
## 10 chloroplast 60741-61430 + | 10 1
## 11 chloroplast 143088-143708 + | 11 1
## 12 chloroplast 62038-62619 + | 12 3
## 13 chloroplast 59772-60326 + | 13 1
## 14 chloroplast 75804-76292 + | 14 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Once then, we could create the ORF plot for viewing the ORFs position within the reference sequence. For instance, we will use Gviz from Bioconductor.
library(Gviz)
options(ucscChromosomeNames=FALSE)
orfTrack <- AnnotationTrack(predicted_orfs, name = "ORFs",
fill = "darkblue", stacking = "dense")
# Genomic Axis
genomeAxis <- GenomeAxisTrack()
plotTracks(list(genomeAxis, orfTrack), from = 1, to = width(dna_object[1]))