# required packages
library(Biostrings)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
library(pander)
## Warning: package 'pander' was built under R version 4.1.2
For any given DNA sequence, it may be useful to calculte the porprtion or percentage of each base in the sequence.
I am just using a random sequence of DNA. I am also saving the length of the sequence since this will be useful later.
seq <- "CATTCAGTGC"
seq_length <- nchar(seq)
The first step is to split this into a character vector (separate the bases) The argument split specifies what where the string should be split. Since we want individual bases, an empty string will do.
Also, since the output is a list, and we only want the first element, I index into the list specifying that the seqience should only hold one element.
seq <- strsplit(seq, split="")[[1]]
seq
## [1] "C" "A" "T" "T" "C" "A" "G" "T" "G" "C"
We can find out the number of occurences of each base using the function table()
seq_table <- table(seq)
Again, we can index into the table to separate out the numbers for each base.
a.count <- seq_table[[1]]
c.count <- seq_table[[2]]
g.count <- seq_table[[3]]
t.count <- seq_table[[4]]
a.prop <- a.count/seq_length
c.prop <- c.count/seq_length
g.prop <- g.count/seq_length
t.prop <- t.count/seq_length
This step is optional but a useful way to view the data
seq_df <- data.frame(a.prop, c.prop, g.prop, t.prop)
colnames(seq_df) <- c("A", "C", "G", "T")
pander::pander(seq_df)
| A | C | G | T |
|---|---|---|---|
| 0.2 | 0.3 | 0.2 | 0.3 |
For more information on this topic, see
TODO: find one resource related to this topic, such as those found on https://www.statmethods.net/index.html, https://r-charts.com/, http://www.r-tutor.com/, http://www.sthda.com/. (http://www.sthda.com/ is run by the author of ggpubr and has lots of resources for it).