FAQ: How do I calculate the frequency of each nucleotide in a DNA sequence?

# required packages
library(Biostrings)

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

## Loading required package: XVector

## Loading required package: GenomeInfoDb

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

library(pander)

## Warning: package 'pander' was built under R version 4.1.2

QUESTION: How do I calculate the frequency of each nucleotide in a DNA sequence?

For any given DNA sequence, it may be useful to calculte the porprtion or percentage of each base in the sequence.

Setting up the data

I am just using a random sequence of DNA. I am also saving the length of the sequence since this will be useful later.

seq <- "CATTCAGTGC"
seq_length <- nchar(seq)

Convert to character vector

The first step is to split this into a character vector (separate the bases) The argument split specifies what where the string should be split. Since we want individual bases, an empty string will do.

Also, since the output is a list, and we only want the first element, I index into the list specifying that the seqience should only hold one element.

seq <- strsplit(seq, split="")[[1]]
seq

##  [1] "C" "A" "T" "T" "C" "A" "G" "T" "G" "C"

Count base occurences in sequence

We can find out the number of occurences of each base using the function table()

seq_table <- table(seq)

Again, we can index into the table to separate out the numbers for each base.

a.count <- seq_table[[1]]
c.count <- seq_table[[2]]
g.count <- seq_table[[3]]
t.count <- seq_table[[4]]

Calculate proportions

a.prop <- a.count/seq_length
c.prop <- c.count/seq_length
g.prop <- g.count/seq_length
t.prop <- t.count/seq_length

Store in a dataframe

This step is optional but a useful way to view the data

seq_df <- data.frame(a.prop, c.prop, g.prop, t.prop)
colnames(seq_df) <- c("A", "C", "G", "T")

pander::pander(seq_df)

A	C	G	T
0.2	0.3	0.2	0.3

Additional Reading

For more information on this topic, see

TODO: find one resource related to this topic, such as those found on https://www.statmethods.net/index.html, https://r-charts.com/, http://www.r-tutor.com/, http://www.sthda.com/. (http://www.sthda.com/ is run by the author of ggpubr and has lots of resources for it).

Keywords

sequence
DNA
proportion
strsplit()
table()