Word Statistics in DNA

The simplest patterns that you can have in DNA are words. It can be useful to carry out a statistical analysis of the different word lengths in DNA to see if there is a particular skew in their distribution and to determine which evolutionary models are most likely to apply to a particular set of sequences. This can be important if an organism has an unusual distribution of nucleotides.

The core R table function will summarise the total numbers of the nucleotides. To look for words you use the count function.

library("seqinr")
seq <- read.fasta("dnafasta.fas")

table(seq)
LEGACY
  a   c   g   t 
238 265 228 190 
seq <- as.data.frame(seq)
count(seq$LEGACY, 1)

  a   c   g   t 
238 265 228 190 
doublet <- count(seq$LEGACY, 2)
doublet

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
83 67 46 41 65 75 69 56 56 74 52 46 33 49 61 47 
triplet <- count(seq$LEGACY, 3)
triplet

aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat 
 33  28  16   5  13  27  14  13   6  21  11   8   5  17  12   7  17  14  25   9 
cca ccc ccg cct cga cgc cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct 
 24  17  22  12  19  15  20  15   5  14  26  11  22  15   2  17  16  13  24  21 
gga ggc ggg ggt gta gtc gtg gtt taa tac tag tat tca tcc tcg tct tga tgc tgg tgt 
  9  21   7  15   8   9  11  18  11  10   3   9  12  18   9  10  22  17  14   8 
tta ttc ttg ttt 
 15   9  12  11 
four <- count(seq$LEGACY, 4)
four

aaaa aaac aaag aaat aaca aacc aacg aact aaga aagc aagg aagt aata aatc aatg aatt 
  14   10    8    1    8    3    6   11    2    4    6    4    2    1    1    1 
acaa acac acag acat acca accc accg acct acga acgc acgg acgt acta actc actg actt 
   4    5    1    3    9    8    7    3    3    4    4    3    1    3    8    1 
agaa agac agag agat agca agcc agcg agct agga aggc aggg aggt agta agtc agtg agtt 
   2    2    0    2    5    5    7    4    2    4    3    2    0    1    5    2 
ataa atac atag atat atca atcc atcg atct atga atgc atgg atgt atta attc attg attt 
   2    2    0    1    7    3    3    4    5    3    1    3    1    2    1    3 
caaa caac caag caat caca cacc cacg cact caga cagc cagg cagt cata catc catg catt 
   6    7    3    1    0   11    3    0    4   15    3    3    1    5    1    2 
ccaa ccac ccag ccat ccca cccc cccg ccct ccga ccgc ccgg ccgt ccta cctc cctg cctt 
   4    4   13    3    5    3    7    2    5    4   12    1    2    2    4    4 
cgaa cgac cgag cgat cgca cgcc cgcg cgct cgga cggc cggg cggt cgta cgtc cgtg cgtt 
   5    6    1    7    1    4    4    6    3   10    0    7    2    4    3    6 
ctaa ctac ctag ctat ctca ctcc ctcg ctct ctga ctgc ctgg ctgt ctta cttc cttg cttt 
   2    2    0    1    3    7    3    1    9    8    6    3    2    2    4    3 
gaaa gaac gaag gaat gaca gacc gacg gact gaga gagc gagg gagt gata gatc gatg gatt 
   8    8    4    2    4    8    3    0    0    1    0    1    2    8    6    1 
gcaa gcac gcag gcat gcca gccc gccg gcct gcga gcgc gcgg gcgt gcta gctc gctg gctt 
   6    3    6    1    4    1    3    5    7    6    2    9    1    7   10    3 
ggaa ggac ggag ggat ggca ggcc ggcg ggct ggga gggc gggg gggt ggta ggtc ggtg ggtt 
   5    2    0    2    7    1    9    4    1    2    2    2    5    3    2    5 
gtaa gtac gtag gtat gtca gtcc gtcg gtct gtga gtgc gtgg gtgt gtta gttc gttg gttt 
   4    2    2    0    1    3    3    2    5    2    2    2    7    4    5    2 
taaa taac taag taat taca tacc tacg tact taga tagc tagg tagt tata tatc tatg tatt 
   5    3    1    1    1    5    2    2    0    1    2    0    0    3    3    3 
tcaa tcac tcag tcat tcca tccc tccg tcct tcga tcgc tcgg tcgt tcta tctc tctg tctt 
   3    2    5    2    6    5    5    2    4    1    2    2    1    2    4    3 
tgaa tgac tgag tgat tgca tgcc tgcg tgct tgga tggc tggg tggt tgta tgtc tgtg tgtt 
  10    5    1    6    3    3    4    7    3    5    2    4    1    1    1    5 
ttaa ttac ttag ttat ttca ttcc ttcg ttct ttga ttgc ttgg ttgt ttta tttc tttg tttt 
   3    4    1    7    1    5    0    3    3    4    5    0    5    1    2    3 

As tabulated data it is hard to see any differences. It is much easier with a graphical display of the data.

library("ggplot2")
doublet <- as.data.frame(doublet)
ggplot(doublet, aes(x=Var1, y=Freq)) + 
  geom_bar(stat = "identity", width=0.4, color="#4B0082") +
  xlab("Doublet") +
  coord_flip()

The TA and AT pairings are under-represented compared to the others and this is what you expect for a gene sequence which is GC rich. The T nucleotide is under-represented overall.