Letter Frequency

Kamerlingh
July 21, 2015

Letter Frequency is a ShinyApp that calculates how often each letter appears in a sample of writing and graphically compares the results with the letter frequencies of the English language or of Shakespeare's writing.

This ShinyApp is a fun way of comparing samples of writing from different authors or trying to tackle texts encrypted with a substitution cypher. It can also be used to compare letter frequencies of other languages with English.

You can find Letter Frequency at https://kamerlingh.shinyapps.io/dataproducts.

What's so interesting about letter frequencies?

Identifying Authors

Each author tends to use a characteristic mix of letters, which gives their writing a letter frequency “signature”. This signature can serve as one piece of evidence for or against a work’s authenticity.

If someone claimed that they found a previously undiscovered play and attributed it to Shakespeare, but the letter frequency of this play did not match that of Shakespeare’s other work, we should be skeptical of this claim.
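One way to put a number on how well two texts "match" is sketched below in base R. The function names `letter_profile` and `profile_distance` are hypothetical, and the distance measure (sum of absolute differences) is an assumption chosen for simplicity, not something the app is known to use. Short samples give noisy profiles, so a result like this is only a hint.

```r
# Percent frequency of each letter a-z in a text (base R only).
letter_profile <- function(text) {
    chars <- strsplit(gsub("[^a-z]", "", tolower(text)), "")[[1]]
    counts <- table(factor(chars, levels = letters))
    setNames(as.numeric(counts) * 100 / length(chars), letters)
}

# Sum of absolute differences between two profiles: 0 means identical
# letter frequencies, 200 means the two texts have no letters in common.
profile_distance <- function(text_a, text_b) {
    sum(abs(letter_profile(text_a) - letter_profile(text_b)))
}

# Illustrative inputs only; real comparisons need much longer samples.
same_letters <- profile_distance("abc", "cab")
no_overlap   <- profile_distance("aaaa", "bbbb")
```

A smaller distance between a disputed text and an author's known work would be consistent with the attribution, while a large distance would be a reason for skepticism.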

Cryptography

The letter frequency of a text that has been encrypted with a substitution cypher can be analyzed to figure out the cypher.

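In a simple substitution cypher, every letter of a text is changed into another letter. For example, all A’s in the unencrypted text might be changed to T’s in the encrypted text. Because a given letter is always replaced by the same letter, the frequency pattern of the original text survives encryption: the most common letter in the encrypted text is a good first guess for E, the most common letter in English.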
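A minimal sketch of this idea in base R follows. The key, the helper names `encrypt` and `decrypt`, and the sample message are all made up for illustration. `chartr()` swaps each character in its first argument for the character at the same position in its second.

```r
# Hypothetical key: each plain letter maps to the letter at the same
# position in `key` (a -> q, b -> w, c -> e, ...).
plain_alphabet <- paste(letters, collapse = "")
key <- "qwertyuiopasdfghjklzxcvbnm"

encrypt <- function(text, key) chartr(plain_alphabet, key, tolower(text))
decrypt <- function(text, key) chartr(key, plain_alphabet, text)

plain_text <- "meet me at the eastern gate at seven"
secret <- encrypt(plain_text, key)

# Count the letters of the encrypted text. The most common one is a
# candidate for plaintext "e", the most common letter in English.
cipher_letters <- strsplit(gsub("[^a-z]", "", secret), "")[[1]]
counts <- sort(table(cipher_letters), decreasing = TRUE)
top_cipher_letter <- names(counts)[1]
```

In this example the most frequent encrypted letter is "t", which is exactly what the key maps "e" to. With a message this short that is partly luck; frequency analysis becomes reliable only on longer texts.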

Some letter frequency data are already available for comparison

Letter Frequency generates a bar plot showing how often each letter occurs in the sample text entered by the user and compares those frequencies with the letter frequencies of either the English language [1] or Shakespeare’s writing [2].

[1]: From Cryptological Mathematics by Robert Lewand via Wikipedia.
[2]: From a student project at UCSD.

[Figure: bar plot of letter frequencies]

Under the hood: How to calculate letter frequencies in R

library(stringr)

# Letters ordered from most to least frequent in English.
abc <- c("e","t","a","o","i","n","s","h","r","d","l","c",
         "u","m","w","f","g","y","p","b","v","k","j","x","q","z")
# Frequency (%) of each letter in English, in the same order as abc.
English <- c(12.702, 9.056, 8.167, 7.507, 6.966, 6.749, 6.327, 6.094, 5.987,
             4.253, 4.025, 2.782, 2.758, 2.406, 2.361, 2.228, 2.015, 1.974,
             1.929, 1.492, 0.978, 0.772, 0.153, 0.15, 0.095, 0.074)

lettercount <- function(text){
    text <- tolower(text)
    # Divide by the number of letters rather than all characters, so that
    # spaces and punctuation do not lower the percentages.
    n_letters <- str_count(text, "[a-z]")

    count_char <- function(letter, some_text){
        str_count(some_text, fixed(letter)) * 100 / n_letters
    }

    # Calculate the frequency (%) of each letter in the text.
    freq <- vapply(abc, count_char, numeric(1), some_text = text)

    # Return a table with the calculated frequencies and those of English,
    # one column per letter.
    rbind(English = English, freq = freq)
}
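A two-row table like the one returned by lettercount can be drawn as a grouped bar plot with base graphics. The sketch below is an assumption about how such a plot could be made, not the app's actual plotting code. To stay short it uses only the five most frequent English letters from the table above and a sample row computed from a short illustrative string.

```r
# Top five English letter frequencies (%), taken from the table above.
top_letters <- c("e", "t", "a", "o", "i")
english <- c(12.702, 9.056, 8.167, 7.507, 6.966)

# Frequencies (%) of the same letters in a short sample string.
sample_text <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
chars <- strsplit(gsub("[^a-z]", "", tolower(sample_text)), "")[[1]]
sample_freq <- vapply(top_letters,
                      function(l) sum(chars == l) * 100 / length(chars),
                      numeric(1))

freq_table <- rbind(English = english, Sample = sample_freq)

# beside = TRUE draws the two rows side by side for each letter;
# legend.text = TRUE labels the bars with the row names.
bar_positions <- barplot(freq_table, beside = TRUE, names.arg = top_letters,
                         legend.text = TRUE, ylab = "Frequency (%)",
                         main = "Letter frequency: sample vs. English")
```

The same call works on a full 26-column table as long as names.arg has one label per column.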

Example output: A comparison between Lorem Ipsum and the English language

Sample text: Lorem Ipsum

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

[Figure: letter frequencies of the Lorem Ipsum sample compared with the English language]