Executive summary

In my preliminary examination of the text, I found that prepositions were the most common words in every corpus. The blog corpus contained the most words overall, followed by the news and twitter corpora. My findings, and a sketch of a predictive text app based on them, follow.

Loading the data

First, I loaded corpora from blog, news, and twitter sources (in order of decreasing size) and combined a random sample of one third of each of them to form a test data set.
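
As a concrete illustration, the loading step might look roughly like the sketch below. The file names (en_US.blogs.txt, etc.), the seed value, and the helper name sampleLines are illustrative assumptions rather than the exact code used; only the one-third sampling fraction comes from the description above.

## Sketch: read each corpus and keep a random third of its lines
set.seed(1234)  ## arbitrary seed, only for reproducibility of the sample

sampleLines <- function(path, fraction = 1/3) {
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
    sample(lines, size = floor(length(lines) * fraction))
}

corpus <- c(sampleLines("en_US.blogs.txt"),
            sampleLines("en_US.news.txt"),
            sampleLines("en_US.twitter.txt"))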

Tokenization

The text was then broken into individual words, ordered word pairs (bigrams), and ordered word triples (trigrams), following the order of words within each line. I used the tidytext package for this decomposition.
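
For reference, the tokenization might look something like the following sketch. It assumes the sampled lines sit in a character vector corpus (as in the loading sketch above); the data frame and object names are illustrative.

library(tidytext)

## Sketch: one line of text per row, then tokenize into n-grams
corpusDf <- data.frame(text = corpus, stringsAsFactors = FALSE)

unigrams <- unnest_tokens(corpusDf, word,    text)
bigrams  <- unnest_tokens(corpusDf, bigram,  text, token = "ngrams", n = 2)
trigrams <- unnest_tokens(corpusDf, trigram, text, token = "ngrams", n = 3)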

Exploration

The twitter corpus contains 2,360,148 lines, the news corpus 1,010,242 lines, and the blog corpus 899,288 lines. The number of words in each corpus, however, shows the exact opposite trend: blogs have the most, followed by news and then twitter.

Counting the words in the text, we see that by far the most common are prepositions.
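
The wordCount object used below is assumed to be a simple frequency table of the tokenized words, built roughly as follows (the exact code is not shown above):

library(dplyr)

## Sketch: count word frequencies, most common first
wordCount <- count(unigrams, word, sort = TRUE)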

## visualize the 25 most common words
ggplot(wordCount[1:25,], aes(x = reorder(word, n), y = n)) +
    geom_col() + xlab("Word") + ylab("Count") + coord_flip()

Similarly, the most common bigrams are pairs of prepositions or pairs involving the connective ‘if’.
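
The plotting code below expects a bigramCount table with the two words of each bigram in separate columns (word1, word2). One plausible way to build it, using tidyr's separate, is sketched here; the object names are a guess at the actual code.

library(dplyr)
library(tidyr)

## Sketch: split each bigram into its two words, then count
bigramCount <- bigrams %>%
    filter(!is.na(bigram)) %>%        ## drop lines too short to form a bigram
    separate(bigram, into = c("word1", "word2"), sep = " ") %>%
    count(word1, word2, sort = TRUE)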

## visualize the 25 most common bigrams
bigramsJoined <- bigramCount[1:25,] %>%
    mutate(word = paste(word1, word2)) %>%
    select(word, n)
ggplot(bigramsJoined, aes(x = reorder(word, n), y = n)) +
    geom_col() + xlab("Bigram") + ylab("Count") + coord_flip()

It turns out that 45% of the text in the corpus is made up of just 100 distinct words, while it takes 50,000 words to cover about 98% of the text.

## cumulative fraction of the text covered by the n most common words
words <- c(50, 100, 500, 1000, 5000, 10000, 50000)
coverage <- cumsum(wordCount$n)[words] / sum(wordCount$n)
coverageDf <- data.frame(words, coverage)
coverageDf
##   words  coverage
## 1    50 0.3688883
## 2   100 0.4493285
## 3   500 0.6229877
## 4  1000 0.6991207
## 5  5000 0.8636638
## 6 10000 0.9159192
## 7 50000 0.9790247

Predictive text app

To predict text, I will use a trigram model restricted to the 50,000 most common words to keep its size down. The prediction algorithm will work as follows (a rough sketch in R appears after the list):

  1. If the input contains at least two words, check whether the last two words match the first two words of any trigram.
    1. If they do, suggest the third word of the most common trigram beginning with those two words (this is equivalent to a maximum likelihood estimate).
    2. If they do not, check whether the last word appears as the second word of any trigram.
      1. If it does, suggest the third word of the most common trigram with that word in second position.
      2. If it does not, suggest the most common word, ‘a’.
  2. If there is only one word, check whether it appears as the second word of any trigram.
    1. If it does, suggest the most common third word among trigrams with that second word.
    2. If it does not, suggest the most common word, ‘a’.
  3. If no input word appears in the corpus, suggest the most common word, ‘a’.
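
A rough sketch of this back-off scheme is shown below. It assumes a trigramCount table with columns word1, word2, word3 and a count n, already restricted to the 50,000-word vocabulary; the function name predictNextWord and the input tokenization are illustrative, not the final implementation.

library(dplyr)

## Sketch of the back-off described above; trigramCount (word1, word2, word3, n)
## is an assumed, pre-built table of trigram counts.
predictNextWord <- function(text) {
    ## split the input into lower-case words
    toks <- tolower(unlist(strsplit(trimws(text), "\\s+")))
    k <- length(toks)

    ## 1. match the last two words against the first two trigram positions
    if (k >= 2) {
        hits <- trigramCount %>%
            filter(word1 == toks[k - 1], word2 == toks[k]) %>%
            arrange(desc(n))
        if (nrow(hits) > 0) return(hits$word3[1])
    }

    ## 2. back off: match only the last word against the second position
    if (k >= 1) {
        hits <- trigramCount %>%
            filter(word2 == toks[k]) %>%
            arrange(desc(n))
        if (nrow(hits) > 0) return(hits$word3[1])
    }

    ## 3. otherwise suggest the most common word
    "a"
}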

The Shiny app will let users type text into a single field and will display the predicted next word in a clickable box, updated in real time as they type. Clicking the box enters the suggested word into the text.
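
The app could be wired up roughly as in the sketch below. It relies on the predictNextWord function sketched earlier, and the widget choice (a plain actionButton as the clickable box) is an assumption about the final design.

library(shiny)

ui <- fluidPage(
    textInput("text", "Type your text:", value = ""),
    actionButton("suggestion", label = "a")   ## shows the predicted next word
)

server <- function(input, output, session) {
    ## refresh the suggestion whenever the text changes
    observeEvent(input$text, {
        updateActionButton(session, "suggestion",
                           label = predictNextWord(input$text))
    })
    ## clicking the suggestion appends the predicted word to the text
    observeEvent(input$suggestion, {
        updateTextInput(session, "text",
                        value = paste(input$text, predictNextWord(input$text)))
    })
}

shinyApp(ui, server)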