Johns Hopkins University Coursera Data Science Specialization & SwiftKey

n-gram Text Prediction Model

Eric Thompson
January 2018

Executive Summary

  • We have built an easy-to-use and highly accurate text prediction algorithm.
  • Beginning with just raw text files consisting of several million blog posts, tweets and news articles, we have cleaned and tokenized the text and built an n-gram text prediction model.
  • To use the product, the user simply navigates Shiny app on the web, enters any amount of text and then almost instantaneously receives a prediction for the next word. Simple!
  • Click here for the GitHub repo containing a README and full R scripts for our Shiny app
  • Our Shiny app is deployed here: this link

Description of the Algorithm

  • We first obtained a random sample of the English language from a web crawling service which sampled several million blog posts, tweets and news articles. We merged these text files, took a random sample, converted it to a corpus object and created the tidy input dataset for our model.
  • We did this by finding the most frequently-ocurring groups of words (“n-grams”). For example “I really love football” is a 4-gram and “me too” is a 2-gram.
  • Given a particular n-gram entered by the user, our algorithm first tries to find the n-gram in our input dataset, and if it is found, the user receives back the subsequent word that most-frequently follows that particular n-gram.
  • If the user's n-gram is not found, then a list of so-called “skip-grams” is created from the n-gram, and the algorithm searches sequentially for each of these skip-grams in the dataset. As soon as a match is found, the associated predicted word is returned, just as described above.
  • In our above example of “I really love football”, skip-grams would include “I love football”, “I really football”, “really love football” and “I really love”.
  • Our algorithm eliminates and skip-grams which do not contain the final word entered by the user. In the above example, the n-gram “I really love” would thus not be considered as a possible match in our dataset. This improves model accuracy because the final word in the input is often very closely related to the prediction word; eliminating it hurts accuracy.
  • The following slide shows the algorithm's code.

Description of the Algorithm (cont'd)

  • variable a is entered by the user,
  • toks_list is the list of tokens entered by the user; we use the rev() function to ensure the algorithm first searches for the user's n-gram and, if not found, proceeds in a logical, sequential manner through the possible skip-grams, and
  • last is the final word in the phrase entered by the user and a required part of the skip-gram
a <- NULL
f <- function(a = NULL) {
        # User provides an n-gram "a" and we create a list of its tokens
        toks <- tokens(a)
        toks_list <- tokens_skipgrams(toks,
                                      n = 1:length(toks[[1]]),
                                      skip = 0:4,
                                      concatenator = "_")
        # Re-order token list so it begins with the n-gram (no skips) and proceeds sequentially
        toks_list <- rev(toks_list[[1]])
        # Require the last word entered by user to be matched as part of skip-gram
        last <- toks[[1]][length(toks[[1]])]
        toks_list <- toks_list[grep(last, toks_list)]
        # Loop through the tokens list until a match is found in input dataset
        for (i in 1:length(toks_list)) {
                if (nrow(input[`n-1` == toks_list[i], ]) != 0) {
                        # Return predicted word
                        return(input[`n-1` == toks_list[i], n])
                        break()
                }
        }
}

Model Accuracy

  • We tested model accuracy by partitioning the original corpus into training and testing corpora. We then sampled 250,000 random n-grams from the testing corpus and and evaluated our model over this set.
  • Our model shows accuracy of 5.87%. In other words, given a random input of text (an n-gram), our model correctly predicts the subsequent word approximately 5.87% of the time.
  • For context, The Second Edition of the 20-volume Oxford English Dictionary contains approximately 171,000 words in current use.
  • In other words, our model's 5.87% accuracy is approximately 10,000 times better than a random guess. That is a terrific result.
  • Please contact the author, Eric Thompson (etsafe11@gmail.com), with any inquiries.