Predictagram: A predictive text model

David L Denton
5/18/2016

Completed as part of the Coursera Data Science Capstone (in cooperation with Johns Hopkins University and SwiftKey).

Overview

Smartphone keyboards commonly use predictive text models to simplify composing a message on a small screen. The fewer keystrokes required, the better the user experience. SwiftKey is one example of a company that has built a successful business on this idea.

Following SwiftKey's example, the goal of this project is to build an algorithm that predicts the next word in a given phrase.

The Data

  • The primary data source used to develop the algorithm is HC Corpora, a publicly available repository.
  • It is a collection of English-language text pulled from three primary sources:
    • Twitter
    • Blogs
    • News
  • This data requires some processing before being used in a predictive model (a minimal cleaning sketch follows this list).
    • Punctuation, numbers, special characters, and profanity are removed.
    • All text is transformed to lower case.
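
A minimal cleaning sketch in base R is shown below. The function name and the badwords vector are illustrative placeholders rather than the actual cleaning script or profanity list used for the project.

    # Illustrative cleaning step; badwords is a placeholder profanity list.
    clean_text <- function(x, badwords = character(0)) {
      x <- tolower(x)                                          # lower case
      x <- gsub("[0-9]+", " ", x)                              # remove numbers
      x <- gsub("[^a-z' ]", " ", x)                            # remove punctuation / special characters
      words <- unlist(strsplit(x, "\\s+"))
      words <- words[nzchar(words) & !(words %in% badwords)]   # drop profanity
      paste(words, collapse = " ")
    }

    clean_text("We're going to the park at 8pm!!")
    # [1] "we're going to the park at pm"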

Methods & Model

  • Using R, the corpus is processed into n-grams: 2-, 3-, 4-, and 5-word tokens.
  • The n-grams are then subdivided into a 'predictor' and a 'prediction' (both steps are sketched after this list).
    • For example, the n-gram “we are going to” becomes:
      • predictor: “we are going”
      • prediction: “to”
  • An input string is matched against the predictors with the same number of words, and the associated prediction is returned.
  • If an input string with n words does not match any of the available predictors, it is truncated to its last n-1 words and compared to the predictors in the next n-gram dictionary (see the back-off sketch after this list).
    • For example, the phrase “Appolonia got to” would be truncated to “got to” and then matched against the predictors in the 3-gram dictionary.
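
A rough base-R sketch of these two steps (tokenizing cleaned text into n-grams, then splitting each n-gram into a predictor and a prediction) is shown below. Function, object, and column names are illustrative, not those of the actual script.

    # Generate n-word tokens from a cleaned string.
    make_ngrams <- function(text, n) {
      words <- unlist(strsplit(text, "\\s+"))
      if (length(words) < n) return(character(0))
      starts <- seq_len(length(words) - n + 1)
      vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), "")
    }

    # Split each n-gram into an (n-1)-word predictor and a 1-word prediction.
    split_ngrams <- function(ngrams) {
      words <- strsplit(ngrams, " ", fixed = TRUE)
      data.frame(
        predictor  = vapply(words, function(w) paste(head(w, -1), collapse = " "), ""),
        prediction = vapply(words, function(w) tail(w, 1), ""),
        stringsAsFactors = FALSE
      )
    }

    split_ngrams(make_ngrams("we are going to the park", 4))
    # e.g. the 4-gram "we are going to" yields
    #      predictor = "we are going", prediction = "to"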
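
The back-off lookup could then be sketched as follows, assuming dictionaries is a list of such predictor/prediction tables (dictionaries[[n]] holding the n-word predictors from the (n+1)-gram dictionary, ordered by frequency so the first match is the most likely one). This is a sketch under those assumptions, not the app's actual implementation.

    # Back-off lookup: try the longest available predictor first, then drop
    # the leading word and retry against the next shorter n-gram dictionary.
    predict_next_word <- function(phrase, dictionaries, max_words = 4) {
      words <- unlist(strsplit(tolower(phrase), "\\s+"))
      words <- words[nzchar(words)]
      if (length(words) == 0) return(NA_character_)
      for (n in seq(min(max_words, length(words)), 1)) {
        key  <- paste(tail(words, n), collapse = " ")
        dict <- dictionaries[[n]]
        hit  <- dict$prediction[dict$predictor == key]
        if (length(hit) > 0) return(hit[1])   # most frequent match
      }
      NA_character_                           # no match at any order
    }

    # predict_next_word("Appolonia got to", dictionaries)
    # falls back from "appolonia got to" to "got to" (3-gram dictionary)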

The Tool

  • The model has been turned into a Shiny app, located here (a minimal wiring sketch follows this list).
  • It allows users to input a phrase of any length and see the most likely next word.
  • To improve performance, the n-gram dictionaries were subset to include only those n-grams that appeared more than once in the corpus (this frequency cut is sketched after this list).
    • These subsets total approximately 4 million n-grams.
    • Before subsetting, the n-gram dictionaries had over 70 million n-grams.
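
The frequency cut could look roughly like the sketch below, reusing the hypothetical dictionaries list from the back-off sketch and assuming each table carries a count column with the n-gram's corpus frequency.

    # Keep only n-grams that appeared more than once in the corpus.
    dictionaries_subset <- lapply(dictionaries, function(dict) {
      dict[dict$count > 1, ]
    })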
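
A minimal sketch of the Shiny wiring is below, reusing the hypothetical predict_next_word() and dictionaries_subset objects from the earlier sketches; the real app's layout and object names differ.

    library(shiny)

    # One text input for the phrase, one text output for the predicted word.
    ui <- fluidPage(
      textInput("phrase", "Enter a phrase:"),
      textOutput("next_word")
    )

    server <- function(input, output) {
      output$next_word <- renderText({
        predict_next_word(input$phrase, dictionaries_subset)
      })
    }

    shinyApp(ui, server)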

Additional Information

  • All preprocessing was done with R.
  • The R script and associated notes are available here.