Text Prediction

John Sinues
April 2015

Johns Hopkins Data Science Capstone Project

[Figure: probability formulas]

Goals/Considerations

Goal

Develop an application that takes an input phrase and predicts the next word.

Considerations

Accuracy vs. Performance vs. Resources

  • The prediction should be reasonable for a statement ranging from 1 to N words.
  • The prediction should be responsive. You don't want to wait long for an answer.
  • The model must be compact and small enough to load onto the Shiny server.

Setting The Stage

Three files of US blog, news, and Twitter entries provided the data used to develop the model. Combined, they totaled 558 MB, contained over 4.2 million unique terms, and included lines longer than 40K characters.

Before the model was built, the data was cleaned: offensive words and profanity were removed, spelling was checked, and non-printable characters and punctuation were expunged.
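
The cleaning steps can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code (the source does not show its implementation). The function name `clean_line` and the `BLOCKED_WORDS` list are hypothetical, and the spell-checking step is omitted because the source does not say how it was done.

```python
import re
import string

# Hypothetical placeholder list; the project's real profanity list is not given.
BLOCKED_WORDS = {"darn", "heck"}

def clean_line(line, blocked=BLOCKED_WORDS):
    """Return the cleaned tokens of one line of raw text."""
    # Drop anything outside printable ASCII (control bytes, stray encodings).
    printable = "".join(ch for ch in line if ch in string.printable)
    # Lowercase, then replace everything except letters, apostrophes,
    # and whitespace with a space. This removes punctuation and digits.
    no_punct = re.sub(r"[^a-z'\s]", " ", printable.lower())
    # Split on whitespace and filter out blocked words.
    return [tok for tok in no_punct.split() if tok not in blocked]
```

Keeping apostrophes is a design choice made here so that contractions such as "don't" survive as single tokens; the original project may have handled them differently.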

After cleaning, a word frequency table was created, along with a table of two-, three-, and four-word N-grams.

Finally, a model was created using this data and the results presented as a Shiny application.

The Model

Use a back-off N-gram frequency model to estimate the next word.

  • Create a frequency table for each unique term in the corpus.
  • Create N-grams based upon the corpus where N is 2, 3, and 4.
  • Create a frequency table for each unique N-gram.
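
The three steps above can be sketched as a single pass over tokenized lines. This is an illustrative Python sketch under stated assumptions, not the project's code: the names `ngrams` and `build_tables` are hypothetical, and each input line is assumed to be already cleaned and split into tokens. Order 1 plays the role of the per-term frequency table.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_tables(lines, orders=(1, 2, 3, 4)):
    """Count every n-gram of each requested order.

    Returns a dict mapping order n to a Counter of n-gram tuples.
    N-grams never span two lines, since each line is counted separately.
    """
    tables = {n: Counter() for n in orders}
    for tokens in lines:
        for n in orders:
            tables[n].update(ngrams(tokens, n))
    return tables
```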

Predicting The Word

[Figure: pseudocode]
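
A minimal sketch of back-off prediction, written in Python for illustration and not taken from the author's pseudocode. It assumes the frequency tables are held in a dict that maps N-gram order to a `Counter` of token tuples; the function name `predict_next` and that layout are hypothetical. The idea is to look for the longest matching context first (three preceding words against the 4-grams), then back off to shorter contexts, and finally fall back to the most frequent single word.

```python
from collections import Counter

def predict_next(tokens, tables, max_order=4):
    """Return the most frequent next word for the given tokens, or None."""
    for n in range(max_order, 1, -1):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # input too short for this order; back off
        # Collect the final word of every n-gram that starts with the context.
        candidates = Counter({
            gram[-1]: count
            for gram, count in tables.get(n, {}).items()
            if gram[:-1] == context
        })
        if candidates:
            return candidates.most_common(1)[0][0]
    # No context matched at any order: fall back to the most frequent term.
    unigrams = tables.get(1, {})
    if unigrams:
        return max(unigrams, key=unigrams.get)[0]
    return None
```

The linear scan over each table keeps the sketch short. A responsive application would more likely index N-grams by their context so that each lookup is a single dictionary access.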

The Product

[Figure: application UI]

  • During startup, wait until “begin” is displayed, then type in a phrase.
  • Click Predict Next Word to predict the next word; Clear to clear the field.

Features

  • Moderately sized model (< 60 MB) that stores N-grams and their frequencies as CSV files.
  • Responsive application.
  • A word cloud, adjustable with a slider control, showing other probable words that may complete the user's input.
  • Vocabulary of over 31K terms.
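
Storing the N-gram frequencies as CSV could look roughly like the sketch below. It is written in Python for illustration only. The column names `ngram` and `count`, the space-joined N-gram format, and the function names are all assumptions, since the source does not describe the file layout.

```python
import csv
import io
from collections import Counter

def write_table(table, handle):
    """Write an n-gram Counter to an open text handle, most frequent first."""
    writer = csv.writer(handle)
    writer.writerow(["ngram", "count"])  # hypothetical column names
    for gram, count in table.most_common():
        writer.writerow([" ".join(gram), count])

def read_table(handle):
    """Read a table written by write_table back into a Counter."""
    reader = csv.DictReader(handle)
    return Counter({
        tuple(row["ngram"].split(" ")): int(row["count"])
        for row in reader
    })

# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_table(Counter({("i", "love"): 2, ("love", "you"): 1}), buffer)
buffer.seek(0)
restored = read_table(buffer)
```

Writing rows in descending frequency order means a loader can stop early or truncate the file to trade accuracy for size, which fits the accuracy, performance, and resources trade-off described earlier.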