Predict the Next Word

Anat Kedem
August 2015

Predict the Next Word for Textual Input

OVERVIEW

A prediction model delivered through a Shiny application


Capstone final project assignment.

Algorithm

Data Source:
Around 500,000 text observations, originally from Twitter, blogs and news, were analyzed. The text files were sampled to create a training dataset for the analysis and a testing dataset (~20,000 observations) for the accuracy tests.

Cleaning the Data

  • Collapse repeated word and character sequences.
  • Remove extra spaces, abnormally long “words”, long dot-separated “abbreviations”, and excess punctuation.
  • Replace preset formats such as URLs, “$20” and “16th” with placeholders such as “email”, “money” and “numth”. Expand common shorthand, for example “u r” to “you are”.
  • Re-split the observations into sentences at the separators . ! ?
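
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the application: the regular expressions, the “url” placeholder name, the shorthand table and the 20-character length limit are all assumptions made for the example.

```python
import re

# Hypothetical placeholder rules; patterns and the "url" name are assumptions.
PLACEHOLDERS = [
    (re.compile(r"https?://\S+|www\.\S+"), "url"),
    (re.compile(r"\S+@\S+\.\S+"), "email"),
    (re.compile(r"\$\d+(?:\.\d+)?"), "money"),
    (re.compile(r"\b\d+(?:st|nd|rd|th)\b"), "numth"),
]
SHORTHAND = {"u": "you", "r": "are"}  # assumed shorthand table
MAX_WORD_LEN = 20  # assumed cutoff for abnormally long "words"


def clean(text):
    """Return a list of cleaned sentences, each a list of lowercase words."""
    text = text.lower()
    for pattern, name in PLACEHOLDERS:
        text = pattern.sub(name, text)
    # Collapse any character repeated 3+ times to two ("sooooo" -> "soo").
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    sentences = []
    # Re-split at sentence separators.
    for raw in re.split(r"[.!?]+", text):
        # Keep alphabetic tokens only; bare numbers and punctuation are dropped.
        words = re.findall(r"[a-z']+", raw)
        words = [SHORTHAND.get(w, w) for w in words]
        # Collapse immediately repeated words ("the the" -> "the").
        words = [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
        words = [w for w in words if len(w) <= MAX_WORD_LEN]
        if words:
            sentences.append(words)
    return sentences
```

Calling clean("u r sooooo late!!! It cost $20 on the 16th.") returns two sentences: the words “you are soo late” and “it cost money on the numth”.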

Algorithm (continued)

N-Gram Tables

  • Split the clean sentences into separate words.
  • Create n-gram tables from 2-grams up to 6-grams, 5 tables in total.
  • Add statistics variables to the tables.
  • Filter the tables, weighing file size against the statistical importance of each word sequence. The total size of the app is less than 20 MB.
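
As a rough illustration of the table-building step, the sketch below counts word sequences for n = 2 to 6 and drops rare ones. It is written in Python for illustration only. The function name, the `min_count` filter and the use of a plain frequency count as the only statistic are assumptions, since the slides do not specify the actual statistics or filtering rule.

```python
from collections import Counter


def build_ngram_tables(sentences, min_n=2, max_n=6, min_count=1):
    """Map n -> Counter of n-word tuples seen at least min_count times.

    `sentences` is a list of word lists (already cleaned and split).
    """
    tables = {}
    for n in range(min_n, max_n + 1):
        counts = Counter()
        for words in sentences:
            # Slide a window of n words across the sentence.
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        # Assumed filter: keep only sequences that occur often enough.
        tables[n] = Counter({g: c for g, c in counts.items() if c >= min_count})
    return tables
```

Raising `min_count` shrinks the tables at the cost of losing rare sequences, which is one simple way to trade file size against coverage.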

Extracting a Predicted Word

  • Read the user's input text, clean it, split it into words and count them.
  • Choose the highest-order n-gram table that the word count allows. For example, for a 4-word input, search the 5-gram table for the 4 words and return the fifth (the predicted word). If no match is found, take only the last 3 words and search the 4-gram table, and so on.
  • For an empty input, predict a first word from the statistics of the most frequent first words.
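
The back-off search described above might look like the following Python sketch. It is an illustration under stated assumptions rather than the application's implementation: the names `build_lookup` and `predict` are hypothetical, the most frequent continuation is taken as the prediction, and inputs longer than 5 words are assumed to use only their last 5 words.

```python
from collections import Counter, defaultdict


def build_lookup(sentences, max_n=6):
    """Map each context (1 to max_n - 1 words) to a Counter of next words."""
    lookup = defaultdict(Counter)
    first_words = Counter()
    for words in sentences:
        if words:
            first_words[words[0]] += 1
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                lookup[context][words[i + n - 1]] += 1
    return lookup, first_words


def predict(words, lookup, first_words, max_n=6):
    """Try the longest usable context first, then back off to shorter ones."""
    if not words:
        # Empty input: fall back to the most frequent first word.
        return first_words.most_common(1)[0][0] if first_words else None
    for k in range(min(len(words), max_n - 1), 0, -1):
        context = tuple(words[-k:])
        if context in lookup:
            return lookup[context].most_common(1)[0][0]
    return None  # no table contains any suffix of the input
```

For instance, if the training sentences contain “i like green tea” twice and “you like green apples” once, the input “we like green” has no 4-gram match, backs off to the context “like green”, and returns “tea”.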

Shiny Application

How to Use the Application

The application window includes an input/output text area. The user is guided to type or paste some text and click the Submit text button. The predicted next word is then added at the end of the text.


Model Accuracy

The Statistics Behind the Prediction

N-grams are tables of n-word sequences. To create a 4-gram table, for example, the source text is broken into overlapping sequences of 4 words. The sentence “The weather channel website is down.” yields three such sequences (“The weather channel website”, “weather channel website is”, “channel website is down”), which are inserted into a table of four word columns, where each row keeps one sequence. A higher frequency x (see below) means a higher chance that Word4 will appear right after the sequence Word1 to Word3.

[Figure: example 4-gram table with columns Word1 to Word4 and frequency x]

To check the accuracy of the predictions, two tests were run on the model, both with a testing dataset of 20,000 text rows. The first checked accuracy during typing, comparing every predicted next word to the real next word in the data. The second checked the accuracy of predicting only the last word of each row. Accuracy is ~20% for the first test and lower, ~14%, for the second, since the model also suggests conjunctions and stop words that cannot fit as last words. Improving the model would probably require adding context.
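
The two tests could be computed along the lines of the Python sketch below. The function names are hypothetical, and `predict_fn` stands for any function that takes the words typed so far and returns one predicted word. Starting the typing test at the second word of each row (rather than also scoring the empty-input prediction) is an assumption, as the slides do not say how the first word was handled.

```python
def typing_accuracy(rows, predict_fn):
    """Share of positions where the predicted next word equals the real one."""
    hits = total = 0
    for words in rows:
        # Assumption: start at the second word, so there is always some context.
        for i in range(1, len(words)):
            total += 1
            hits += predict_fn(words[:i]) == words[i]
    return hits / total if total else 0.0


def last_word_accuracy(rows, predict_fn):
    """Share of rows whose final word is predicted from the words before it."""
    usable = [words for words in rows if len(words) >= 2]
    hits = sum(predict_fn(words[:-1]) == words[-1] for words in usable)
    return hits / len(usable) if usable else 0.0
```

Both functions take `rows` as a list of word lists, one per text row of the testing dataset, and return a fraction between 0 and 1.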