Word Prediction

Coursera Data Science Specialization
Capstone Project

Axel Schwanke
2016-01-24

The application predicts the most likely next word of an incomplete sentence. It also supports the user in completing the currently typed word. The prediction algorithm is based on an n-gram language model with interpolation and discounting techniques. The application was developed in R using the R shiny package.

Open Shiny application

How the predictive model works ...

N-grams are the basis of the word prediction application (Wikipedia).

  • An n-gram is a contiguous sequence of n items from a given text
  • N-grams were collected from twitter, news, and blogs text corpora
  • Each n-gram is associated with its frequency, i.e. the number of times it occurs in the corpora.
  • The probability of the next word in a sentence can be computed from the previous words. The combination of the previous (n-1) words plus the next word can be viewed as an n-gram.
  • To predict the next word of a sentence, the algorithm looks for all n-grams whose first (n-1) words match the last (n-1) words of the sentence. The predicted word is the final word of the matching n-gram (combining orders n=2…4) with the highest weighted frequency; see the sketch after this list.
  • The prediction is improved by smoothing the probabilities of rare or unseen words with absolute discounting (Katz back-off model).
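The look-up step can be illustrated with a short R sketch. The table and column names (ng2, ng3, ng4, hash, word, prob) and the interpolation weights are illustrative assumptions, not the exact ones used in the application; each table is assumed to be a data.table keyed on the hash of its preceding words.

```r
library(data.table)
library(digest)

## ng2, ng3, ng4: data.tables keyed on 'hash' (hash of the preceding words),
## with columns 'word' (candidate next word) and 'prob' (relative frequency)
predict_next_word <- function(words, ng4, ng3, ng2, weights = c(0.6, 0.3, 0.1)) {
  tables <- list(ng4, ng3, ng2)   # longest context first
  orders <- c(3, 2, 1)            # number of preceding words per table
  candidates <- rbindlist(lapply(seq_along(tables), function(i) {
    n <- orders[i]
    if (length(words) < n) return(NULL)
    context <- paste(tail(words, n), collapse = " ")
    hit <- tables[[i]][.(digest(context)), nomatch = 0L]   # keyed hash lookup
    if (nrow(hit) == 0L) return(NULL)
    hit[, .(word, score = prob * weights[i])]              # weight by n-gram order
  }))
  if (nrow(candidates) == 0L) return(NA_character_)
  candidates[, .(score = sum(score)), by = word][order(-score)][1L, word]
}
```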

How the n-gram tables were created ...

The n-grams (n=1…4) were created from the text corpora (news, blogs, twitter) with 5 million lines of text (HC Corpora):

  • Clean and preprocess the text (remove UTF-8 special characters, convert to lowercase, remove punctuation, whitespace, twitter tags, profanity, etc.)
  • Create the n-grams via the dfm function of the R quanteda package
  • Keep only those n-grams with a frequency > X (X between 1 and 5 for n=1…4)
  • Calculate the probability of each n-gram (its frequency divided by the sum of all frequencies)
  • Split each n-gram (n=2…4) into the (n-1) preceding words and the last word
  • Replace the preceding words with their hash value
  • Store each n-gram table as an R data.table keyed on the hashed preceding words; see the sketch after this list
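As an illustration, the following R sketch builds one such table (here for n = 3). It assumes a character vector lines holding the cleaned corpus text and a frequency threshold of 2; the column names are illustrative and not necessarily those used in the application.

```r
library(quanteda)
library(data.table)
library(digest)

## 'lines' is assumed to hold the cleaned corpus text, one element per line
toks   <- tokens_tolower(tokens(lines, remove_punct = TRUE))
ngrams <- tokens_ngrams(toks, n = 3, concatenator = " ")
freqs  <- colSums(dfm(ngrams))                    # n-gram frequencies
freqs  <- freqs[freqs > 2]                        # keep n-grams with frequency > X

dt <- data.table(ngram = names(freqs), freq = as.integer(freqs))
dt[, prob   := freq / sum(freq)]                  # relative frequency
dt[, last   := sub("^.* ", "", ngram)]            # last word
dt[, prefix := sub(" [^ ]+$", "", ngram)]         # preceding (n-1) words
dt[, hash   := vapply(prefix, digest, character(1))]  # hash of the preceding words
setkey(dt, hash)                                  # key for fast look-up
```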

N-gram tables for the Shiny application:

           # n-grams    # n-grams for 50% coverage    # n-grams for 90% coverage
1-gram        75,002                           100                         4,141
2-gram     2,325,469                        16,514                       655,272
3-gram     3,802,948                       175,497                     2,275,668
4-gram     5,812,920                       731,747                     4,518,792
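The coverage columns can be derived from a frequency table like the one in the sketch above: sort the n-grams by frequency and count how many of the most frequent ones are needed to cover a given share of all occurrences. A minimal sketch, assuming dt$freq holds the n-gram frequencies:

```r
## number of most frequent n-grams needed to reach a given share of all occurrences
coverage_count <- function(freq, share) {
  freq <- sort(freq, decreasing = TRUE)
  which(cumsum(freq) / sum(freq) >= share)[1]
}
coverage_count(dt$freq, 0.5)   # n-grams for 50% coverage
coverage_count(dt$freq, 0.9)   # n-grams for 90% coverage
```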

How the application works ...

  • the user opens the Shiny application
  • the user enters a sequence of words
  • the input text is cleaned and preprocessed (same as for n-gram creation)
  • the last (up to 3) words of the input text are extracted
  • based on the n-gram tables, the most likely next word is predicted
  • if selected, further words and phrases are predicted
  • for incomplete last words, word completions are offered
  • if the user presses a completion or prediction button, the corresponding word is appended to the input text
  • the algorithm starts over and predicts the next word (see the sketch below)
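A minimal sketch of how this loop can be wired up in Shiny is shown below. The UI element names are illustrative, the cleaning step is simplified, and predict_next_word and the n-gram tables (from the earlier sketches) are assumed to be already loaded; the actual application offers several prediction and completion buttons.

```r
library(shiny)

ui <- fluidPage(
  textInput("text", "Enter a sentence:"),
  uiOutput("prediction")
)

server <- function(input, output, session) {
  predicted <- reactiveVal(NULL)

  output$prediction <- renderUI({
    ## simplified cleaning; the application applies the same steps as for n-gram creation
    words <- strsplit(tolower(trimws(input$text)), "\\s+")[[1]]
    if (length(words) == 0 || words[1] == "") return(NULL)
    word <- predict_next_word(words, ng4, ng3, ng2)   # look up the most likely next word
    predicted(word)                                   # remember it for the button handler
    actionButton("append", label = word)              # offer the prediction as a button
  })

  observeEvent(input$append, {
    ## append the accepted word; the changed input triggers the next prediction
    updateTextInput(session, "text",
                    value = paste(trimws(input$text), predicted()))
  })
}

shinyApp(ui, server)
```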

What results were produced ...

Some Facts:

  • PC: Windows7x64, 32GB RAM
  • Preprocessing the corpus (5 million lines of text) took about 0.5 hours.
  • Creating the n-grams (n=1…4) took about 2 hours.

Benchmark results:

  • Overall top-3 score: 18.86 %
  • Overall top-1 precision: 14.11 %
  • Overall top-3 precision: 23.07 %
  • Average runtime: 19.89 msec
  • Total memory used: 246.02 MB

Open Shiny application