Word Prediction

Coursera Data Science Specialization
Capstone Project

Axel Schwanke
2016-01-24

The application predicts the most likely next word of an incomplete sentence. It also supports the user in completing the currently typed word. The prediction algorithm is based on an n-gram language model with interpolation and discounting techniques. The application was developed in R using the R shiny package.

Open Shiny application

How the predictive model works ...

N-grams are the basis of the word prediction application (Wikipedia).

  • An n-gram is a contiguous sequence of n items from a given text
  • N-grams were collected from twitter, news, and blogs text corpora
  • Each n-gram is associated with its frequency, i.e. the number of times it occurs in the corpora.
  • The probability of the next word in a sentence can be computed from the previous words. The combination of the previous (n-1) words plus the next word can be viewed as an n-gram.
  • To predict the next word of a sentence, the algorithm looks for all n-grams whose first (n-1) words match the last (n-1) words of the sentence. The predicted word is the final word of the matching n-gram (combining orders n=2…4) with the highest weighted frequency; see the sketch after this list.
  • The prediction is improved by smoothing the probabilities of rare or unseen words with absolute discounting (Katz back-off model).
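The look-up step can be illustrated with a short R sketch. The table and column names (ng2, ng3, ng4, hash, word, prob) and the interpolation weights are illustrative assumptions, not the exact ones used in the application; each table is assumed to be a data.table keyed on the hash of its preceding words.

```r
library(data.table)
library(digest)

## ng2, ng3, ng4: data.tables keyed on 'hash' (hash of the preceding words),
## with columns 'word' (candidate next word) and 'prob' (relative frequency)
predict_next_word <- function(words, ng4, ng3, ng2, weights = c(0.6, 0.3, 0.1)) {
  tables <- list(ng4, ng3, ng2)   # longest context first
  orders <- c(3, 2, 1)            # number of preceding words per table
  candidates <- rbindlist(lapply(seq_along(tables), function(i) {
    n <- orders[i]
    if (length(words) < n) return(NULL)
    context <- paste(tail(words, n), collapse = " ")
    hit <- tables[[i]][.(digest(context)), nomatch = 0L]   # keyed hash lookup
    if (nrow(hit) == 0L) return(NULL)
    hit[, .(word, score = prob * weights[i])]              # weight by n-gram order
  }))
  if (nrow(candidates) == 0L) return(NA_character_)
  candidates[, .(score = sum(score)), by = word][order(-score)][1L, word]
}
```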

How the n-gram tables were created ...

The n-grams (n=1…4) were created from the text corpora (news, blogs, twitter) with 5 million lines of text (HC Corpora):

  • Clean and preprocess the text (remove UTF-8 special characters, convert to lowercase, remove punctuation, whitespace, twitter tags, profanity, etc.)
  • Create the n-grams via the dfm function of the R quanteda package
  • Keep only those n-grams with a frequency > X (X between 1 and 5 for n=1…4)
  • Calculate the probability of each n-gram (its frequency divided by the sum of all frequencies)
  • Split each n-gram (n=2…4) into the (n-1) preceding words and the last word
  • Replace the preceding words with their hash value
  • Store each n-gram table as an R data.table keyed on the hashed preceding words; see the sketch after this list
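As an illustration, the following R sketch builds one such table (here for n = 3). It assumes a character vector lines holding the cleaned corpus text and a frequency threshold of 2; the column names are illustrative and not necessarily those used in the application.

```r
library(quanteda)
library(data.table)
library(digest)

## 'lines' is assumed to hold the cleaned corpus text, one element per line
toks   <- tokens_tolower(tokens(lines, remove_punct = TRUE))
ngrams <- tokens_ngrams(toks, n = 3, concatenator = " ")
freqs  <- colSums(dfm(ngrams))                    # n-gram frequencies
freqs  <- freqs[freqs > 2]                        # keep n-grams with frequency > X

dt <- data.table(ngram = names(freqs), freq = as.integer(freqs))
dt[, prob   := freq / sum(freq)]                  # relative frequency
dt[, last   := sub("^.* ", "", ngram)]            # last word
dt[, prefix := sub(" [^ ]+$", "", ngram)]         # preceding (n-1) words
dt[, hash   := vapply(prefix, digest, character(1))]  # hash of the preceding words
setkey(dt, hash)                                  # key for fast look-up
```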

N-gram tables for the Shiny application:

           # n-grams    # n-grams for 50% coverage    # n-grams for 90% coverage
1-gram        75,002                           100                         4,141
2-gram     2,325,469                        16,514                       655,272
3-gram     3,802,948                       175,497                     2,275,668
4-gram     5,812,920                       731,747                     4,518,792
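The coverage columns can be derived from a frequency table like the one in the sketch above: sort the n-grams by frequency and count how many of the most frequent ones are needed to cover a given share of all occurrences. A minimal sketch, assuming dt$freq holds the n-gram frequencies:

```r
## number of most frequent n-grams needed to reach a given share of all occurrences
coverage_count <- function(freq, share) {
  freq <- sort(freq, decreasing = TRUE)
  which(cumsum(freq) / sum(freq) >= share)[1]
}
coverage_count(dt$freq, 0.5)   # n-grams for 50% coverage
coverage_count(dt$freq, 0.9)   # n-grams for 90% coverage
```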

How the application works ...

  • the user opens the Shiny application
  • the user enters a sequence of words
  • the input text is cleaned and preprocessed (same as for n-gram creation)
  • the last (up to 3) words of the input text are extracted
  • based on the n-gram tables, the most likely next word is predicted
  • if selected, further words and phrases are predicted
  • for incomplete last words, word completions are offered
  • if the user presses a completion or prediction button, the corresponding word is appended to the input text
  • the algorithm starts over and predicts the next word (see the sketch below)
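A minimal sketch of how this loop can be wired up in Shiny is shown below. The UI element names are illustrative, the cleaning step is simplified, and predict_next_word and the n-gram tables (from the earlier sketches) are assumed to be already loaded; the actual application offers several prediction and completion buttons.

```r
library(shiny)

ui <- fluidPage(
  textInput("text", "Enter a sentence:"),
  uiOutput("prediction")
)

server <- function(input, output, session) {
  predicted <- reactiveVal(NULL)

  output$prediction <- renderUI({
    ## simplified cleaning; the application applies the same steps as for n-gram creation
    words <- strsplit(tolower(trimws(input$text)), "\\s+")[[1]]
    if (length(words) == 0 || words[1] == "") return(NULL)
    word <- predict_next_word(words, ng4, ng3, ng2)   # look up the most likely next word
    predicted(word)                                   # remember it for the button handler
    actionButton("append", label = word)              # offer the prediction as a button
  })

  observeEvent(input$append, {
    ## append the accepted word; the changed input triggers the next prediction
    updateTextInput(session, "text",
                    value = paste(trimws(input$text), predicted()))
  })
}

shinyApp(ui, server)
```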

What results were produced ...

Some Facts:

  • PC: Windows7x64, 32GB RAM
  • Preprocessing the corpus (5 million lines of text) took about 0.5 hours.
  • Creating the n-grams (n=1…4) took about 2 hours.

Benchmark results:

  • Overall top-3 score: 18.86 %
  • Overall top-1 precision: 14.11 %
  • Overall top-3 precision: 23.07 %
  • Average runtime: 19.89 msec
  • Total memory used: 246.02 MB

Open Shiny application