Next Word Prediction Model

Ronen Cohen
April 2015

A next word prediction model suggests relevant words based on context as the user types. It is especially helpful on smartphones and other small devices, where typing is cumbersome and users benefit from typing less and selecting relevant suggestions instead.

Data Preparation

  • Combined a text corpus of over 4 million lines from sources such as Twitter, news, and blogs.
  • Randomly sampled about 7% of each text source to build the language model (a sketch of the full pipeline follows this list).
  • Split paragraphs into sentences.
  • Trimmed excess whitespace.
  • Cleansed the data, keeping only Latin alphanumeric characters.
  • Converted all letters to lower case.
  • Garbage data is naturally disregarded due to its low frequency.
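
A minimal Python sketch of this preparation pipeline, under loud assumptions: the file names are hypothetical (the actual corpus paths are not given here), and the sentence split is naive.

    import random
    import re

    # Hypothetical source files; the actual corpus paths are not given here.
    SOURCES = ["twitter.txt", "news.txt", "blogs.txt"]
    SAMPLE_RATE = 0.07  # keep roughly 7% of each source

    def clean(text):
        """Lower-case, keep only Latin alphanumerics, trim excess whitespace."""
        text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
        return re.sub(r"\s+", " ", text).strip()

    def prepare_corpus(paths=SOURCES, rate=SAMPLE_RATE, seed=42):
        random.seed(seed)
        sentences = []
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if random.random() > rate:  # random ~7% sample of lines
                        continue
                    # Naive split of a paragraph into sentences on ., ! and ?
                    for sent in re.split(r"[.!?]+", line):
                        cleaned = clean(sent)
                        if cleaned:
                            sentences.append(cleaned)
        return sentences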

N-Gram Model

  • Conditions the probability of the next word on the previous 'W-1' words (the context).
    • Phrase: 'I Love You'
    • Context: 'I Love'
    • Next Word: 'You'
  • The length of the context depends on 'W'.
    • 2-gram model considers the previous word only.
    • 3-gram model considers the previous 2 words.
  • The conditional probability of the next word is the number of times the full phrase occurred divided by the number of times the context occurred (see the counting sketch after this list).
    • p(You | I Love) = count(I Love You) / count(I Love)
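
As a sketch of the counting behind this formula (reusing prepare_corpus and clean from the preparation sketch above):

    from collections import Counter

    def ngram_counts(sentences, n):
        """Count every n-gram (as a tuple of words) in the corpus."""
        counts = Counter()
        for sent in sentences:
            words = sent.split()
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        return counts

    sentences = prepare_corpus()
    unigrams = ngram_counts(sentences, 1)
    bigrams = ngram_counts(sentences, 2)
    trigrams = ngram_counts(sentences, 3)

    def mle_prob(context, word, higher, lower):
        """p(word | context) = count(context + word) / count(context)."""
        denom = lower[context]  # Counter returns 0 for unseen contexts
        return higher[context + (word,)] / denom if denom else 0.0

    # p(you | i love) = count(i love you) / count(i love)
    p = mle_prob(("i", "love"), "you", trigrams, bigrams)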

Backoff Model

  • Helps by conditioning on less context when the full context has not been observed often enough (a simplified sketch follows this list).
  • The Backoff language model combines a unigram, bigram, and trigram model.
  • Katz Backoff is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-grams.
  • It accomplishes this estimation by “backing-off” to models with lower order n-grams.
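
Full Katz backoff discounts observed counts (for example with Good-Turing) and redistributes the freed probability mass to the lower-order estimates. The sketch below deliberately uses the simpler 'stupid backoff' scoring (a fixed penalty per backoff step) to illustrate the backing-off mechanism with the count tables built above; it is not a faithful Katz implementation.

    ALPHA = 0.4  # fixed penalty applied each time we back off

    def backoff_score(context, word):
        """Score a candidate next word, backing off trigram -> bigram -> unigram."""
        # Simplified 'stupid backoff' stand-in for Katz backoff: instead of
        # redistributing discounted mass, each backoff step multiplies the
        # observed relative frequency by a fixed penalty.
        penalty = 1.0
        for higher, lower, ctx_len in ((trigrams, bigrams, 2), (bigrams, unigrams, 1)):
            ctx = context[-ctx_len:]
            if len(context) >= ctx_len and higher[ctx + (word,)] > 0:
                return penalty * higher[ctx + (word,)] / lower[ctx]
            penalty *= ALPHA
        total = sum(unigrams.values())  # last resort: plain unigram frequency
        return penalty * unigrams[(word,)] / total if total else 0.0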

Application

  • The application provides a text box where the user types a phrase for which to predict the next word.
  • The application shows the top 5 suggestions as a word cloud (a sketch of the underlying ranking follows below).
  • The user can use a slider to restrict the number of suggested words displayed.
  • Please note that the first load of the app can take up to 15 seconds.
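
The app's own code is not shown here; as a rough illustration of what the front end would call, this sketch ranks candidate next words with the backoff scorer above, with the hypothetical top_n parameter playing the role of the slider.

    def suggest(phrase, top_n=5):
        """Return up to top_n likely next words for the typed phrase."""
        context = tuple(clean(phrase).split())
        # Score every word seen in the corpus; fine for a modest vocabulary.
        scored = ((w, backoff_score(context, w)) for (w,) in unigrams)
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        return [w for w, s in ranked[:top_n] if s > 0]

    print(suggest("I love"))  # likely includes "you" if that trigram is common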