Data Science Capstone Project: Word Prediction

Juan Luis Herrera Cortijo
25 April 2015

Word Prediction

The App

Word Prediction is a Shiny app that predicts the next word as you write. Word Prediction is:

  • Dynamic: the app updates its prediction with every keystroke (a minimal sketch of this reactive behavior follows this list).
  • Multi-option: at any moment the app offers you its best 3 guesses.
  • Easy to use: just write and click the words as they appear, or press Enter to use the best guess. The app even adds a space for you when you select a word.
  • Sentence aware: it offers a prediction even if you haven't typed any characters yet or have just typed a period.
  • Profanity aware: profane words are masked until they are selected.
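A minimal sketch of how such a per-keystroke update can be wired in Shiny. This is not the app's actual source; `predict_next_words()` below is a dummy stand-in for the trigram model described later:

```r
library(shiny)

# Dummy stand-in for the real model: always returns the same guesses.
predict_next_words <- function(text, n = 3) c("the", "and", "to")[seq_len(n)]

ui <- fluidPage(
  textInput("text", "Type here:"),
  uiOutput("guesses")
)

server <- function(input, output) {
  output$guesses <- renderUI({
    # input$text is reactive, so this block re-runs on every keystroke.
    words <- predict_next_words(input$text, n = 3)
    # One button per guess; clicking one would insert the word plus a space.
    lapply(seq_along(words), function(i) actionButton(paste0("sel", i), words[i]))
  })
}

shinyApp(ui, server)
```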

How it works

Word Prediction brings to your desktop the experience that SwiftKey provides on mobile devices. While you write, it monitors your progress and presents you with 3 words that you can choose at any moment, either by clicking a “Select” button or by hitting the Enter key to use the app's best guess.

It offers two kinds of predictions (see the sketch after this list):

  1. While you type a word, it uses the characters already typed to constrain its guesses.
  2. After you type a space, Word Prediction knows that you are looking for an entirely new word.
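A minimal sketch of how these two modes can be distinguished. The `ngrams` data frame (candidate words with their model probabilities) and the helper's exact shape are assumptions rather than the app's actual code, and for brevity the sketch ignores the preceding-word context that the real model uses:

```r
# Sketch: pick the 3 best candidates, constrained by a partial word if present.
predict_candidates <- function(text, ngrams, n = 3) {
  if (text == "" || grepl(" $", text)) {
    # Mode 2: nothing typed yet, or the last character is a space,
    # so predict an entirely new word.
    candidates <- ngrams
  } else {
    # Mode 1: a word is in progress; keep only candidates starting with it.
    partial <- sub(".*\\s", "", text)
    candidates <- ngrams[startsWith(ngrams$word, partial), ]
  }
  head(candidates$word[order(-candidates$prob)], n)
}
```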

The algorithm

Behind the scenes, Word Prediction runs a prediction algorithm based on a trigram language model (a simplified sketch follows the list below). The main features of this algorithm are:

  • Unigram, bigram and trigram frequencies computed from a set of more than 6.5 million sentences from Twitter, blogs and news.
  • A Katz back-off bigram and trigram model, combined with a discounted unigram model, that computes next-word probabilities.
  • The number of candidates per prediction dynamically constrained to improve performance.
  • Profane words included in the model for better accuracy, but masked at prediction time.
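A simplified sketch of the back-off computation. The `tri`, `bi` and `uni` count tables and the fixed discount `d` are assumptions, and the alpha back-off weights that make the distribution sum to one are omitted for brevity:

```r
# Discounted trigram probability with back-off to bigrams and unigrams.
# tri, bi, uni: named numeric vectors of n-gram counts, e.g.
#   tri["i am going"], bi["am going"], uni["going"].
katz_prob <- function(w1, w2, w3, tri, bi, uni, d = 0.5) {
  trigram <- paste(w1, w2, w3)
  bigram  <- paste(w2, w3)
  if (!is.na(tri[trigram])) {
    # Seen trigram: discounted maximum-likelihood estimate.
    (tri[trigram] - d) / bi[paste(w1, w2)]
  } else if (!is.na(bi[bigram])) {
    # Unseen trigram: back off to the discounted bigram estimate
    # (the alpha back-off weight is omitted in this sketch).
    (bi[bigram] - d) / uni[w2]
  } else {
    # Unseen bigram too: back off to the discounted unigram estimate.
    (uni[w3] - d) / sum(uni)
  }
}
```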

The dataset

At the heart of the app lives its language model. The corpus used to build the model was provided by the Coursera staff and is an English set of Twitter posts, blog posts and news articles from HC Corpora. All the texts were put together and split into sentences to create a mixed corpus. Each sentence in the corpus was (see the sketch after this list):

  1. Stripped of surrounding whitespace.
  2. Lowercased.
  3. Stripped of punctuation and numbers.
  4. Tokenized into 1-, 2- and 3-grams using the RWeka n-gram tokenizer.
  5. Augmented with start-of-sentence and end-of-sentence symbols to model the beginning and end of sentences.
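A sketch of this pipeline applied to a single sentence. The cleaning rules shown are a plausible reading of the steps above, not the project's exact code, and `BOS`/`EOS` stand in for whatever boundary symbols the model uses; here they are added before tokenization so that they appear inside the resulting n-grams:

```r
library(RWeka)

# Steps 1-3 and 5: clean one sentence and add boundary symbols.
clean <- function(s) {
  s <- trimws(s)                             # 1. strip surrounding whitespace
  s <- tolower(s)                            # 2. lowercase
  s <- gsub("[[:punct:][:digit:]]+", "", s)  # 3. remove punctuation and numbers
  paste("BOS", s, "EOS")                     # 5. sentence boundary symbols
}

# Step 4: tokenize into 1-, 2- and 3-grams with the RWeka n-gram tokenizer.
ngrams <- NGramTokenizer(clean(" An example sentence, with 2 numbers! "),
                         Weka_control(min = 1, max = 3))
head(ngrams)
```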