My little homegrown tribute to SwiftKey - Coursera Data Science Capstone project

Dmitry Danilov
24 Apr 2016

Introduction

  1. The 'My little homegrown tribute to SwiftKey' application is implemented with Shiny, a web application framework for R.
  2. The main purpose of the application is to demonstrate the ability to predict the next word from the previously entered words.
  3. The application mimics certain features of the SwiftKey keyboard for Android.

Main features

  1. The application not only predicts the next word from the whole words already entered but also completes the word currently being typed from its first letters.
  2. The 3 most probable predicted words are displayed as the labels of the 3 buttons located right above the text input field.
  3. The predicted words are ordered by probability as |w_p2|w_p1|w_p3|: the word with the highest probability is on the middle button, the second highest on the left button, and the third highest on the right button.
  4. As the user types in the input field, the predicted words appear on the 3 buttons; the user can press the button with the correct prediction to quickly enter that word into the input field, or simply continue typing.
  5. The predicted words are updated instantly as the user types letters in the input field (a rough sketch of this reactive wiring is given after this list).
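The reactive behaviour of the buttons can be illustrated with a minimal Shiny sketch. This is not the application's actual source: the predict_words() helper is a hypothetical stand-in for the real n-gram lookup, and the widget IDs are invented for illustration.

    library(shiny)

    # Hypothetical stand-in for the real n-gram lookup: returns the
    # 3 most probable next words for the text entered so far.
    predict_words <- function(text) {
      c("the", "of", "and")
    }

    ui <- fluidPage(
      # The 3 prediction buttons sit right above the text input field
      actionButton("left", label = ""),
      actionButton("middle", label = ""),
      actionButton("right", label = ""),
      textInput("text", label = NULL, value = "")
    )

    server <- function(input, output, session) {
      # Recompute the predictions on every keystroke
      observeEvent(input$text, {
        p <- predict_words(input$text)
        updateActionButton(session, "middle", label = p[1]) # highest
        updateActionButton(session, "left", label = p[2])   # 2nd highest
        updateActionButton(session, "right", label = p[3])  # 3rd highest
      })
      # Pressing a button enters the predicted word into the input field
      # (handlers for the other two buttons would be analogous)
      observeEvent(input$middle, {
        word <- predict_words(input$text)[1]
        updateTextInput(session, "text", value = paste(input$text, word))
      })
    }

    shinyApp(ui, server)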

Important facts

  1. Prediction is based on the Kneser-Ney smoothing algorithm.
  2. The main idea of the Kneser-Ney algorithm is that, when estimating the probability of a word given a context, the model takes into account not only the counts in the current context but also the number of distinct contexts in which the word appears (a standard formulation is given after this list).
  3. The model was built from a corpus containing 3 text documents in English: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
  4. The total size of the model is about 58 MB.
  5. The application uses a 4-gram prediction model.
  6. The model consists of a vocabulary, with an integer index mapped to each word, and a set of data tables with higher-order n-grams that contain only the indexes of the words in the vocabulary. This keeps both the size of the model and the duration of lookup operations to a minimum (see the sketch after this list).
  7. The model's prediction accuracy is estimated at around 24%; the estimation was performed on a held-out test set extracted from the same source files.
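For reference, the Kneser-Ney idea from point 2 can be written down in its standard interpolated bigram form (a textbook formulation, not lifted from the application's source; the same idea extends to the 4-gram model):

    P_{KN}(w_i \mid w_{i-1}) =
      \frac{\max(c(w_{i-1} w_i) - d, 0)}{c(w_{i-1})}
      + \lambda(w_{i-1}) \, P_{cont}(w_i)

    P_{cont}(w_i) =
      \frac{\lvert \{ w' : c(w' w_i) > 0 \} \rvert}
           {\lvert \{ (w', w'') : c(w' w'') > 0 \} \rvert}
    \qquad
    \lambda(w_{i-1}) =
      \frac{d}{c(w_{i-1})} \, \lvert \{ w' : c(w_{i-1} w') > 0 \} \rvert

Here d is a fixed discount, and the continuation probability P_cont(w_i) counts the distinct contexts w' that w_i follows, which is exactly the "number of contexts that the word appears in" from point 2.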
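The index-based structure from point 6 can be sketched with the data.table package; the table and column names below are hypothetical, and the tiny toy model exists only to make the lookup concrete:

    library(data.table)

    # Vocabulary: every word is mapped to an integer index
    vocab <- data.table(id = 1:5,
                        word = c("i", "love", "new", "york", "pizza"))
    setkey(vocab, word)

    # A 4-gram table stores only integer indexes, never the words
    # themselves, which keeps it small and the keyed lookups fast
    ngram4 <- data.table(w1 = c(1L, 1L), w2 = c(2L, 2L),
                         w3 = c(3L, 3L), w4 = c(4L, 5L),
                         prob = c(0.6, 0.4))
    setkey(ngram4, w1, w2, w3)

    # Predict the next word after the trigram "i love new":
    ctx <- vocab[.(c("i", "love", "new")), id]        # words -> indexes
    hits <- ngram4[.(ctx[1], ctx[2], ctx[3])]         # keyed join
    vocab[.(hits[order(-prob), w4]), word, on = "id"] # indexes -> words
    #> [1] "york" "pizza"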