Next Word Prediction

Bhavana Shah
April 21, 2016

Overview

  • Mobile devices are now ubiquitous, and people spend an enormous amount of time on them for email, social networking, banking, and a whole range of other activities.
  • Typing on a small screen, however, is difficult and error-prone.
  • Smart keyboards can alleviate this problem by predicting the words a user is likely to type next, thereby reducing keystrokes and improving the overall user experience.
  • Designing such a predictive keyboard draws on data science, text analytics, and natural language processing techniques.
  • With numerous languages, varied phrasing styles, and informal texting conventions, building a language model can be challenging.
  • The goal of the capstone project was to design a Shiny application that takes a word, phrase, or sentence as input and outputs the most likely next word.
  • The data was obtained from the HC Corpora site, where details on the corpora are also available.
  • The data comprises three corpora drawn from blogs, tweets, and news articles. The files used for the project are from the English (en_US) locale: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. A minimal loading sketch follows below.
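
The corpora can be read and down-sampled along the following lines. This is only a sketch: the file paths, the seed, and the 5% sampling fraction are illustrative assumptions, not the project's exact settings.

    set.seed(123)                                  # reproducible sampling
    read_sample <- function(path, frac = 0.05) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, size = floor(length(lines) * frac))
    }
    blogs   <- read_sample("en_US.blogs.txt")
    news    <- read_sample("en_US.news.txt")
    twitter <- read_sample("en_US.twitter.txt")
    corpus_sample <- c(blogs, news, twitter)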

Application & Features

  • The Next Word Prediction application is built with Shiny. The user enters a phrase, and the application returns predictions for the next word(s); a minimal sketch of this interaction appears after this list.
  • The input phrase should contain at least two words. The predicted words are presented as radio-button options.
  • Selecting a word from the radio-button list appends it to the input phrase; alternatively, the user can ignore the suggestions and continue typing. Prediction of subsequent words can be repeated in the same manner.
  • If no prediction can be made, the user is shown a feedback message.
  • The Next Word Prediction is available at: https://bhavanashah.shinyapps.io/CapstoneProject/
  • To build and train the language model, a sample of the corpora was first taken and then pre-processed: the text was cleaned of unwanted characters, and punctuation, numbers, and extra whitespace were removed.
  • The tokenized text was then used to create n-grams (1-grams, 2-grams, and 3-grams).
  • Each n-gram table was sorted with the highest frequency at the top. A frequency-of-frequencies table was then created for the 2-grams and 3-grams, for use by the Simple Good-Turing estimator; these steps are sketched after this list.
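
A minimal base-R sketch of the cleaning, tokenization, and counting steps. The app itself may rely on text-mining packages; the regular expressions and the corpus_sample object from the earlier sketch are illustrative assumptions.

    build_ngrams <- function(lines, n) {
      txt <- tolower(lines)
      txt <- gsub("[^a-z' ]", " ", txt)           # strip punctuation, numbers, symbols
      txt <- gsub("\\s+", " ", trimws(txt))       # collapse extra whitespace
      tokens <- strsplit(txt, " ", fixed = TRUE)
      grams <- unlist(lapply(tokens, function(w) {
        w <- w[nzchar(w)]
        if (length(w) < n) return(character(0))
        vapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "), "")
      }))
      sort(table(grams), decreasing = TRUE)       # highest frequency at the top
    }

    bigrams <- build_ngrams(corpus_sample, 2)     # likewise for n = 1 and n = 3
    fof     <- table(as.integer(bigrams))         # frequency-of-frequencies table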
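
The interaction described in the first bullets above can be sketched as a minimal Shiny app. The widget names and the placeholder predict_next() are assumptions standing in for the deployed app's actual code.

    library(shiny)

    predict_next <- function(phrase) c("the", "a", "to")  # placeholder model

    ui <- fluidPage(
      textInput("phrase", "Enter a phrase (at least two words):"),
      uiOutput("choices")
    )

    server <- function(input, output, session) {
      output$choices <- renderUI({
        req(nchar(trimws(input$phrase)) > 0)
        radioButtons("pick", "Predicted next word:",
                     choices = predict_next(input$phrase),
                     selected = character(0))      # start with no selection
      })
      # Selecting a suggestion appends it to the input phrase
      observeEvent(input$pick, {
        updateTextInput(session, "phrase",
                        value = paste(trimws(input$phrase), input$pick))
      })
    }

    # shinyApp(ui, server)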

Algorithm

  • The Next Word Prediction application uses the Simple Good-Turing (SGT) estimator, devised by the late William A. Gale and Geoffrey Sampson [1] in 1995.
  • The SGT estimator works with the frequencies of frequencies of events and is designed to smooth a probability distribution so that it accounts reasonably for events that have not been observed.
  • This technique was chosen for the project because it is straightforward and neither overly complex nor computationally expensive.
  • The Algorithm tab panel of the application describes the method in detail; a simplified sketch follows below.
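
The core of the SGT computation can be sketched in R as follows. This is a simplified version of the Gale-Sampson procedure: the log-linear fit S(r) is used throughout, whereas the full method switches between the raw Turing estimate and the smoothed one based on a significance test.

    sgt_adjust <- function(r, Nr) {
      # r  = observed counts 1, 2, 3, ... (assumes r == 1 is present)
      # Nr = number of n-gram types seen exactly r times
      N <- sum(r * Nr)                            # total tokens observed

      # The "Simple" smoothing step: log-linear fit of Nr against r
      fit <- lm(log(Nr) ~ log(r))
      S   <- function(x) exp(predict(fit, newdata = data.frame(r = x)))

      # Good-Turing adjusted counts: r* = (r + 1) * S(r + 1) / S(r)
      r_star <- (r + 1) * S(r + 1) / S(r)

      # Probability mass reserved for unseen events: P0 = N1 / N;
      # seen-event probabilities are renormalized to sum to 1 - P0
      p0 <- Nr[r == 1] / N
      list(r = r, r_star = r_star,
           prob = (1 - p0) * r_star / sum(Nr * r_star),
           p_unseen = p0)
    }

    # Toy frequency-of-frequencies table, e.g. from a bigram count
    sgt_adjust(r = 1:5, Nr = c(120, 40, 18, 9, 5))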



[1] William A. Gale and Geoffrey Sampson, "Good-Turing Frequency Estimation Without Tears", Journal of Quantitative Linguistics, vol. 2, pp. 217-237, 1995; reprinted in Geoffrey Sampson, Empirical Linguistics, Continuum, 2001.

Future Enhancements

  • A strength of the application is that SGT is simpler, yet far more accurate, than additive smoothing techniques. The application also runs relatively fast.
  • However, loss of context becomes apparent after repeated predictions on the same phrase or sentence; addressing this would require more advanced modeling as a future improvement.
  • At present, the app does not 'learn' from user input; this could be incorporated by saving a history of frequent user inputs.
  • The predictions could also be made more accurate with techniques such as part-of-speech tagging, continuous bag-of-words (CBOW), and skip-gram models.