Next Word Prediction

Bhavana Shah
April 21, 2016

Overview

  • Mobile devices are now ubiquitous, and people spend an enormous amount of time on them for email, social networking, banking, and a whole range of other activities.
  • Typing on a small screen, however, is difficult and error-prone.
  • Smart keyboards can alleviate this problem by predicting the words a user is likely to type next, thereby reducing keystrokes and improving the overall user experience.
  • Designing such a predictive keyboard draws on data science, text analytics, and natural language processing techniques.
  • With numerous languages, varied phrasing styles, and informal texting conventions, building a language model can be challenging.
  • The goal of the capstone project was to design a Shiny application that takes a word, phrase, or sentence as input and outputs the most likely next word.
  • The data was obtained from the HC Corpora site, where details on the corpora are also available.
  • The data comprises three corpora drawn from blogs, tweets, and news articles. The files used for the project are from the English (en_US) locale: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. A minimal loading sketch follows below.
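
The corpora can be read and down-sampled along the following lines. This is only a sketch: the file paths, the seed, and the 5% sampling fraction are illustrative assumptions, not the project's exact settings.

    set.seed(123)                                  # reproducible sampling
    read_sample <- function(path, frac = 0.05) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, size = floor(length(lines) * frac))
    }
    blogs   <- read_sample("en_US.blogs.txt")
    news    <- read_sample("en_US.news.txt")
    twitter <- read_sample("en_US.twitter.txt")
    corpus_sample <- c(blogs, news, twitter)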

Application & Features

  • The Next Word Prediction application is built with Shiny. The user enters a phrase, and the application returns predictions for the next word(s); a minimal sketch of this interaction appears after this list.
  • The input phrase should contain at least two words. The predicted words are presented as radio-button options.
  • Selecting a word from the radio-button list appends it to the input phrase; alternatively, the user can ignore the suggestions and continue typing. Prediction of subsequent words can be repeated in the same manner.
  • If no prediction can be made, the user is shown a feedback message.
  • The Next Word Prediction is available at: https://bhavanashah.shinyapps.io/CapstoneProject/
  • To build and train the language model, a sample of the corpora was first taken and then pre-processed: the text was cleaned of unwanted characters, and punctuation, numbers, and extra whitespace were removed.
  • The tokenized text was then used to create n-grams (1-grams, 2-grams, and 3-grams).
  • Each n-gram table was sorted with the highest frequency at the top. A frequency-of-frequencies table was then created for the 2-grams and 3-grams, for use by the Simple Good-Turing estimator; these steps are sketched after this list.
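
A minimal base-R sketch of the cleaning, tokenization, and counting steps. The app itself may rely on text-mining packages; the regular expressions and the corpus_sample object from the earlier sketch are illustrative assumptions.

    build_ngrams <- function(lines, n) {
      txt <- tolower(lines)
      txt <- gsub("[^a-z' ]", " ", txt)           # strip punctuation, numbers, symbols
      txt <- gsub("\\s+", " ", trimws(txt))       # collapse extra whitespace
      tokens <- strsplit(txt, " ", fixed = TRUE)
      grams <- unlist(lapply(tokens, function(w) {
        w <- w[nzchar(w)]
        if (length(w) < n) return(character(0))
        vapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "), "")
      }))
      sort(table(grams), decreasing = TRUE)       # highest frequency at the top
    }

    bigrams <- build_ngrams(corpus_sample, 2)     # likewise for n = 1 and n = 3
    fof     <- table(as.integer(bigrams))         # frequency-of-frequencies table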
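
The interaction described in the first bullets above can be sketched as a minimal Shiny app. The widget names and the placeholder predict_next() are assumptions standing in for the deployed app's actual code.

    library(shiny)

    predict_next <- function(phrase) c("the", "a", "to")  # placeholder model

    ui <- fluidPage(
      textInput("phrase", "Enter a phrase (at least two words):"),
      uiOutput("choices")
    )

    server <- function(input, output, session) {
      output$choices <- renderUI({
        req(nchar(trimws(input$phrase)) > 0)
        radioButtons("pick", "Predicted next word:",
                     choices = predict_next(input$phrase),
                     selected = character(0))      # start with no selection
      })
      # Selecting a suggestion appends it to the input phrase
      observeEvent(input$pick, {
        updateTextInput(session, "phrase",
                        value = paste(trimws(input$phrase), input$pick))
      })
    }

    # shinyApp(ui, server)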

Algorithm

  • The Next Word Prediction application uses the Simple Good-Turing (SGT) estimator, devised by the late William A. Gale and Geoffrey Sampson [1] in 1995.
  • The SGT estimator works with the frequencies of frequencies of events and is designed to smooth a probability distribution so that it accounts reasonably for events that have not been observed.
  • This technique was chosen for the project because it is straightforward and neither overly complex nor computationally expensive.
  • The Algorithm tab panel of the application describes the method in detail; a simplified sketch follows below.
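
The core of the SGT computation can be sketched in R as follows. This is a simplified version of the Gale-Sampson procedure: the log-linear fit S(r) is used throughout, whereas the full method switches between the raw Turing estimate and the smoothed one based on a significance test.

    sgt_adjust <- function(r, Nr) {
      # r  = observed counts 1, 2, 3, ... (assumes r == 1 is present)
      # Nr = number of n-gram types seen exactly r times
      N <- sum(r * Nr)                            # total tokens observed

      # The "Simple" smoothing step: log-linear fit of Nr against r
      fit <- lm(log(Nr) ~ log(r))
      S   <- function(x) exp(predict(fit, newdata = data.frame(r = x)))

      # Good-Turing adjusted counts: r* = (r + 1) * S(r + 1) / S(r)
      r_star <- (r + 1) * S(r + 1) / S(r)

      # Probability mass reserved for unseen events: P0 = N1 / N;
      # seen-event probabilities are renormalized to sum to 1 - P0
      p0 <- Nr[r == 1] / N
      list(r = r, r_star = r_star,
           prob = (1 - p0) * r_star / sum(Nr * r_star),
           p_unseen = p0)
    }

    # Toy frequency-of-frequencies table, e.g. from a bigram count
    sgt_adjust(r = 1:5, Nr = c(120, 40, 18, 9, 5))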



[1] William A. Gale and Geoffrey Sampson, "Good-Turing Frequency Estimation Without Tears", Journal of Quantitative Linguistics, vol. 2, pp. 217-237, 1995; reprinted in Geoffrey Sampson, Empirical Linguistics, Continuum, 2001.

Future Enhancements

  • A strength of the application is that SGT is simpler, yet far more accurate, than additive smoothing techniques. The application also runs relatively fast.
  • However, loss of context becomes apparent after repeated predictions on the same phrase or sentence; addressing this would require more advanced modeling as a future improvement.
  • At present, the app does not 'learn' from user input; this could be incorporated by saving a history of frequent user inputs.
  • The predictions could also be made more accurate with techniques such as part-of-speech tagging, continuous bag-of-words (CBOW), and skip-gram models.