Data Science Capstone Text Prediction

Anshuman D Vyas
April 26, 2015

Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain.

Predictive text is an input technology that facilitates typing on a mobile device by suggesting words the end user may wish to insert in a text field. Because the end user simply taps on a word instead of typing it out on a soft keyboard, predictive text can significantly speed up the input process.

The goal of this project is to build a predictive text application that takes a phrase of one or more words as input and returns the 3 most likely next words as output.

Data Preprocessing And Model Development

  • The training set was built by randomly sampling 300,000 lines each from the tweets, news and blogs data.
  • Basic preprocessing was applied: removing punctuation, converting to lower case, removing numbers and stripping extra whitespace.
  • The data was tokenized into 1-gram, 2-gram, 3-gram and 4-gram tokens.
  • These tokens were then used to create their respective lookup data tables.
  • Finally, the data tables were sorted using the “setkey” function for faster lookup.
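
The steps above can be sketched in a few lines. This is a minimal Python illustration, not the original implementation (the “setkey” function suggests that was written in R). The function names preprocess, ngrams and build_tables are hypothetical, and a hashed dictionary stands in for the sorted lookup tables.

```python
import re
import string
from collections import Counter

# Translation table that deletes ASCII punctuation characters.
_PUNCT = str.maketrans("", "", string.punctuation)

def preprocess(line):
    """Lowercase, remove numbers and punctuation, collapse whitespace."""
    line = line.lower()
    line = re.sub(r"\d+", "", line)
    line = line.translate(_PUNCT)
    return " ".join(line.split())

def ngrams(tokens, n):
    """Return all consecutive n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_tables(lines, max_n=4):
    """Count 1-gram to max_n-gram tokens; returns {n: Counter of n-grams}."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = preprocess(line).split()
        for n in tables:
            tables[n].update(ngrams(tokens, n))
    return tables
```

For example, build_tables(["I love you", "I love cats"]) counts the 2-gram ("i", "love") twice. Because dictionary lookups are hashed, no separate sorting step is needed in this sketch.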

Prediction Algorithm

  • Initially, with no user input, the application predicts the 3 words most likely to start a sentence, such as “The”, “I”, “Yes”.
  • Once the user enters a phrase, the application looks up the data table corresponding to the number of words entered and returns the 3 most likely next words.
  • Linear interpolation is then applied to these 3 words to improve accuracy.
  • Stupid Backoff was also implemented: for example, if the 4-word table does not return any result, the 3-word table is looked up, and so on.
  • When the algorithm returns no predictions, the input was most likely a noun. In that case, the application returns predictions such as “is”, “and”, “are”.
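
The lookup-with-backoff logic described above might look roughly like the following Python sketch. It is an illustration under assumptions, not the app's actual code: build_lookup and predict are hypothetical names, the tables are plain dictionaries keyed by the preceding words, and the linear interpolation step is left out because the text does not give its weights.

```python
from collections import Counter, defaultdict

DEFAULT_START = ["The", "I", "Yes"]   # shown when there is no input
FALLBACK = ["is", "and", "are"]       # shown when no table matches

def build_lookup(ngram_counts):
    """Map each context (all words but the last) to counts of the next word."""
    lookup = defaultdict(Counter)
    for gram, count in ngram_counts.items():
        lookup[gram[:-1]][gram[-1]] += count
    return lookup

def predict(phrase, lookups, k=3):
    """Return up to k next words, backing off from the longest context.

    lookups maps n (e.g. 2, 3, 4) to the lookup built from the n-gram counts.
    """
    tokens = phrase.lower().split()
    if not tokens:
        return DEFAULT_START[:k]
    for n in sorted(lookups, reverse=True):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # phrase is too short for this table
        candidates = lookups[n].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return FALLBACK[:k]
```

With 2-word and 3-word tables built from the counts for "i love you" (seen twice) and "i love cats" (seen once), predict("we love", lookups) finds nothing in the 3-word table and backs off to the 2-word table, returning "you" ahead of "cats".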

How To Use The App

  • Enter a phrase with one or more words in the text box provided, and then click Submit.
  • The 3 most likely words will be shown on the right.
  • Every time the input is changed, click Submit or press Enter to update the predictions.
