TextNext

Ian Reeve
13th December 2014

TextNext is a simple demonstration of how big data can be exploited to predict, with some precision, what might appear to be a highly unpredictable variable. In this case, that variable is the next word following a sequence of four or fewer words.

https://ianreeve.shinyapps.io/TextPredictor/

There are four text boxes, but the app parses whatever is entered, so you can type in as much text as you like and it will consider only the last four words. The app is reactive, so suggestions are made as you enter text. A sketch of this input handling is given below.
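
As an illustration of that input handling, here is a minimal sketch in R (the language the app is built in). The function name and cleaning rules are my own assumptions, not the app's actual code.

    # Sketch: keep only the last four words of whatever the user has typed.
    # (Illustrative only; the real app's cleaning rules will differ.)
    last_four_words <- function(text) {
      text  <- tolower(text)
      text  <- gsub("[^a-z' ]", " ", text)    # strip punctuation and digits
      words <- unlist(strsplit(text, "\\s+"))
      words <- words[words != ""]             # drop empty tokens
      tail(words, 4)                          # at most the last four words
    }

    last_four_words("Well, I really think that this app is quite")
    # [1] "this"  "app"   "is"    "quite"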

What is TextNext?

The expected accuracy of correctly predicting the fifth word from the preceding four is around 17%. Indeed, the accuracies when predicting from three or two preceding words are very similar. There is only a significant deterioration when the algorithm has a single word to go on, and even in that instance the accuracy is around 13%.

The app was developed over six weeks and sits on the ShinyApps.io platform. It addresses the problem posed for the capstone project of the Coursera Data Science Specialization. Thanks go to professors Roger Peng, Jeff Leek and Brian Caffo at JHU, who have taught the specialization, and to Swiftkey, who commercially develop predictive text applications for smartphones.

The source of the text corpora on which the work is based is http://www.corpora.heliohost.org/, consisting of text collected by a web crawler from online news articles, blogs and Tweets. I have used the English language set.

How Was TextNext built?

  • Read in the full corpora and partitioned them into sets for training, testing and validation. The 60% allocated to training was used to develop the tokenizer and an initial set of n-grams. A tokenizer was developed to handle punctuation and all the oddities in the data. The data were split into sentences and then into word chains or n-grams (sequences of 1 to 4 words followed by one further word); a sketch of this step is given after this list.

  • Any n-gram containing a word with fewer than 10 occurrences in the entire training set was removed. N-grams predicting “swear words” were also removed. Where an n-gram gave rise to more than one prediction, the prediction with the greatest frequency for that n-gram was chosen; if there was a tie, the frequency of the single predicted word in the entire training set was used to decide. In total this generated over 290 million n-grams.
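
To make those two steps concrete, here is a minimal R sketch of counting n-grams from cleaned sentences and pruning rare words. All names are my own and the frequency threshold is shrunk for the toy data; the real pipeline also filtered swear words and broke frequency ties using unigram counts.

    # Sketch: build (stem, next-word) pairs from cleaned sentences.
    build_ngrams <- function(sentences, n) {
      out <- list()
      for (s in sentences) {
        words <- unlist(strsplit(tolower(s), "\\s+"))
        if (length(words) < n + 1) next
        for (i in seq_len(length(words) - n)) {
          stem <- paste(words[i:(i + n - 1)], collapse = " ")  # n-word stem
          out[[length(out) + 1]] <- c(stem, words[i + n])      # word that follows
        }
      }
      do.call(rbind, out)                       # two-column character matrix
    }

    sentences <- c("the cat sat on the mat",
                   "the cat sat on a chair")
    four_grams <- build_ngrams(sentences, 4)

    # Drop n-grams containing a rare word (threshold is 10 in the real
    # pipeline; 1 here so the toy data survives).
    word_freq  <- table(unlist(strsplit(tolower(sentences), "\\s+")))
    keep_words <- names(word_freq[word_freq >= 1])
    ok <- apply(four_grams, 1, function(r)
      all(c(unlist(strsplit(r[1], " ")), r[2]) %in% keep_words))
    four_grams <- four_grams[ok, , drop = FALSE]

    # For each stem, keep only the most frequent predicted word.
    best <- sapply(unique(four_grams[, 1]), function(k) {
      preds <- four_grams[four_grams[, 1] == k, 2]
      names(sort(table(preds), decreasing = TRUE))[1]
    })
    best    # named vector: stem -> most likely next word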

How Was TextNext built? (cont'd)

  • The intended design was that a user would enter up to four words and the app would predict the next. Assuming the entry was four words, the algorithm would look for a matching 4-gram and return the associated predicted word. If there was no match, the first word would be disregarded and the 3-gram dataset checked, and so on. If at the end there was still no match, the word “and” would be returned (n.b. “the” is a more common word, but largely because it starts sentences). Entries with fewer than four words work in the same way but commence the reductive process at a later stage. A sketch of this back-off is given after this list.

  • Using the 290 million n-grams, this approach was applied to half of the testing set. A random n-gram was selected from each sentence. This created 869,146 4-grams to test, achieving a success rate of 19.7%. The process was repeated for 3-, 2- and 1-grams. In total, 4.1 million tests were performed.
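
The back-off lookup in the first bullet, and the style of test in the second, might look roughly like the following R sketch. The names and the list-of-named-vectors storage are my own assumptions; the real app used much larger, indexed n-gram tables.

    # Sketch of the back-off prediction. 'ngram_tables' is assumed to be a list
    # of named character vectors, one per n-gram order, mapping a space-separated
    # stem to its single most likely next word.
    predict_next <- function(words, ngram_tables, default = "and") {
      words <- tail(words, 4)                   # use at most the last four words
      if (length(words) == 0) return(default)
      for (n in seq(length(words), 1)) {        # longest stem first, then back off
        stem <- paste(tail(words, n), collapse = " ")
        hit  <- ngram_tables[[n]][stem]
        if (!is.na(hit)) return(unname(hit))    # first match wins
      }
      default                                   # nothing matched: fall back to "and"
    }

    # Toy tables for illustration only.
    ngram_tables <- list(
      c("on" = "the"),                          # 1-grams
      c("sat on" = "the"),                      # 2-grams
      c("cat sat on" = "the"),                  # 3-grams
      c("the cat sat on" = "the")               # 4-grams
    )
    predict_next(c("the", "cat", "sat", "on"), ngram_tables)    # 4-gram hit: "the"
    predict_next(c("dog", "cat", "sat", "on"), ngram_tables)    # backs off to the 3-gram
    predict_next(c("no", "match", "at", "all"), ngram_tables)   # falls back to "and"

    # The test followed the same idea: one randomly chosen n-gram per held-out
    # sentence, checking the prediction against the word that actually followed.
    test_stems <- list(c("the", "cat", "sat", "on"), c("no", "match", "at", "all"))
    test_truth <- c("the", "mat")
    mean(mapply(function(s, y) predict_next(s, ngram_tables) == y,
                test_stems, test_truth))        # proportion correct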

How Was TextNext built? (cont'd)

  • The 290 million n-gram set was then drastically filtered to remove any n-grams that hadn't been matched in this first exercise, whittling it down to 1.25 million n-grams. A second test was carried out using this smaller set and the other half of the testing set, which had so far remained untouched. The success rate for 4-grams fell from 19.7% to 16.8%. The total vocabulary shrank from 82,399 to 69,068 words and the n-grams were re-indexed, as sketched below.
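
As a sketch of that final filtering and re-indexing step (the object names and the data frame layout are my own assumptions):

    # Sketch: keep only n-grams whose stem was matched at least once during the
    # first test, then rebuild the vocabulary index over the smaller set.
    ngrams <- data.frame(stem       = c("the cat sat on", "a dog ran to"),
                         prediction = c("the", "school"),
                         stringsAsFactors = FALSE)
    matched_stems <- c("the cat sat on")      # stems hit during the first test

    ngrams <- ngrams[ngrams$stem %in% matched_stems, ]

    # Re-index the (now smaller) vocabulary: word -> integer id.
    vocab   <- sort(unique(c(unlist(strsplit(ngrams$stem, " ")), ngrams$prediction)))
    word_id <- setNames(seq_along(vocab), vocab)
    word_id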