Data Science Capstone

Johannes (Vanni) de Clippele
November 2018

This document presents a word prediction application: how it was developed and how to use it. It is the Capstone project for the Data Science specialisation on Coursera, offered by Johns Hopkins University. SwiftKey provided the data for this project.

Introduction

SwiftKey uses a blend of artificial intelligence technologies to predict the next word the user intends to type. The goal of this project is to develop an application which does just that, applying the skills learned during the specialisation to a brand new type of data and analysis: natural language processing.

Having tested the assumption that cutting the data into 5 disjoint chunks and dropping rare occurrences in each chunk does not incur significant information loss, I was able to process 40% of a huge dataset (~70 million words) of tweets, blogs and news items into a library of 5 small tables containing the necessary information (N-grams and their counts).
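The chunked preprocessing described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for the project: the function names, the way lines are assigned to chunks, and the `min_count` threshold are all assumptions, since the actual pruning threshold is not stated here.

```python
from collections import Counter

def ngram_counts(lines, n):
    """Count all n-grams (as word tuples) in a list of text lines."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

def chunked_counts(lines, n, n_chunks=5, min_count=2):
    """Count n-grams chunk by chunk, dropping rare ones before merging.

    min_count is a hypothetical threshold chosen for illustration only.
    """
    merged = Counter()
    for c in range(n_chunks):
        chunk = lines[c::n_chunks]  # disjoint chunks: every n_chunks-th line
        counts = ngram_counts(chunk, n)
        # Prune rare n-grams within the chunk, then add the rest to the total.
        merged.update({gram: k for gram, k in counts.items() if k >= min_count})
    return merged

lines = ["i love you"] * 10 + ["cats love milk"]
bigrams = chunked_counts(lines, n=2)
```

Pruning inside each chunk keeps memory use bounded, at the cost of losing n-grams that are rare in every chunk but not rare overall. In this toy run the bigrams of "cats love milk" are dropped, while the counts for "i love" and "love you" are preserved.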

The predictive model is based upon this preprocessed library, and has been wrapped up in a responsive app…

The algorithm: a 5-gram back-off model

Predicting the next word from a given phrase involves looking up the given phrase in our library and presenting the “next” words, scored by their observed frequencies relative to the frequency of the phrase. For example, looking up “i love” predicts “you” (2/3) and “movies” (1/3), from the library below:

  • i love you
  • i love movies
  • i love you
  • cats love milk
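The lookup in this example can be reproduced with a small sketch. It assumes the library consists of the four phrases "i love you", "i love movies", "i love you" and "cats love milk"; the names `build_table` and `predict` are made up for illustration and are not taken from the application.

```python
from collections import Counter, defaultdict

# The toy library from the example, one phrase per entry.
corpus = ["i love you", "i love movies", "i love you", "cats love milk"]

def build_table(phrases, context_len):
    """Map each context of context_len words to a Counter of next words."""
    table = defaultdict(Counter)
    for phrase in phrases:
        words = phrase.split()
        for i in range(len(words) - context_len):
            context = tuple(words[i:i + context_len])
            table[context][words[i + context_len]] += 1
    return table

def predict(table, phrase):
    """Score next words by their count relative to the count of the phrase."""
    followers = table.get(tuple(phrase.split()), Counter())
    total = sum(followers.values())
    return {word: count / total for word, count in followers.most_common()}

trigrams = build_table(corpus, context_len=2)
scores = predict(trigrams, "i love")
```

Here `scores` gives 2/3 for "you" and 1/3 for "movies", matching the example. A phrase that is not in the table yields an empty result, which is where backing off comes in.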

Our library accepts at most the last 4 words. When those 4 words are not in our library, the model “backs off” to a lower level, taking only the last 3 words into account, and so forth. In fact, words are scored by combining the scores for the different levels, applying a back-off factor of 0.4 for each level lowered. This is called the “stupid” back-off model, which gives surprisingly good results for larger datasets.
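A rough sketch of this scoring scheme is given below. It follows the usual formulation of stupid back-off, in which each candidate word keeps the score from the longest context where it was observed, discounted by 0.4 per level lowered. How exactly the application combines the levels is not spelled out here, so that rule is an assumption, as are the function names and the toy library.

```python
from collections import Counter, defaultdict

BACKOFF = 0.4    # back-off factor per level lowered, as stated in the text
MAX_CONTEXT = 4  # a 5-gram model looks at the last 4 words at most

def build_tables(sentences, max_context=MAX_CONTEXT):
    """tables[k] maps each k-word context to a Counter of next words."""
    tables = {k: defaultdict(Counter) for k in range(max_context + 1)}
    for sentence in sentences:
        words = sentence.split()
        for k in range(max_context + 1):
            for i in range(len(words) - k):
                tables[k][tuple(words[i:i + k])][words[i + k]] += 1
    return tables

def backoff_scores(tables, phrase, max_context=MAX_CONTEXT):
    """Rank candidate next words, backing off to shorter contexts."""
    words = phrase.split()[-max_context:]
    scores = {}
    for lowered in range(len(words) + 1):
        context = tuple(words[lowered:])  # drop the `lowered` oldest words
        followers = tables[len(context)].get(context)
        if not followers:
            continue
        total = sum(followers.values())
        for word, count in followers.items():
            # Assumption: keep the score from the longest matching context.
            scores.setdefault(word, (BACKOFF ** lowered) * count / total)
    return sorted(scores.items(), key=lambda item: -item[1])

tables = build_tables(["i love you", "i love movies", "i love you", "cats love milk"])
ranked = backoff_scores(tables, "i love")
unseen = backoff_scores(tables, "dogs love")
```

For "i love", the words "you" and "movies" keep their full scores, while "milk" is only found after backing off to the context "love" and is discounted once. For the unseen phrase "dogs love", every candidate comes from a lower level, so even the top guess carries the 0.4 factor.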

Performance

To test our prediction model, I measured how often it correctly predicted the last word of each 5-gram in a validation set of 20,000 unseen 5-grams.

  • 40.53% were predicted correctly within the top 10 guesses
  • 27.69% were predicted correctly within the top 3 guesses
  • 16.93% were predicted correctly with the top guess
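The accuracy figures above correspond to a top-k measurement, which could be computed along these lines. This is only a sketch: `top_k_accuracy`, the toy predictor and the three sample 5-grams are invented for illustration and have nothing to do with the actual validation set.

```python
def top_k_accuracy(predict, five_grams, ks=(1, 3, 10)):
    """Share of 5-grams whose last word is among the top-k predictions.

    predict(context_words) must return candidate words ordered best first.
    """
    hits = {k: 0 for k in ks}
    for gram in five_grams:
        *context, target = gram.split()
        ranked = predict(context)[:max(ks)]
        for k in ks:
            if target in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(five_grams) for k in ks}

def toy_predict(context):
    """Hypothetical stand-in for the real model: always the same guesses."""
    return ["you", "movies", "milk"]

validation = [
    "you know i love you",
    "said that i love movies",
    "we all really love pizza",
]
accuracy = top_k_accuracy(toy_predict, validation)
```

With this toy predictor, one of the three targets is the top guess and two are within the top 3, so the top-1, top-3 and top-10 accuracies come out at 1/3, 2/3 and 2/3.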

Response time was about 16 milliseconds of elapsed time (at the 25th, 50th and 75th percentiles) on my (old) PC.

The results in terms of accuracy are fairly good and generally above those found on the Coursera forums, whereas the responsiveness seems below par. However, the Shiny app on the web does seem to react nearly instantly.

The application

Enter some text, and the predicted next word will be displayed. Prediction happens in one of two modes, depending on the last character:

  • after a space, the next word is a new word
  • otherwise, the next word completes the word being written

A context of up to 4 preceding words will be taken into account.
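The two modes can be told apart by looking at the last character of the input, as in the sketch below. The helper names `split_input` and `filter_candidates` are hypothetical, and the application may handle this differently.

```python
def split_input(text, max_context=4):
    """Return (context_words, prefix) for the two prediction modes.

    prefix is "" when the text ends with a space (new-word mode);
    otherwise it is the partial word being typed (completion mode).
    """
    words = text.lower().split()
    if not text or text[-1].isspace():
        prefix = ""
    else:
        prefix = words.pop()  # the unfinished word is not part of the context
    return words[-max_context:], prefix

def filter_candidates(ranked_words, prefix):
    """Keep candidates that complete the partial word (all if prefix is empty)."""
    return [word for word in ranked_words if word.startswith(prefix)]

new_word_mode = split_input("i love ")
completion_mode = split_input("i love mo")
completions = filter_candidates(["you", "movies", "milk"], completion_mode[1])
```

In both calls the context is "i love"; only the second carries the prefix "mo", which narrows the candidates down to "movies".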

In both cases the next word is presented as a word cloud, with the size of the word indicating its probability.

Try it out here!