Next Word Prediction

    Coursera/Johns Hopkins University 
      In partnership with SwiftKey
       Data Science Specialization
            Capstone Project 

JP Van Steerteghem

March 12, 2018

Introduction

How does it work? - Building the language model

  • A large corpus of blog (200 MB), news (196 MB), and Twitter (159 MB) data is used to train the model.
  • The dataset was cleaned: punctuation, numbers, separators, English stopwords, and profanity were removed.
  • N-grams are used to build the predictive language model.
    • 1-grams, 2-grams, 3-grams, and 4-grams were extracted from 15% of the corpus using the “quanteda” package.
    • N-grams are stored in tables sorted by frequency.
  • A key element of the development effort was optimizing memory efficiency while maintaining model accuracy (see the R sketch after this list):
    • The model size was reduced by dropping n-grams that occur fewer than 5 times.
    • Intermediate objects were removed as soon as they were no longer needed, to free memory.
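
The extraction step might look like the following minimal R sketch. It is an illustration of the workflow, not the project's actual code; `sample_text` (the 15% text sample) and `profanity_list` are hypothetical names for inputs assumed to exist.

    library(quanteda)

    ## Tokenize and clean: drop punctuation, numbers, and separators,
    ## then remove English stopwords and profanity.
    toks <- tokens(sample_text,
                   remove_punct      = TRUE,
                   remove_numbers    = TRUE,
                   remove_separators = TRUE)
    toks <- tokens_remove(toks, pattern = stopwords("en"))
    toks <- tokens_remove(toks, pattern = profanity_list)

    ## Build one frequency table per n, sorted by frequency, and drop
    ## n-grams seen fewer than 5 times to shrink the model.
    ngram_freq <- function(toks, n) {
      ng  <- tokens_ngrams(toks, n = n, concatenator = " ")
      frq <- sort(colSums(dfm(ng)), decreasing = TRUE)
      frq[frq >= 5]
    }
    freq_tables <- lapply(1:4, function(n) ngram_freq(toks, n))

    rm(toks); gc()  # free memory once the tables are built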

How does it work? - Next word prediction

  • A word or phrase is entered in a text input box

  • Once the Predict button is clicked, the prediction algorithm looks for up to three “next word” options

  • A next-word back-off algorithm is used (sketched in R after this list)

    • Iterates from the longest n-gram (4-gram) down to the shortest (2-gram)
    • Predicts using the longest, most frequent matching n-gram
    • If no match is found, a “?” is returned
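
A minimal R sketch of this back-off lookup, assuming the frequency-sorted tables built earlier (`freq_tables`, a list of named numeric vectors whose names are space-separated n-grams), might look like:

    ## Return up to `max_results` next-word candidates for `phrase`,
    ## backing off from the 4-gram table down to the 2-gram table.
    predict_next <- function(phrase, freq_tables, max_results = 3) {
      words <- tolower(strsplit(trimws(phrase), "\\s+")[[1]])
      for (n in 4:2) {
        if (length(words) < n - 1) next
        prefix  <- paste(tail(words, n - 1), collapse = " ")
        tbl     <- freq_tables[[n]]
        matches <- tbl[startsWith(names(tbl), paste0(prefix, " "))]
        if (length(matches) > 0) {
          ## Tables are frequency-sorted, so the first matches are the
          ## best; keep only the final word of each matching n-gram.
          cands <- names(head(matches, max_results))
          return(vapply(strsplit(cands, " "), tail, character(1), n = 1))
        }
      }
      "?"  # no match in any table
    }

Because each table is already sorted by frequency, the first prefix matches are the most frequent ones, so the lookup never has to re-rank candidates.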

Resources and next steps

Next Steps:

  • Refine the model using alternative smoothing algorithms (the Kneser-Ney form is shown after this list):
    • Katz back-off
    • Kneser-Ney
  • Increase the size of the training data without jeopardizing usability
  • Move to a more powerful compute platform
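
For reference, interpolated Kneser-Ney, one of the candidate algorithms above, discounts each raw bigram count by a fixed d (commonly around 0.75) and redistributes that mass to a continuation probability that rewards words seen after many distinct contexts. In its bigram form:

    P_{\mathrm{KN}}(w_i \mid w_{i-1})
      = \frac{\max\!\left(c(w_{i-1} w_i) - d,\, 0\right)}{c(w_{i-1})}
        + \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i)

    \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}
        \left|\{\, w : c(w_{i-1} w) > 0 \,\}\right|

    P_{\mathrm{cont}}(w_i)
      = \frac{\left|\{\, w' : c(w' w_i) > 0 \,\}\right|}
             {\left|\{\, (w', w'') : c(w' w'') > 0 \,\}\right|}

Katz back-off also redistributes discounted mass, but falls back to the lower-order model only when the higher-order count is zero, rather than always interpolating.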

The following are excellent resources to help with next steps: