Word prediction app

Sigve Nakken
January 2016

Brief presentation of a predictive text model and a description of the accompanying Shiny application

Project background - predictive text modeling

  • highly relevant for development of word suggestions when typing on mobile devices
  • we have used a large training set of natural-language English text to fit a predictive text model
  • an interactive web application has been developed to demonstrate its utility
  • the word prediction application allows users to type in any given phrase and retrieve word suggestions

Model description - I

  • n-gram model: conditions the probability of the next word on the preceding n-1 words
  • a bigram model considers only the preceding word, a trigram model considers the two preceding words, etc.
  • the probability of the next word given its preceding context is estimated as the number of times the full sequence occurred in the training set divided by the number of times the preceding context occurred, e.g. p(“obama” | “president barack”) = N(“president barack obama”)/N(“president barack”)
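
The count-based estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the application's actual code: the function names (ngrams, mle_probability) and the toy corpus are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mle_probability(tokens, context, word):
    """Estimate p(word | context) as N(context + word) / N(context)."""
    context = tuple(context)
    n = len(context) + 1
    ngram_counts = Counter(ngrams(tokens, n))
    context_counts = Counter(ngrams(tokens, n - 1))
    if context_counts[context] == 0:
        return 0.0  # unseen context: no estimate is available
    return ngram_counts[context + (word,)] / context_counts[context]


# Toy corpus (hypothetical): "president barack" occurs three times,
# and is followed by "obama" in two of them.
corpus = ("president barack obama met president barack obama "
          "and president barack smith").split()
p = mle_probability(corpus, ["president", "barack"], "obama")
print(p)
```

An unseen context returns 0.0 here, which is exactly the situation that a back-off scheme is meant to handle.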

Model description - II

  • our model combines unigram, bigram, trigram, and quadgram models.
  • n-gram models with larger n are generally more accurate, but have the disadvantage that they need extensive training data
  • we have implemented the Katz back-off model, which switches to lower-order n-gram models when a higher-order context was not observed in the training data
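
The back-off idea can be illustrated with a simplified Python sketch. Note that this is not full Katz back-off: it omits the discounting and back-off weights that Katz's method uses to redistribute probability mass, and only shows the switch from the longest matching context down to unigrams. The names build_tables and suggest, and the toy corpus, are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_tables(tokens, max_n=4):
    """Map each context (tuple of 0..max_n-1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            tables[context][tokens[i + n - 1]] += 1
    return tables


def suggest(tables, phrase, k=3, max_n=4):
    """Return up to k next-word suggestions, backing off to shorter contexts."""
    words = phrase.lower().split()
    # Try the longest context first (max_n - 1 words), then shorter ones.
    for order in range(max_n - 1, -1, -1):
        if order > len(words):
            continue  # the phrase is too short for this context length
        context = tuple(words[len(words) - order:])
        if context in tables:
            return [w for w, _ in tables[context].most_common(k)]
    return []


# Toy corpus (hypothetical)
tokens = "the cat sat on the mat the cat sat down the cat ate".split()
tables = build_tables(tokens)
print(suggest(tables, "the cat", k=2))     # trigram context is found
print(suggest(tables, "purple cat", k=2))  # backs off to the bigram context
print(suggest(tables, "zebra", k=1))       # backs off to unigram counts
```

Ranking by raw counts within the matched context keeps the sketch short; a real back-off model would compare discounted probabilities instead.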

Model training

  • we have trained our n-gram models using a large set of text collections from news, Twitter and blogs.
  • all text used to fit the n-gram models was subject to extensive cleaning:
    • sentence splitting
    • removal of excessive whitespace
    • removal of numbers
    • removal of profanity words
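
The cleaning steps listed above could look roughly like the following Python sketch. It is an illustration under stated assumptions: the profanity list is a hypothetical placeholder, and the lowercasing and punctuation stripping are additions made here so that word matching works; the source does not say whether the original pipeline did either.

```python
import re

# Hypothetical placeholder list; a real pipeline would load a full word list.
PROFANITY = {"darn", "heck"}


def clean_text(text, profanity=PROFANITY):
    """Split text into sentences and return a cleaned string per sentence."""
    # Sentence splitting: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for sentence in sentences:
        sentence = sentence.lower()                    # assumption: case-folding
        sentence = re.sub(r"\d+", " ", sentence)       # removal of numbers
        sentence = re.sub(r"[^a-z' ]", " ", sentence)  # assumption: drop punctuation
        # split() collapses excessive whitespace; filter out profanity words.
        words = [w for w in sentence.split() if w not in profanity]
        if words:
            cleaned.append(" ".join(words))
    return cleaned


result = clean_text("I have 2   cats.  What the heck is that?")
print(result)
```

Each returned string is one sentence, so n-grams built from it do not span sentence boundaries.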

Web application

  • A Shiny web application has been built to demonstrate our predictive text model
  • Users can type text into the input field in the left panel; a table with a ranked list of suggestions then appears in the main panel
  • Visit http://sigven78.shinyapps.io/webapp/ and try it out!