A brief presentation of a predictive text model and its accompanying Shiny application
Project background - predictive text modeling
highly relevant for the development of word suggestions when typing on mobile devices
we fit a predictive text model for English using a large training data set of natural English text
an interactive web application has been developed to demonstrate its utility
the word prediction application allows users to type in any phrase and retrieve suggestions for the next word
Model description - I
n-gram model: conditions the probability of the next word on the preceding n-1 words
a bigram model considers only the preceding word, a trigram model the two preceding words, etc.
the conditional probability of the next word given its context is estimated as the number of times the full n-gram occurred divided by the number of times the context occurred in the training set, i.e. p(“obama” | “president barack”) = N(“president barack obama”)/N(“president barack”)
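to make the estimate concrete, the arithmetic can be written out in R; the counts below are made up for illustration and are not taken from our corpus:

    # p("obama" | "president barack") = N("president barack obama") / N("president barack")
    n_trigram <- 42   # hypothetical count of "president barack obama" in the corpus
    n_context <- 50   # hypothetical count of "president barack"
    n_trigram / n_context   # estimated probability: 0.84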
Model description - II
our model combines unigram, bigram, trigram, and quadgram models.
n-gram models with larger n are generally more accurate, but have the disadvantage of requiring more extensive training data
we have implemented the Katz back-off model, which backs off to lower-order n-gram models whenever a higher-order n-gram has not been observed
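the sketch below illustrates the back-off idea in R in simplified form: unlike full Katz back-off, it omits Good-Turing discounting and back-off weights and simply falls through to the highest-order table that has seen the context; the names (predict_next, models, unigrams) and all table contents are hypothetical:

    # models: list of lookup tables, highest order first; models[[1]] maps
    # 3-word contexts, models[[2]] 2-word contexts, models[[3]] 1-word contexts.
    # Each table maps a context string to a named vector of next-word probabilities.
    predict_next <- function(context_words, models, unigrams) {
      context_sizes <- rev(seq_along(models))                # e.g. 3, 2, 1
      for (i in seq_along(models)) {
        if (length(context_words) < context_sizes[i]) next   # input too short for this order
        ctx <- tail(context_words, context_sizes[i])
        probs <- models[[i]][[paste(ctx, collapse = " ")]]
        if (!is.null(probs)) return(sort(probs, decreasing = TRUE))
      }
      sort(unigrams, decreasing = TRUE)                      # last resort: most frequent words
    }

    # toy example with hypothetical probabilities
    models <- list(
      list("of the united" = c(states = 0.9, kingdom = 0.1)),
      list("the united"    = c(states = 0.8, kingdom = 0.2)),
      list("united"        = c(states = 0.7, kingdom = 0.3))
    )
    unigrams <- c(the = 0.05, to = 0.03, states = 0.001)
    predict_next(c("the", "united"), models, unigrams)   # backs off to the trigram table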
Model training
we have trained our n-gram models on a large collection of text from news articles, Twitter, and blogs
all text used to fit the n-gram models was subjected to extensive cleaning (a sketch of the pipeline follows the list):
sentence splitting
removal of excessive whitespace
removal of numbers
removal of profanity
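a minimal sketch of such a cleaning pipeline in base R follows; the profanity argument is a placeholder for whatever word list is used, and the actual preprocessing code may differ:

    # clean a character vector of raw text lines as described above
    clean_text <- function(lines, profanity = character(0)) {
      sentences <- unlist(strsplit(lines, "[.!?]+"))   # sentence splitting
      sentences <- gsub("[0-9]+", " ", sentences)      # removal of numbers
      sentences <- gsub("\\s+", " ", sentences)        # removal of excessive whitespace
      sentences <- trimws(sentences)
      # removal of profanity: drop offending tokens from each sentence
      vapply(sentences, function(s) {
        tokens <- strsplit(s, " ", fixed = TRUE)[[1]]
        paste(tokens[!tokens %in% profanity], collapse = " ")
      }, character(1), USE.NAMES = FALSE)
    }

    clean_text("He won 2 awards!  A   great day.")
    # [1] "He won awards" "A great day"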
Web application
A Shiny web application has been built to demonstrate our predictive text model
users can type text into the input field in the left panel; a ranked table of word suggestions then appears in the main panel
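a minimal sketch of how such an interface can be wired up in Shiny is shown below; predict_next and the model objects are assumed from the earlier sketches, and the real application's layout may differ:

    library(shiny)

    ui <- fluidPage(
      sidebarLayout(
        sidebarPanel(
          textInput("phrase", "Type a phrase:")   # input field in the left panel
        ),
        mainPanel(
          tableOutput("suggestions")              # ranked suggestion table
        )
      )
    )

    server <- function(input, output) {
      output$suggestions <- renderTable({
        req(input$phrase)
        words <- strsplit(trimws(input$phrase), "\\s+")[[1]]
        probs <- head(predict_next(words, models, unigrams), 10)
        data.frame(word = names(probs), probability = probs, row.names = NULL)
      })
    }

    shinyApp(ui, server)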