A brief presentation of a predictive text model and its accompanying Shiny application
Project background - predictive text modeling
highly relevant for the development of word suggestions when typing on mobile devices
we fit a predictive text model for English using a large training data set of natural English text
an interactive web application has been developed to demonstrate its utility
the word prediction application allows users to type in any phrase and retrieve suggestions for the next word
Model description - I
n-gram model: conditions the probability of the next word on the preceding n-1 words
a bigram model considers only the preceding word, a trigram model the two preceding words, etc.
the conditional probability of the next word given its context is estimated as the number of times the full n-gram occurred divided by the number of times the context occurred in the training set, i.e. p(“obama” | “president barack”) = N(“president barack obama”)/N(“president barack”)
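to make the estimate concrete, the arithmetic can be written out in R; the counts below are made up for illustration and are not taken from our corpus:

    # p("obama" | "president barack") = N("president barack obama") / N("president barack")
    n_trigram <- 42   # hypothetical count of "president barack obama" in the corpus
    n_context <- 50   # hypothetical count of "president barack"
    n_trigram / n_context   # estimated probability: 0.84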
Model description - II
our model combines unigram, bigram, trigram, and quadgram models.
n-gram models with larger n are generally more accurate, but have the disadvantage of requiring more extensive training data
we have implemented the Katz back-off model, which backs off to lower-order n-gram models whenever a higher-order n-gram has not been observed
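the sketch below illustrates the back-off idea in R in simplified form: unlike full Katz back-off, it omits Good-Turing discounting and back-off weights and simply falls through to the highest-order table that has seen the context; the names (predict_next, models, unigrams) and all table contents are hypothetical:

    # models: list of lookup tables, highest order first; models[[1]] maps
    # 3-word contexts, models[[2]] 2-word contexts, models[[3]] 1-word contexts.
    # Each table maps a context string to a named vector of next-word probabilities.
    predict_next <- function(context_words, models, unigrams) {
      context_sizes <- rev(seq_along(models))                # e.g. 3, 2, 1
      for (i in seq_along(models)) {
        if (length(context_words) < context_sizes[i]) next   # input too short for this order
        ctx <- tail(context_words, context_sizes[i])
        probs <- models[[i]][[paste(ctx, collapse = " ")]]
        if (!is.null(probs)) return(sort(probs, decreasing = TRUE))
      }
      sort(unigrams, decreasing = TRUE)                      # last resort: most frequent words
    }

    # toy example with hypothetical probabilities
    models <- list(
      list("of the united" = c(states = 0.9, kingdom = 0.1)),
      list("the united"    = c(states = 0.8, kingdom = 0.2)),
      list("united"        = c(states = 0.7, kingdom = 0.3))
    )
    unigrams <- c(the = 0.05, to = 0.03, states = 0.001)
    predict_next(c("the", "united"), models, unigrams)   # backs off to the trigram table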
Model training
we have trained our n-gram models on a large collection of text from news articles, Twitter, and blogs
all text used to fit the n-gram models was subjected to extensive cleaning (a sketch of the pipeline follows the list):
sentence splitting
removal of excessive whitespace
removal of numbers
removal of profanity
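a minimal sketch of such a cleaning pipeline in base R follows; the profanity argument is a placeholder for whatever word list is used, and the actual preprocessing code may differ:

    # clean a character vector of raw text lines as described above
    clean_text <- function(lines, profanity = character(0)) {
      sentences <- unlist(strsplit(lines, "[.!?]+"))   # sentence splitting
      sentences <- gsub("[0-9]+", " ", sentences)      # removal of numbers
      sentences <- gsub("\\s+", " ", sentences)        # removal of excessive whitespace
      sentences <- trimws(sentences)
      # removal of profanity: drop offending tokens from each sentence
      vapply(sentences, function(s) {
        tokens <- strsplit(s, " ", fixed = TRUE)[[1]]
        paste(tokens[!tokens %in% profanity], collapse = " ")
      }, character(1), USE.NAMES = FALSE)
    }

    clean_text("He won 2 awards!  A   great day.")
    # [1] "He won awards" "A great day"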
Web application
A Shiny web application has been built to demonstrate our predictive text model
users can type text into the input field in the left panel; a ranked table of word suggestions then appears in the main panel
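a minimal sketch of how such an interface can be wired up in Shiny is shown below; predict_next and the model objects are assumed from the earlier sketches, and the real application's layout may differ:

    library(shiny)

    ui <- fluidPage(
      sidebarLayout(
        sidebarPanel(
          textInput("phrase", "Type a phrase:")   # input field in the left panel
        ),
        mainPanel(
          tableOutput("suggestions")              # ranked suggestion table
        )
      )
    )

    server <- function(input, output) {
      output$suggestions <- renderTable({
        req(input$phrase)
        words <- strsplit(trimws(input$phrase), "\\s+")[[1]]
        probs <- head(predict_next(words, models, unigrams), 10)
        data.frame(word = names(probs), probability = probs, row.names = NULL)
      })
    }

    shinyApp(ui, server)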