Bernard NK
August 20, 2015
This project was carried out in association with SwiftKey, a company that develops predictive-text technology for easier typing on mobile devices. The next-word predictor was implemented in R as follows:
Get a corpus and identify appropriate tokens such as words, punctuation, and numbers.
Build a model from the corpus that captures the distribution of, and the relationships between, words, tokens, and phrases.
Use n-gram frequency as the predictor variable to determine the next word that a user is most likely to type.
Match the n-gram character string typed by the user with the appropriate (n+1)-gram entry in the n-gram frequency table; the final word of that entry is the predicted next word.
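The steps above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the code behind the application: the function names (tokenize, ngram_table, predict_next) and the toy corpus are hypothetical, only word tokens are kept, and no smoothing or fallback to shorter n-grams is attempted.

```r
# Split lower-cased text into word tokens (letters and apostrophes only).
# Assumption: sentence boundaries are ignored, so n-grams may span sentences.
tokenize <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens[nzchar(tokens)]
}

# Count every run of `size` consecutive tokens (the n-gram frequency table).
ngram_table <- function(tokens, size) {
  stopifnot(length(tokens) >= size)
  starts <- seq_len(length(tokens) - size + 1)
  grams <- vapply(starts, function(i) {
    paste(tokens[i:(i + size - 1)], collapse = " ")
  }, character(1))
  table(grams)
}

# Match the last n typed words against an (n+1)-gram table and return the
# final word of the most frequent matching entry, or NA if nothing matches.
predict_next <- function(phrase, freq, n) {
  words <- tokenize(phrase)
  if (length(words) < n) return(NA_character_)
  prefix <- paste(tail(words, n), collapse = " ")
  hits <- freq[startsWith(names(freq), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub(".* ", "", best)
}

# Toy corpus (hypothetical): "the cat sat" occurs twice, "the cat ran" once.
corpus <- "the cat sat on the mat. the cat sat on the sofa. the cat ran away."
freq <- ngram_table(tokenize(corpus), 3)   # trigram table, so n = 2
predict_next("I saw the cat", freq, 2)     # "sat"
```

On ties, which.max returns the first of the tied entries in the table's alphabetical order, so a real predictor would need an explicit tie-breaking rule and a fallback for unseen prefixes.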
How to use the predictive application:
The data come from a corpus called HC Corpora (www.corpora.heliohost.org), which consists of a large number of tweets, blog posts, and news publications. We used this corpus to identify appropriate tokens such as words, punctuation, and numbers. The same dataset is used in the R Shiny application.
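As an illustration of what identifying words, punctuation, and numbers can look like, here is a small base R sketch. The function name classify_tokens and the regular expression are assumptions made for this example, not the rules used in the application, and only ASCII punctuation is handled.

```r
# Split text into word, number, and punctuation tokens and label each one.
# Assumption: contractions such as "it's" count as a single word token.
classify_tokens <- function(text) {
  text <- tolower(text)
  pattern <- paste0(
    "[[:alpha:]]+(?:'[[:alpha:]]+)?",       # words, optionally with apostrophe
    "|[[:digit:]]+(?:\\.[[:digit:]]+)?",    # integers and simple decimals
    "|[[:punct:]]"                          # any single punctuation mark
  )
  tokens <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  type <- ifelse(grepl("^[[:digit:]]", tokens), "number",
          ifelse(grepl("^[[:alpha:]]", tokens), "word", "punctuation"))
  data.frame(token = tokens, type = type, stringsAsFactors = FALSE)
}

classify_tokens("It's 9.5 km, ok!")
```

Tweets and blog text often contain curly quotes, emoji, and URLs, which this pattern does not treat specially; a real pipeline would need extra cleaning rules for them.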
This application could be extended for other language processing predictions, including:
References: