Ivan Marchenko
Aug. 21, 2015
A simple application based on a Markov chain model of word sequences.
The raw data was cleaned in R: sentences containing profanity were removed; links, emails, and user names were stripped; the text was converted to lowercase; and all special characters and codes were dropped. The cleaned data was then tokenized into word sequences of n items, called n-grams.
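As a rough illustration of this kind of pipeline (a minimal base-R sketch; the actual cleaning scripts are more involved, and the function names here are only illustrative):

clean_text <- function(x) {
  x <- tolower(x)                            # lowercase
  x <- gsub("http\\S+|www\\.\\S+", " ", x)   # remove links
  x <- gsub("\\S+@\\S+", " ", x)             # remove emails
  x <- gsub("@\\w+", " ", x)                 # remove user names
  x <- gsub("[^a-z' ]", " ", x)              # remove special characters and codes
  gsub("\\s+", " ", trimws(x))               # collapse whitespace
}

# Tokenize a cleaned sentence into n-grams of size n:
ngrams <- function(sentence, n) {
  words <- strsplit(sentence, " ")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}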
This produced bi-, tri-, and quadgram frequency dictionaries. A Katz back-off model is used to find predictions in most cases. The longer the sequence, the fewer low-frequency words are included: unigram coverage goes from 85% in the quadgram table to 96% in the bigram table. The model offers up to 7,000 words as possible predictions.
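A simplified back-off lookup might look like the sketch below (illustrative only: it omits the Katz discounting used in the deployed model and assumes data.tables named bigrams, trigrams, and quadgrams with columns prefix, word, and freq):

library(data.table)

# Try the longest matching prefix first, then back off to shorter n-grams.
predict_next <- function(words, bigrams, trigrams, quadgrams, top = 5) {
  words <- tail(words, 3)
  for (n in seq(length(words), 1)) {
    prefix_str <- paste(tail(words, n), collapse = " ")
    tbl <- list(bigrams, trigrams, quadgrams)[[n]]   # table whose prefix has n words
    hits <- tbl[prefix == prefix_str][order(-freq)]
    if (nrow(hits) > 0) return(head(hits$word, top))
  }
  character(0)  # no match at any order
}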
The data.tables and R functions were compressed into a 13.8 MB ngrams.rdata file and uploaded to the Shiny server, so the app is small enough to work comfortably even on mobile devices.
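Packing everything into one compressed .rdata file can be done with a single save() call (object names below follow the earlier sketch and are assumptions, not the app's actual names):

# Save the frequency tables plus the prediction function, using xz compression
# to keep the upload to shinyapps.io small:
save(bigrams, trigrams, quadgrams, predict_next,
     file = "ngrams.rdata", compress = "xz")

# In the Shiny app, one load() at startup restores all objects:
load("ngrams.rdata")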

The next word prediction app is hosted on shinyapps.io: https://chemarch.shinyapps.io/PredictNextWordApp
The app needs some time to load the data; after that it responds instantly.
The top 5 predictions are displayed as phrases, with the most probable one marked as the "Predicted phrase".
Predictions are not perfect because the model uses only the context of the last two or three words and tends to predict the most common words.
To improve the model's results, I could try other smoothing algorithms, extend the 4-gram frequency matrix, and use semantics when preprocessing the data.