Coursera Data Science Capstone Project: Next word prediction

Ivan Marchenko
Aug. 21, 2015

A simple application based on a Markov-chain model of word sequences.

Overview

  • The Coursera Data Science Capstone Project, in partnership with SwiftKey, aims to build a language model that predicts the next word from an input sentence.
  • The task was divided into seven sub-tasks, including data cleaning, exploratory analysis, and the creation of a predictive model, using the statistics, programming, and data analysis skills acquired in previous courses of the specialisation, plus natural language processing techniques.
  • The text data used to build the model comes from a 583.1 MB corpus called HC Corpora and consists of blog, Twitter, and news snippets.
  • Data processing was done in R and Google BigQuery.

The Applied Methods & Models

  • All data was cleaned in R: sentences containing profanity were removed; links, emails, and user names were stripped; special characters and codes were deleted; and the text was converted to lowercase. The cleaned data was then tokenized into word sequences of n items called n-grams (see the sketch after this list).

  • Bigram, trigram, and quadgram frequency dictionaries were obtained, and Katz's back-off model is used to find predictions in most cases. The longer the sequence, the fewer low-frequency words are included: unigram coverage ranges from 85% in the quadgram dictionary to 96% in the bigram dictionary. The model offers up to 7,000 words as possible predictions.

  • The data.tables and R functions were compressed into a 13.8 MB ngrams.rdata file and uploaded to the Shiny server, so the app is small enough to run comfortably even on mobile devices.
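
A minimal sketch of the cleaning and tokenization step in R. The regular expressions, the clean_text() and ngrams() helpers, and the column names are illustrative assumptions, not the exact code behind the app:

    library(data.table)

    # Illustrative cleaning: strip links, e-mails, user names and special
    # characters, then lowercase (profanity filtering omitted for brevity).
    clean_text <- function(x) {
      x <- gsub("http\\S+|www\\.\\S+", " ", x)   # links
      x <- gsub("\\S+@\\S+", " ", x)             # e-mail addresses
      x <- gsub("@\\w+", " ", x)                 # user names
      x <- gsub("[^a-zA-Z' ]", " ", x)           # special characters and codes
      tolower(x)
    }

    # Tokenize a sentence into n-grams (word sequences of n items).
    ngrams <- function(sentence, n) {
      words <- strsplit(trimws(sentence), "\\s+")[[1]]
      if (length(words) < n) return(character(0))
      sapply(seq_len(length(words) - n + 1),
             function(i) paste(words[i:(i + n - 1)], collapse = " "))
    }

    # Example: build a quadgram frequency dictionary from a tiny corpus.
    corpus    <- clean_text(c("I went to the store", "I went to the park"))
    quads     <- unlist(lapply(corpus, ngrams, n = 4))
    quad_freq <- data.table(ngram = quads)[, .(freq = .N), by = ngram][order(-freq)]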

App workflow

  • As when building the model, the app first has to obtain, clean, and tokenize the input text before predicting.
  • The algorithm searches for the input pattern in the 4-gram frequency matrix and returns the top 5 most frequent predictions. If there is no match, it automatically backs off to the 3-grams, and if there is still no match it outputs a set of the most common words (a minimal sketch of this lookup follows this list).

  (Figure: model workflow)

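A minimal sketch of the back-off lookup, assuming quadgram and trigram data.tables keyed on the preceding words; the table names (quad_dt, tri_dt), their w1/w2/w3/prediction/freq columns, and the top_words fallback vector are illustrative assumptions:

    library(data.table)

    # Assumed tables: quad_dt maps "w1 w2 w3 -> prediction" with counts,
    # tri_dt maps "w1 w2 -> prediction"; top_words holds the most common words.
    setkey(quad_dt, w1, w2, w3)
    setkey(tri_dt,  w1, w2)

    predict_next <- function(words, top_words, k = 5) {
      n <- length(words)
      # 1. Try the 4-gram table with the last three words of the input.
      if (n >= 3) {
        hits <- quad_dt[.(words[n - 2], words[n - 1], words[n]), nomatch = 0L]
        if (nrow(hits) > 0) return(head(hits[order(-freq)]$prediction, k))
      }
      # 2. Back off to the 3-gram table with the last two words.
      if (n >= 2) {
        hits <- tri_dt[.(words[n - 1], words[n]), nomatch = 0L]
        if (nrow(hits) > 0) return(head(hits[order(-freq)]$prediction, k))
      }
      # 3. Still no match: fall back to the most common words.
      head(top_words, k)
    }

    # Example call on a cleaned, tokenized input:
    # predict_next(c("thanks", "for", "the"),
    #              top_words = c("the", "to", "and", "a", "of"))
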
Results Explanation

  • The next word prediction app is hosted on shinyapps.io: https://chemarch.shinyapps.io/PredictNextWordApp

  • The app needs some time to load the data, but afterwards it responds instantly.

  • The top 5 predictions are displayed as phrases, with the most likely one marked as the “Predicted phrase”.

  • Predictions are not perfect because the model uses only the context of the last 2 or 3 words and tends to predict the most common words.

  • To improve the model's results, I could try other smoothing algorithms, extend the 4-gram frequency matrix, and use semantics when preprocessing the data.