Coursera Data Science Capstone Project: Next word prediction

Ivan Marchenko
Aug. 21, 2015

A simple application based on a Markov-chain model of word sequences.

Overview

  • The Coursera Data Science Capstone Project, in partnership with SwiftKey, aims to build a language model that predicts the next word from an input sentence.
  • The task was divided into seven sub-tasks, including data cleaning, exploratory analysis, and the creation of a predictive model, using the statistics, programming, and data analysis skills acquired in previous courses of the specialisation, plus natural language processing techniques.
  • The text data used to build the model comes from a 583.1 MB corpus called HC Corpora and consists of blog, Twitter, and news snippets.
  • Data processing was done in R and Google BigQuery.

The Applied Methods & Models

  • All data was cleaned in R: sentences containing profanity were removed; links, emails, and user names were stripped; special characters and codes were deleted; and the text was converted to lowercase. The cleaned data was then tokenized into word sequences of n items called n-grams (see the sketch after this list).

  • Bigram, trigram, and quadgram frequency dictionaries were obtained, and Katz's back-off model is used to find predictions in most cases. The longer the sequence, the fewer low-frequency words are included: unigram coverage ranges from 85% in the quadgram dictionary to 96% in the bigram dictionary. The model offers up to 7,000 words as possible predictions.

  • The data.tables and R functions were compressed into a 13.8 MB ngrams.rdata file and uploaded to the Shiny server, so the app is small enough to run comfortably even on mobile devices.
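
A minimal sketch of the cleaning and tokenization step in R. The regular expressions, the clean_text() and ngrams() helpers, and the column names are illustrative assumptions, not the exact code behind the app:

    library(data.table)

    # Illustrative cleaning: strip links, e-mails, user names and special
    # characters, then lowercase (profanity filtering omitted for brevity).
    clean_text <- function(x) {
      x <- gsub("http\\S+|www\\.\\S+", " ", x)   # links
      x <- gsub("\\S+@\\S+", " ", x)             # e-mail addresses
      x <- gsub("@\\w+", " ", x)                 # user names
      x <- gsub("[^a-zA-Z' ]", " ", x)           # special characters and codes
      tolower(x)
    }

    # Tokenize a sentence into n-grams (word sequences of n items).
    ngrams <- function(sentence, n) {
      words <- strsplit(trimws(sentence), "\\s+")[[1]]
      if (length(words) < n) return(character(0))
      sapply(seq_len(length(words) - n + 1),
             function(i) paste(words[i:(i + n - 1)], collapse = " "))
    }

    # Example: build a quadgram frequency dictionary from a tiny corpus.
    corpus    <- clean_text(c("I went to the store", "I went to the park"))
    quads     <- unlist(lapply(corpus, ngrams, n = 4))
    quad_freq <- data.table(ngram = quads)[, .(freq = .N), by = ngram][order(-freq)]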

App workflow

  • As when building the model, the app first has to obtain, clean, and tokenize the input text before predicting.
  • The algorithm searches for the input pattern in the 4-gram frequency matrix and returns the top 5 most frequent predictions. If there is no match, it automatically backs off to the 3-grams, and if there is still no match it outputs a set of the most common words (a minimal sketch of this lookup follows this list).

  (Figure: model workflow)

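A minimal sketch of the back-off lookup, assuming quadgram and trigram data.tables keyed on the preceding words; the table names (quad_dt, tri_dt), their w1/w2/w3/prediction/freq columns, and the top_words fallback vector are illustrative assumptions:

    library(data.table)

    # Assumed tables: quad_dt maps "w1 w2 w3 -> prediction" with counts,
    # tri_dt maps "w1 w2 -> prediction"; top_words holds the most common words.
    setkey(quad_dt, w1, w2, w3)
    setkey(tri_dt,  w1, w2)

    predict_next <- function(words, top_words, k = 5) {
      n <- length(words)
      # 1. Try the 4-gram table with the last three words of the input.
      if (n >= 3) {
        hits <- quad_dt[.(words[n - 2], words[n - 1], words[n]), nomatch = 0L]
        if (nrow(hits) > 0) return(head(hits[order(-freq)]$prediction, k))
      }
      # 2. Back off to the 3-gram table with the last two words.
      if (n >= 2) {
        hits <- tri_dt[.(words[n - 1], words[n]), nomatch = 0L]
        if (nrow(hits) > 0) return(head(hits[order(-freq)]$prediction, k))
      }
      # 3. Still no match: fall back to the most common words.
      head(top_words, k)
    }

    # Example call on a cleaned, tokenized input:
    # predict_next(c("thanks", "for", "the"),
    #              top_words = c("the", "to", "and", "a", "of"))
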
Results Explanation

  • The next word prediction app is hosted on shinyapps.io: https://chemarch.shinyapps.io/PredictNextWordApp

  • The app needs some time to load the data, but afterwards it responds instantly.

  • The top 5 predictions are displayed as phrases, with the most likely one marked as the “Predicted phrase”.

  • Predictions are not perfect because the model uses only the context of the last 2 or 3 words and tends to predict the most common words.

  • To improve the model's results, I could try other smoothing algorithms, extend the 4-gram frequency matrix, and use semantics when preprocessing the data.