Coursera Data Science Capstone Project: Next Word Prediction

Sarajit Poddar
23rd August 2015

A simple application to predict the next word in a string. For more information about this project and the course click here.

Overview

  • The Coursera Data Science Capstone Project, in partnership with SwiftKey, aims to build a language model that can predict the next word from an input string.
  • The task was divided into six subtasks: (1) data loading, (2) data cleansing, (3) exploratory analysis, (4) generating n-grams from the corpus for use in the predictive model, (5) developing the predictive model, and (6) testing the predictive model.
  • The text data used to build the model comes from a corpus called HC Corpora, which consists of blog, Twitter and news snippets. The zipped data is downloaded from here.
  • The entire data processing, model development and testing are done in R.

Model Development

  • The data downloaded from HC Corpora is cleaned by (1) converting it to lowercase, (2) removing punctuation, (3) removing numbers, (4) removing stopwords and (5) removing extra whitespace.
  • The cleaned data is tokenized into word sequences of n items, called n-grams.
  • The aggregated unigram, bigram, trigram and quadgram term frequency matrices are converted into frequency dictionaries. The n-grams are saved as “ngram.RData”, which is loaded at run time when the prediction routine is loaded.
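
The cleaning and n-gram steps above can be sketched in base R as follows. This is only an illustrative sketch, not the code used in the project: the function names (clean_text, make_ngrams, ngram_freq) and the tiny stopword list are assumptions made for the example, and a real run would likely use a full stopword list and a text-mining package.

```r
# Hypothetical, deliberately tiny stopword list (illustration only).
stop_words <- c("the", "a", "an", "of", "to", "and", "in", "is")

# Apply the five cleaning steps to a character vector.
clean_text <- function(x, stopwords = stop_words) {
  x <- tolower(x)                                   # (1) lowercase
  x <- gsub("[[:punct:]]+", "", x)                  # (2) remove punctuation
  x <- gsub("[[:digit:]]+", "", x)                  # (3) remove numbers
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # (4) remove stopwords
  x <- gsub("[[:space:]]+", " ", x)                 # (5) collapse whitespace
  trimws(x)
}

# Split one cleaned string into its n-grams (word sequences of length n).
make_ngrams <- function(x, n) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over many lines and return a frequency table,
# most frequent first.
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

cleaned <- clean_text("The 2 dogs, and a cat!")
bigrams <- ngram_freq(c("two years ago", "two years later"), 2)
```

Tables built this way could then be written to disk with save() and restored with load(), which is how an “.RData” file such as the one mentioned above is normally handled in R.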

Prediction

  • For prediction, each n-gram is split into an (n-1)-gram and the next word. For instance, the trigram “two years ago” is split into the bigram “two years” and the next word “ago”. This is done for bigrams, trigrams and quadgrams.
  • The last words of the input are matched against the (n-1)-grams at every level, and the matching records are collected in a data frame along with their frequencies. The scores are then multiplied by the n-gram weights, so that higher-order n-grams have more impact on the outcome than lower-order ones.
  • The frequency is converted to a score using the formula ngramsubset$score <- nsubset$freq / sum(nsubset$freq) * ngramwt, that is, the relative frequency of each match within its n-gram level multiplied by the weight of that level.
  • The final data frame is sorted on frequencies and duplicates are summarized.
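
A minimal sketch of this lookup and scoring, in base R, is given below. It is not the project's actual routine: the function names (split_ngram, predict_next), the example tables and the weights are all made up for illustration, and summing the scores of duplicate words before sorting is one assumed reading of "duplicates are summarized".

```r
# Split each n-gram into its (n-1)-gram prefix and its final word.
split_ngram <- function(freq_table) {
  data.frame(prefix   = sub(" ?[^ ]+$", "", freq_table$ngram),
             nextword = sub("^.* ", "", freq_table$ngram),
             freq     = freq_table$freq,
             stringsAsFactors = FALSE)
}

# tables:  list of split n-gram tables, named by n ("2", "3", ...)
# weights: numeric vector with the same names (assumed values)
predict_next <- function(input, tables, weights, top = 3) {
  words <- strsplit(trimws(tolower(input)), " +")[[1]]
  hits <- list()
  for (n in names(tables)) {
    k <- as.integer(n) - 1                     # prefix length at this level
    if (length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    nsubset <- tables[[n]][tables[[n]]$prefix == prefix, ]
    if (nrow(nsubset) == 0) next
    # Relative frequency within this level, times the level weight.
    nsubset$score <- nsubset$freq / sum(nsubset$freq) * weights[[n]]
    hits[[n]] <- nsubset[, c("nextword", "score")]
  }
  if (length(hits) == 0) return(character(0))
  all_hits <- do.call(rbind, hits)
  # Assumption: the same word found at several levels has its scores summed.
  totals <- aggregate(score ~ nextword, data = all_hits, FUN = sum)
  totals <- totals[order(-totals$score), ]
  head(totals$nextword, top)
}

# Made-up frequency tables and weights, for illustration only.
bi  <- data.frame(ngram = c("years ago", "years old"),
                  freq = c(3L, 7L), stringsAsFactors = FALSE)
tri <- data.frame(ngram = c("two years ago", "two years later"),
                  freq = c(8L, 2L), stringsAsFactors = FALSE)
tables  <- list("2" = split_ngram(bi), "3" = split_ngram(tri))
weights <- c("2" = 1, "3" = 2)

result <- predict_next("about two years", tables, weights)
# result is c("ago", "old", "later")
```

In this toy example “ago” ranks first because it is supported at both levels (0.3 from the bigrams plus 1.6 from the trigrams), while “old” scores 0.7 and “later” 0.4.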

Using the Application

  • The application provides a text field for entering a text string, for which the next word will be predicted. A slider lets the user specify the number of words to predict. The predicted words are shown in the main panel on the right-hand side.
  • In addition, there is an n-gram distribution tab, which shows the high-frequency n-grams, and an About tab, which provides information about the app.
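
For readers unfamiliar with Shiny, a layout like the one described could be wired up roughly as follows. This is a hedged skeleton only, assuming the shiny package is installed: the input and output IDs, the tab contents and the predict_words function are hypothetical placeholders, not taken from the actual app.

```r
library(shiny)

# Hypothetical stand-in for the real prediction routine.
predict_words <- function(text, n) head(c("the", "and", "to", "of", "a"), n)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    sidebarPanel(
      # Text field for the input string.
      textInput("phrase", "Enter text:", value = ""),
      # Slider for the number of words to predict.
      sliderInput("n_words", "Number of predictions:",
                  min = 1, max = 5, value = 3)
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Prediction", textOutput("prediction")),
        tabPanel("N-gram distribution", p("Placeholder for n-gram plots.")),
        tabPanel("About", p("Placeholder for information about the app."))
      )
    )
  )
)

server <- function(input, output) {
  # Re-run the prediction whenever the text or the slider changes.
  output$prediction <- renderText({
    paste(predict_words(input$phrase, input$n_words), collapse = ", ")
  })
}

app <- shinyApp(ui = ui, server = server)
```

Calling runApp(app) would launch this skeleton locally in a browser.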

Further information

  • The Next Word Prediction Shiny app is hosted at https://sarajitpoddar.shinyapps.io/NextWordPredictor
  • The final report on the development of the Shiny app, along with all necessary source code, can be found at http://rpubs.com/sarajitpoddar/DS_Capstone_Final_Report
  • This application is not perfect and could be modeled more accurately, especially if the entire corpus were used (I used a sample) along with higher-order n-grams, but that would come at the cost of speed. With more programming skill the model could be optimized further, but for a non-programmer such as me, I think this is a decent job.