Capstone Project: Word Prediction App

X. SHEN
December 12, 2014

Overview

The text data comes from a corpus called HC Corpora. Three text files can be downloaded from the Coursera Data Science Capstone class website.

The three data files are:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt

After the data is processed, the app is built on Markov-chain language models (n-gram models). It predicts the most probable word to follow a sequence of words entered by the user.

Data Processing

  • Read the raw data from the text files
  • Create and transform the corpus: remove numbers and punctuation, strip extra white space, and convert letters to lower case
  • Clean up unwanted characters
  • Build 5-gram frequency matrices
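The processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the app: the function names `clean_line` and `ngram_counts` are hypothetical, the cleaning rule (keep only the letters a-z) is a simplification that also splits contractions such as "don't", and a plain counter stands in for the frequency matrices.

```python
import re
from collections import Counter

def clean_line(line):
    """Lowercase, drop digits and punctuation, and collapse white space."""
    line = line.lower()
    # Assumption: anything that is not a-z or white space counts as unwanted.
    line = re.sub(r"[^a-z\s]", " ", line)
    return re.sub(r"\s+", " ", line).strip()

def ngram_counts(lines, n):
    """Count every run of n consecutive words within each cleaned line."""
    counts = Counter()
    for line in lines:
        words = clean_line(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Tiny made-up sample standing in for lines read from the text files.
sample = ["I love 2 cats!", "I love dogs, too."]
bigrams = ngram_counts(sample, 2)
```

Calling `ngram_counts(lines, 5)` on the real corpus would give the 5-gram counts; lines shorter than five words simply contribute nothing.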

Please see the milestone report here for more details.

Model Building

The word-prediction app is built on n-gram models. An n-gram model is a type of probabilistic language model that predicts the next item in a sequence, in the form of an \( (n - 1) \)-order Markov model.

The probability of a word is conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.). The conditional probability can be calculated from n-gram frequency counts:

\( P(w_i|w_{i-(n-1)},\ldots,w_{i-1}) = \frac{count(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{count(w_{i-(n-1)},\ldots,w_{i-1})} \)
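As a minimal sketch of how this ratio turns into a prediction, the Python below divides an n-gram count by the count of its (n - 1)-word context and then picks the most frequent follower. The names `conditional_prob` and `predict`, and the toy counts, are assumptions made for illustration; the app's own implementation may differ, for example in how it handles contexts it has never seen.

```python
from collections import Counter

def conditional_prob(ngram_counts, context_counts, context, word):
    """P(word | context) = count(context + word) / count(context)."""
    context = tuple(context)
    denom = context_counts.get(context, 0)
    if denom == 0:
        return 0.0  # unseen context: no estimate available from the counts
    return ngram_counts.get(context + (word,), 0) / denom

def predict(ngram_counts, context):
    """Return the word most often seen after `context`, or None if unseen."""
    context = tuple(context)
    followers = {g[-1]: c for g, c in ngram_counts.items() if g[:-1] == context}
    if not followers:
        return None
    # All candidates share the same denominator, so the highest count
    # is also the highest conditional probability.
    return max(followers, key=followers.get)

# Toy counts (made up): trigram counts and the counts of their bigram contexts.
trigram_counts = Counter({("i", "love", "cats"): 3, ("i", "love", "dogs"): 1})
bigram_counts = Counter({("i", "love"): 4})

p_cats = conditional_prob(trigram_counts, bigram_counts, ("i", "love"), "cats")
best = predict(trigram_counts, ("i", "love"))
```

With these toy counts, "cats" follows "i love" in 3 of 4 cases, so it is the predicted word.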

To save computation time, only 5% of the data in the dataset is used in this application.
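How the 5% subset was drawn is not stated here. One plausible approach, shown purely as an assumption, is to keep each line independently with probability 0.05 using a fixed seed so the subset is reproducible. The function name `sample_lines` and the seed value are hypothetical.

```python
import random

def sample_lines(lines, fraction=0.05, seed=1234):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the same subset is drawn each run
    return [line for line in lines if rng.random() < fraction]

# Made-up stand-in for the lines of one of the text files.
all_lines = [f"line {i}" for i in range(1000)]
subset = sample_lines(all_lines)
```

Because each line is kept independently, the subset size is only approximately 5% of the input.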

Shiny App

The app can be accessed here. Here are the steps for using the app:

  • In the left-hand sidebar panel, enter the partial sentence you would like to complete
  • Press the submit button and wait for the next word to appear on the main panel