X. SHEN
December 12, 2014
The text data come from a corpus called HC Corpora. Three text data files can be downloaded from the Coursera Data Science Capstone class website.
The three data files are:
After the data are processed, an app is built using Markov-chain language models (n-gram models). The app predicts the most probable word to follow a sequence of words entered by the user.
Please see the milestone report here for more details.
The word-prediction app is built on n-gram models. An n-gram model is a probabilistic language model that predicts the next item in a sequence, in the form of a \( (n - 1) \)-order Markov model.
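The probability of a word is conditioned on a fixed number of previous words (one word in a bigram model, two words in a trigram model, and so on). The conditional probability can be estimated from n-gram frequency counts:
\[ P(w_n \mid w_1, \ldots, w_{n-1}) = \frac{C(w_1, \ldots, w_n)}{C(w_1, \ldots, w_{n-1})} \]
where \( C(\cdot) \) is the number of times the word sequence occurs in the corpus.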
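As a minimal sketch of this count-based estimate, the Python snippet below tallies which words follow each \( (n - 1) \)-word context and returns the most frequent follower with its relative frequency. It is not the app's actual code: the function names `build_ngram_counts` and `predict_next` and the toy sentence are assumptions made for illustration, and the sketch applies no smoothing or back-off.

```python
from collections import Counter, defaultdict


def build_ngram_counts(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def predict_next(counts, context):
    """Return (word, probability) for the most frequent follower of `context`.

    Returns None when the context never appeared in the training tokens.
    """
    followers = counts.get(tuple(context))
    if not followers:
        return None
    word, freq = followers.most_common(1)[0]
    # Relative frequency: count of context + word over count of context.
    return word, freq / sum(followers.values())


# Toy corpus (hypothetical); real input would be the tokenized text files.
tokens = "the cat sat on the mat and the cat ran".split()
bigrams = build_ngram_counts(tokens, 2)
print(predict_next(bigrams, ["the"]))
```

In the toy sentence, "the" is followed by "cat" twice and "mat" once, so the bigram model picks "cat" with an estimated probability of 2/3. A context that never occurs returns `None`; a practical predictor would typically fall back to a shorter context in that case.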
To save computation time, the application uses only 5% of the data in the dataset.
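The source does not say how the 5% subset was chosen. One plausible approach, sketched here in Python under the assumption that lines are drawn uniformly at random, is to sample a fixed fraction of lines with a seeded generator so the subset is reproducible. The function name `sample_lines`, the seed value, and the placeholder corpus are all hypothetical.

```python
import random


def sample_lines(lines, fraction=0.05, seed=2014):
    """Return a reproducible random subset holding about `fraction` of the lines."""
    # Keep at least one line for non-empty input, never more than are available.
    k = min(len(lines), max(1, round(len(lines) * fraction)))
    rng = random.Random(seed)  # fixed seed (assumed) so reruns give the same subset
    return rng.sample(lines, k)


# Placeholder corpus; real input would be the lines read from the text files.
corpus = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus)
print(len(subset))
```

Sampling whole lines rather than a contiguous block keeps the subset spread across the file, at the cost of losing any ordering between neighboring lines.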