Data Science Capstone Project

A. Stefan
April 24, 2016

The purpose of this project is to develop a prediction algorithm based on a data set from a corpus called HC Corpora (www.corpora.heliohost.org).
The model must (1) accept a phrase as its input and (2) return a prediction for the most likely next word.
This application makes use of principles of Natural Language Processing and Text Mining.

The model was built using 1% of the original data set, because the available RAM limited the size of the sample set.
N-grams were constructed, with n = 1, 2, 3. The small sample size did not justify the construction of n-grams of higher order.
The entries (word combinations) in each of the n-gram tables were assigned probabilities ($ w_i $ = i-th word).
Trigram \[ P(w_{i}|w_{i-2}w_{i-1}) = \frac{count(w_{i-2}w_{i-1}w_i)}{count(w_{i-2}w_{i-1})} \]
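
The trigram probability above can be illustrated with a minimal Python sketch. It is not the project's actual code: the whitespace tokenizer, the toy sentence, and the helper names (`tokenize`, `ngram_counts`, `trigram_probability`, `predict_next`) are all assumptions made for illustration. It counts bigrams and trigrams, applies the count ratio from the formula, and picks the most frequent continuation of the last two words.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real pipeline
    # would also strip punctuation, numbers, and other noise.
    return text.lower().split()


def ngram_counts(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def trigram_probability(w1, w2, w3, bigrams, trigrams):
    # P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
    context = bigrams[(w1, w2)]
    if context == 0:
        return 0.0  # unseen context: no estimate without smoothing or back-off
    return trigrams[(w1, w2, w3)] / context


def predict_next(phrase, bigrams, trigrams):
    # Use the last two words of the phrase as the trigram context.
    words = tokenize(phrase)
    if len(words) < 2:
        return None
    w1, w2 = words[-2], words[-1]
    candidates = {t[2]: c for t, c in trigrams.items() if t[:2] == (w1, w2)}
    if not candidates:
        return None
    # The highest count is also the highest probability, since the
    # denominator count(w1 w2) is the same for every candidate.
    return max(candidates, key=candidates.get)


# Toy corpus (hypothetical), standing in for the sampled text.
tokens = tokenize("i like green tea and i like green apples and i like green tea")
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

p_tea = trigram_probability("like", "green", "tea", bigrams, trigrams)
prediction = predict_next("I like green", bigrams, trigrams)
```

In this toy corpus "like green" occurs three times and is followed by "tea" twice, so the estimate for "tea" is 2/3 and it is the predicted next word. A context that never appears in the sample yields no prediction here, which is where a fall-back to lower-order n-grams would come in.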