Wesley Engers
4/24/2016
Three sets of documents were used
a) Twitter, Blogs, News
The tm and quanteda packages were used to parse and clean up the data
The data was tokenized and ngrams were built
a) 1-grams, 2-grams, 3-grams, and 4-grams were constructed
b) The last word of each ngram was used as the “Predicted Word”
c) Only 5% of the documents were used, to reduce computation time
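The ngram construction above could be sketched roughly as follows. This is a minimal Python illustration, not the tm/quanteda code used in the project: the whitespace tokenizer, the function names (`tokenize`, `ngram_counts`, `sample_docs`), the random-sampling strategy, and the toy documents are all assumptions made for the example.

```python
import random
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for the real cleanup.
    return text.lower().split()


def ngram_counts(docs, n):
    """Count n-grams, keyed as (context words, predicted word)."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            context = tuple(gram[:-1])   # first n-1 words
            predicted = gram[-1]         # last word is the "Predicted Word"
            counts[(context, predicted)] += 1
    return counts


def sample_docs(docs, fraction=0.05, seed=0):
    """Keep a random fraction of the documents (hypothetical sampling scheme)."""
    rng = random.Random(seed)
    k = max(1, round(len(docs) * fraction))
    return rng.sample(docs, k)


# Toy documents, invented for illustration only.
docs = ["the cat sat on the mat", "the cat ate the fish"]
bigrams = ngram_counts(docs, 2)
unigrams = ngram_counts(docs, 1)
```

For a 1-gram the context is empty, so the table reduces to plain word frequencies.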
The prediction algorithm is based on a weighted average of the probabilities for each of the ngrams
a) (1*1gram+2*2gram+3*3gram+4*4gram)/10
b) The 1-gram term simply reflects the most common words in the dataset
The Predicted Word with the highest probability is returned to the user as the best guess
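One way to read the weighted average is sketched below in Python. It is an illustration under stated assumptions, not the project's implementation: the names `build_tables` and `predict` are hypothetical, the toy documents are invented, each ngram probability is taken as a relative frequency given the preceding words, and an ngram order with no match is assumed to contribute zero while the sum is still divided by 10.

```python
from collections import Counter, defaultdict

# Weights from the formula (1*1gram + 2*2gram + 3*3gram + 4*4gram) / 10.
WEIGHTS = {1: 1, 2: 2, 3: 3, 4: 4}


def build_tables(docs, max_n=4):
    """tables[n][context][word] = count of `word` following `context`."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for doc in docs:
        tokens = doc.lower().split()  # assumption: simple whitespace tokens
        for n in tables:
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                tables[n][tuple(gram[:-1])][gram[-1]] += 1
    return tables


def predict(tables, text):
    """Return (best word, weighted score), or None if nothing matches."""
    tokens = text.lower().split()
    scores = Counter()
    for n, weight in WEIGHTS.items():
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        if len(context) != n - 1:
            continue  # input too short for this ngram order
        followers = tables[n].get(context)
        if not followers:
            continue  # assumption: unseen context contributes zero
        total = sum(followers.values())
        for word, count in followers.items():
            scores[word] += weight * count / total
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(WEIGHTS.values())


# Toy documents, invented for illustration only.
tables = build_tables(["the cat sat on the mat", "the cat ate the fish"])
guess = predict(tables, "on the")
```

With no input text, only the 1-gram table applies, so the guess falls back to the most common word in the data.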