Scott Heffron
The text prediction application predicts the next word in an incomplete sentence. Text pulled from Twitter, blogs, and news feeds was used to construct the dataset for the model. The statistical model used for this project is an n-gram model.
The model used thousands of rows of data taken from each of the three datasets, which were combined into one large dataset. The combined dataset was turned into a corpus using a tokenization function. Punctuation, numeric characters, stop words, and swear words were removed. The data was then ready for the n-gram model.
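The cleaning steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function name clean_and_tokenize and the tiny STOP_WORDS and BLOCKED_WORDS sets are hypothetical stand-ins for whatever tokenizer and word lists the project used.

```python
import re

# Hypothetical, deliberately tiny word lists; a real pipeline would load
# full stop-word and profanity lists from a file or library.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}
BLOCKED_WORDS = {"darn"}


def clean_and_tokenize(text):
    """Lowercase text, drop punctuation and digits, and filter unwanted words."""
    text = text.lower()
    # Replace every character that is not a letter or whitespace with a space,
    # which removes punctuation and numeric characters in one pass.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [
        token
        for token in text.split()
        if token not in STOP_WORDS and token not in BLOCKED_WORDS
    ]


tokens = clean_and_tokenize("The 3 cats, darn it, ran to the park!")
```

Here tokens holds only the words that survive all four filters. Note that this simple regular expression also splits contractions such as "don't" at the apostrophe, which a production tokenizer would likely handle differently.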
The model was implemented with n-grams where n ranges from 1 through 5. It predicts the next word of a sentence from the words that precede it. Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability. With a larger n, the model can store more context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently.
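One way to picture next-word prediction with n-grams of order 1 through 5 is the sketch below. It counts which word follows each context of zero to four preceding words, then predicts from the longest context it has seen. The fallback to shorter contexts is an assumption made for illustration, since the summary does not say how the five orders are combined, and the names build_ngram_counts and predict_next are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_counts(tokens, max_n=5):
    """Count next-word frequencies for every context of 0..max_n-1 words."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for k in range(max_n):  # k = number of preceding words used as context
            if i - k < 0:
                break
            context = tuple(tokens[i - k:i])
            counts[context][word] += 1
    return counts


def predict_next(counts, words, max_n=5):
    """Return the most frequent follower of the longest known context.

    Assumption: when the longest context was never seen, fall back to a
    shorter one, ending with plain unigram frequency (empty context).
    """
    longest = min(max_n - 1, len(words))
    for k in range(longest, -1, -1):
        context = tuple(words[len(words) - k:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None  # the model was built from no data


corpus = "i like green tea and i like green tea but i like black coffee".split()
model = build_ngram_counts(corpus)
suggestion = predict_next(model, ["like", "green"])
```

In this toy corpus, "tea" follows "like green" both times it occurs, so it is the suggestion. Each additional order adds another set of contexts to the table, which is the space cost paid for the extra context.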