Presentation - Predict Next Word

Scott Heffron

Project

  • Data Science Capstone Project
  • Predict the Next Word

Predict Next Word Application

Application

The text prediction application predicts the next word in an incomplete sentence. Text pulled from Twitter, blogs, and news feeds was used to construct the dataset for the model. The statistical model used for this project is an n-gram model.

Preparing the data

The model used thousands of rows of data taken from each of the three data sets. The data was combined into one large dataset, which was turned into a corpus using a tokenization function. Punctuation, numeric characters, stop words, and swear words were removed. The data was then ready for the n-gram model.
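
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the project's actual code. The function names (clean_tokens, build_corpus) and the tiny word lists are hypothetical placeholders, and the real stop-word and swear-word lists are assumed to be much larger.

```python
import re

# Hypothetical, deliberately tiny word lists (assumptions for illustration only).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}
BLOCKED_WORDS = {"darn"}  # placeholder standing in for a swear-word list


def clean_tokens(line):
    """Lowercase a line, strip punctuation and digits, and drop filtered words."""
    text = line.lower().replace("'", "")   # keep contractions as one token
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation, digits, non a-z characters
    return [t for t in text.split()
            if t not in STOP_WORDS and t not in BLOCKED_WORDS]


def build_corpus(*sources):
    """Combine several lists of lines into one list of token lists."""
    corpus = []
    for lines in sources:
        for line in lines:
            tokens = clean_tokens(line)
            if tokens:  # skip lines that are empty after cleaning
                corpus.append(tokens)
    return corpus
```

Calling build_corpus with one list of lines per source (for example one each for the Twitter, blog, and news samples) yields a single combined list of token lists that an n-gram counter can consume.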

The Model

The model was implemented as an n-gram model with N ranging from 1 through 5, and it uses these n-grams to predict the next word of a sentence. Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability. With a larger N, the model can store more context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently.
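
One way such a model could look is sketched below in Python. It counts n-grams for N = 1 through 5 and predicts the most frequent continuation of the longest matching context. The fallback to shorter contexts when a longer one has not been seen is an assumption of this sketch, since the slides do not say how the five n-gram orders are combined. The names train and predict_next are hypothetical.

```python
from collections import Counter, defaultdict

MAX_N = 5  # n-grams for N = 1 through 5, as described in the text


def train(sentences, max_n=MAX_N):
    """Map each context (0 to max_n - 1 preceding words) to counts of the next word."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for k in range(max_n):  # k is the context length
                if i - k < 0:
                    break
                table[tuple(tokens[i - k:i])][word] += 1
    return table


def predict_next(table, words, max_n=MAX_N):
    """Return the most frequent next word, trying the longest context first.

    Falling back to shorter contexts is an assumption of this sketch.
    """
    for k in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - k:])
        if context in table:
            return table[context].most_common(1)[0][0]
    return None  # the table is empty


# Usage with a toy corpus of already-cleaned token lists
corpus = [
    ["drink", "hot", "green", "tea"],
    ["drink", "hot", "green", "tea"],
    ["drink", "hot", "green", "salsa"],
    ["buy", "green", "tea"],
]
model = train(corpus)
suggestion = predict_next(model, ["drink", "hot", "green"])
```

The table grows with both corpus size and N, which is the space side of the tradeoff mentioned above: longer contexts give more specific predictions but require storing many more distinct n-grams.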

The Application