Data Science Capstone Text Prediction

Kristen Wedel
9-12-2017

The Data

The data used in this project comes from the website: https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html.

Files include:

  • Newspapers
  • Blogs
  • Twitter Updates

Data Cleaning and Sampling

A total of 75,000 samples were drawn from the combined data set (newspapers, blogs and Twitter updates).
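
The sampling step can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code. It assumes that a "sample" is one line of text and that lines are drawn at random without replacement; the function name sample_lines and the seed value are hypothetical.

```python
import random

def sample_lines(lines, k, seed=42):
    """Draw k lines at random without replacement.

    Assumption: one sample = one line of text. The seed is only
    there to make the draw reproducible.
    """
    rng = random.Random(seed)
    if len(lines) <= k:
        # Fewer lines than requested: keep everything.
        return list(lines)
    return rng.sample(lines, k)

# Toy stand-in for the combined newspaper, blog and Twitter lines.
combined = [f"line {i}" for i in range(1000)]
subset = sample_lines(combined, 75)
```

With the real data, k would be 75,000 and combined would hold the lines of all three files.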

Cleaning the data consisted of:

  • Converting to lower case
  • Removing numbers
  • Removing profanity
  • Replacing contractions
  • Removing punctuation
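
The cleaning steps above might look something like the following Python sketch. The profanity list and contraction map are small hypothetical stand-ins, since the actual lists used in the project are not given here. Contractions are replaced before punctuation is removed so that the apostrophes are still present when they are matched.

```python
import re

# Hypothetical stand-ins: the real word lists are not specified.
PROFANITY = {"darn", "heck"}
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

# Matches any listed word as a whole word.
PROFANITY_RE = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in sorted(PROFANITY)) + r")\b"
)

def clean_text(text):
    text = text.lower()                    # convert to lower case
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = PROFANITY_RE.sub(" ", text)     # remove profanity
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)   # replace contractions
    # Remove punctuation: keep only ASCII letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())          # collapse extra whitespace

print(clean_text("I can't believe it's 2017, DARN!"))
# prints: i cannot believe it is
```

Note that this sketch only handles straight apostrophes and ASCII letters; real Twitter and blog text would likely need more care.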

Creating the Model

The cleaned data was then converted to term-document matrices for 1-gram, 2-gram, 3-gram, 4-gram and 5-gram models. Adding a 6-gram model decreased speed.
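
As a rough illustration of what the n-gram tables contain, the Python sketch below counts n-grams of order 1 through 5. It uses plain counters in place of term-document matrices, which is a simplification, and the function names ngram_counts and build_tables are hypothetical.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every run of n consecutive words in the cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def build_tables(sentences, max_n=5):
    """One count table per n-gram order, from 1 up to max_n."""
    return {n: ngram_counts(sentences, n) for n in range(1, max_n + 1)}

tables = build_tables(["the cat sat on the mat", "the cat ran"])
print(tables[2][("the", "cat")])  # prints: 2
```

Each extra order adds another table to build and search, which is consistent with the slowdown seen when a sixth order was added.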

A backoff method was then used to predict the next word. The model first looks for a match in the 5-gram model and, if none is found, falls back in turn to the 4-gram, 3-gram, 2-gram and finally 1-gram models.
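
The backoff idea can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the project's implementation: it returns the single most frequent continuation at the highest order that has a match, and it does not apply any discounting or weighting between orders. The names build_model and predict_next are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=5):
    """For each order n, map a context of n-1 words to counts of the next word."""
    model = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, nxt = tokens[i:i + n]
                model[n][tuple(context)][nxt] += 1
    return model

def predict_next(model, text, max_n=5):
    """Try the highest order first, then back off to shorter contexts."""
    tokens = text.lower().split()
    for n in range(max_n, 0, -1):
        if n - 1 > len(tokens):
            continue  # not enough words typed for this order
        context = tuple(tokens[len(tokens) - (n - 1):])
        followers = model[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # empty model: nothing to predict

model = build_model([
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat sat on the mat again",
    "a dog ran",
])
print(predict_next(model, "the cat sat on the"))  # prints: mat
print(predict_next(model, "zebra on the"))        # prints: mat
```

In the second call no 4-gram context matches "zebra on the", so the sketch backs off to the 3-gram context "on the" and still finds a prediction.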

[Figure: ngrams]

The Application

The application is located at: https://kristywedel1.shinyapps.io/TextPred/

[Figure: application]

Please note: The application may take a minute to load. Future enhancements will focus primarily on improving the application's response time.