Data Science Capstone Project

Michelle Jaeger
8-23-2015

The App

  • A text prediction application made with Shiny, a web application framework for R
  • Predicts the most likely next word after the user enters a phrase or word into the text box
  • Located at: http://michellej.shinyapps.io/shiny_app

To use the app:

  • Enter a word or phrase in the text box

The Data

Three files were provided, containing text from Twitter, blogs, and news articles. Each file contained about 35 million words. The prediction model was created using about 10% of this data.

Data Cleansing and Tokenization

  • The data was cleansed by removing profanity, punctuation (though apostrophes were kept), and numbers. All words were converted to lowercase.

  • The data was then split into individual words, a process referred to as “tokenization” in the field of natural language processing, and grouped into consecutive sequences of two, three, and four words. These groupings are generally called n-grams.

  • A data table was then created for each n-gram length, with columns for the beginning of the phrase, the possible next word, and the frequency of occurrence.
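
The cleansing and n-gram counting steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual R code: the function names `clean` and `ngram_table`, the tiny `PROFANITY` set, and the sample sentence are all hypothetical, and the sketch assumes plain English letters and lets n-grams run across sentence boundaries for simplicity.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the app's actual profanity list is not given.
PROFANITY = {"darn", "heck"}

def clean(text):
    """Lowercase the text, drop numbers and punctuation (keeping apostrophes),
    and remove profanity. Returns a list of word tokens."""
    text = text.lower()
    # Replace everything except a-z, apostrophes, and whitespace with a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    return [w for w in text.split() if w not in PROFANITY]

def ngram_table(tokens, n):
    """Count each (beginning phrase, next word) pair over every run of
    n consecutive tokens. The phrase is the first n-1 words of the run."""
    table = Counter()
    for i in range(len(tokens) - n + 1):
        phrase = " ".join(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        table[(phrase, next_word)] += 1
    return table

tokens = clean("He didn't think so. He didn't think 2 hard, darn it!")
trigrams = ngram_table(tokens, 3)
print(trigrams[("he didn't", "think")])  # 2
```

Each key of the resulting counter corresponds to one row of the data table: the beginning phrase, the possible next word, and (as the value) the frequency.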

Lastly: The Prediction Model

  • The n-gram data tables described above were reduced by keeping, for each beginning phrase, only the row whose “next word” has the highest frequency count. Phrase/next-word combinations with a frequency count of one were also removed, resulting in an efficient, lightweight model.
  • The model chooses a prediction by looking up the next word for the entered phrase in the data table.
  • When the user enters a phrase, the tables are checked in descending order of n-gram length to find a prediction. For example, if no next-word prediction for “he didn't think” is found in the four-gram table, the trigram table is checked for a next-word prediction for “didn't think”, and so on.
  • If no prediction is found in any table, the model returns “and” as the prediction.
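
The pruning and lookup steps above might look something like the sketch below. It is written in Python for illustration and is not the app's R implementation: `prune`, `predict`, and the toy counts are hypothetical, ties between equally frequent next words are broken by whichever is seen first (the original does not say how ties are handled), and the input is only lowercased and split rather than fully cleansed.

```python
from collections import Counter

def prune(table):
    """Drop (phrase, next word) pairs seen only once, then keep the single
    most frequent next word for each beginning phrase."""
    best = {}
    for (phrase, next_word), count in table.items():
        if count < 2:
            continue  # frequency count of one: removed
        if phrase not in best or count > best[phrase][1]:
            best[phrase] = (next_word, count)
    return {phrase: word for phrase, (word, _) in best.items()}

def predict(text, models, default="and"):
    """Look up the next word, trying the longest n-gram table first.
    `models` maps n-gram length (e.g. 4, 3, 2) to a pruned lookup dict."""
    words = text.lower().split()
    for n in sorted(models, reverse=True):
        k = n - 1  # number of preceding words used by an n-gram table
        if len(words) < k:
            continue
        phrase = " ".join(words[-k:])
        if phrase in models[n]:
            return models[n][phrase]
    return default  # no table had a match

# Toy counts, invented for this example.
four = Counter({("he didn't think", "so"): 1})
three = Counter({("didn't think", "so"): 3, ("didn't think", "it"): 2})
two = Counter({("think", "about"): 5, ("think", "so"): 4})
models = {4: prune(four), 3: prune(three), 2: prune(two)}

print(predict("he didn't think", models))  # so  (found in the trigram table)
print(predict("I think", models))          # about
print(predict("xyz", models))              # and  (fallback)
```

Because each pruned table holds at most one next word per phrase, a prediction is a handful of dictionary lookups, which is consistent with the lightweight design described above.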