Michelle Jaeger
8-23-2015
How the prediction model was built:
Three files were provided, containing data from Twitter, blogs, and news articles. Each file contained about 35 million words. The prediction model was built using a sample of about 10% of this data.
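A minimal Python sketch of how such a sampling step might look, keeping roughly 10% of the lines from each file. The file names and the fixed random seed are assumptions for illustration; the write-up does not specify them.

    import random

    random.seed(42)  # fixed seed so the sample is reproducible

    # File names are assumed for illustration.
    files = ["twitter.txt", "blogs.txt", "news.txt"]

    sample = []
    for path in files:
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                if random.random() < 0.10:  # keep roughly 10% of the lines
                    sample.append(line.strip())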
The data was cleaned by removing profanity, numbers, and punctuation (apostrophes were kept), and all words were converted to lowercase.
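A sketch of one way this cleaning could be done in Python. The profanity set passed in is a stand-in for a real word list, which the write-up does not name.

    import re

    def clean(text, profanity):
        """Lowercase the text, keep letters and apostrophes, and drop
        numbers, other punctuation, and any word on the profanity list."""
        text = text.lower()
        text = re.sub(r"[^a-z' ]+", " ", text)  # digits and punctuation become spaces
        return [w for w in text.split() if w not in profanity]

    # Example (the profanity set here is a placeholder):
    clean("Call me at 555-0199, OK?!", {"heck"})  # -> ['call', 'me', 'at', 'ok']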
The data was then grouped into sequences of two, three, and four consecutive words. This process is referred to as “tokenization” in the field of natural language processing, and the groupings are generally called n-grams.
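As an illustration, a short Python sketch of the n-gram step, applied to one cleaned line (the function and variable names are illustrative):

    def ngrams(words, n):
        """Yield each run of n consecutive words as a tuple."""
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

    tokens = ['the', 'quick', 'brown', 'fox', 'jumps']
    list(ngrams(tokens, 2))
    # -> [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]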
A data table was then created for each n-gram length, with columns for the beginning of the phrase, the possible next word, and the frequency of occurrence.
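A minimal Python sketch of how one such table could be built for a single n-gram length, assuming the cleaned lines are already lists of words (all names here are illustrative, not the app's actual code):

    from collections import Counter

    def ngram_table(token_lists, n):
        """Count (beginning of phrase, next word) pairs for one n-gram length."""
        counts = Counter()
        for words in token_lists:
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                prefix = ' '.join(gram[:-1])      # beginning of the phrase
                counts[(prefix, gram[-1])] += 1   # possible next word
        return counts

    ngram_table([['to', 'be', 'or', 'not', 'to', 'be']], 3)
    # -> Counter({('to be', 'or'): 1, ('be or', 'not'): 1,
    #             ('or not', 'to'): 1, ('not to', 'be'): 1})

Sorting such a table by frequency gives, for any phrase beginning, the most likely next words, which is the lookup the prediction step needs.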